Introduction to Deep Learning, Keras, and TensorFlow

Captions
Hi everybody, glad you could make it. I'll be your tour guide for the next hour or so through some of the concepts of deep learning. Before getting into that, just with a show of hands: how many people are new to deep learning? OK, I guess you're in the right place, and the others are experts, so they'll be correcting me as I go. I'm going to skip some of the slides, but they'll be online on SlideShare, and this is being recorded. I'll also skip over some of the use cases (which are not mundane) and some of the history, because I want to get to the concepts. Once you understand the concepts, the APIs are just the code that implements those concepts. If you've tried it the other way around and been successful, you've been luckier than me, because I did try it that way and got nowhere.

With that in mind, here's a quick overview of the things we're going to talk about. These concepts come together in clusters, so it's not exactly sequential; it's more the kind of stuff we'll be going over. This slide is a little out of date, and I don't remember where I got it, but you don't have to learn all of these things to do deep learning; we're going to concentrate on the red dot. The data science part is a little off too: you can actually do deep learning from the data science side, even with RStudio. There's an interface there for Keras and TensorFlow that looks very similar, and in brief there's a bridge class that delegates the work to Python, which is actually very clever. I believe the author of RStudio wrote that interface with the author of Keras. One thing that's missing from the slide is reinforcement learning. If you've heard of those systems that play a million games against themselves, like AlphaGo, that's reinforcement learning, so it would probably be a good thing to have on there. Just so you know, the original one was AlphaGo; then there was AlphaGo Zero, and then AlphaZero. It's interesting because AlphaGo had some human collaboration, whereas AlphaZero was purely software, no human interaction: the system learned how to play Go, then played the original AlphaGo, and I think it won 75 out of 100 matches. The time it took from starting to finishing that training? Any guesses? Some of you might know: four hours. It just crushes the competition. It's kind of exhilarating, in a way; I don't know where it's going to go.

Here's one thing I put in because last year was the first time Gartner listed deep learning separately on the hype cycle. I was a little surprised, because deep learning has been the driving force in AI for the last five or six years, so I was glad to see it; machine learning is a little farther to the right, so I guess it's farther along toward wherever it's going. We'll see what they do this year; I think the report usually comes out in October. These people put together the first AI conference in the summer of 1956 at Dartmouth. In case you're wondering, John McCarthy just happened to be the inventor of Lisp, and Claude Shannon happened to be the inventor of information theory.
Shannon is sometimes called the Da Vinci of the 20th century; I almost said Godfather, but that's Geoffrey Hinton, and we'll get to him later. There's also Marvin Minsky, one of the giants over at MIT. What I really like about that conference is that they thought they would get it all done by the end of the summer. That's optimism for you. So, starting in the 50s, you could pencil in, if you will, traditional AI, based on expert systems that were very popular in the 80s (and are still useful), versus machine learning and deep learning, which are about lots of data, lots of inexpensive computing power, algorithms, and, for deep learning, deep neural networks. That's it for the history; you can read more about it online if you want.

This is the main thing we're going to look at, so let's spend a few minutes on it. You might be wondering what this diagram is; it's labeled, of course: there's the input layer, there's an output layer, and there are these hidden layers. The idea is to come up with a set of numbers for the edges so that you get a neural network that models the data well, whatever that means. Once you're done, you freeze the model and test it with your test data, and if the percentage accuracy is roughly the same, you've got a good model. Of course, the devil's in the details. The question you're probably wondering about is how you figure out how many layers to have. That's not obvious; it's actually an example of something called a hyperparameter, which is something you set before the training process. The number of hidden layers: hyperparameter. The number of nodes in each layer: hyperparameter. The initial values for the weights: hyperparameter. There are lots more of them, and we'll get to some. To keep it simple, let's just assume the initial weights are random numbers from a normal N(0,1) distribution; it's not that important for our purposes tonight.

What you do with these frameworks is provide input data, in the form of a numeric vector. In between two layers you see all those edges; you represent those weights with a matrix. So when you have a new set of input numbers, you multiply by a matrix, then the next one, and the next one, until you get to the end. Let's pretend there's only one node at the end, because the example I'm going to use (we'll get into more detail later) is something like housing. You may have seen Excel spreadsheets where you have rows of data and many attributes, or features, for a house: square feet, number of bedrooms, all that stuff. I think I saw one spreadsheet with 30 features. You pass in the feature values for each row, and when you get to the end you get a number, and you compare that with the actual cost of the house that's in that row of the spreadsheet. They're going to be different. Going from left to right is forward propagation. What we need to do, or what the frameworks do for us, is somehow go in the other direction - called backprop, or backward error propagation - in such a way that we modify the weights to make it better, because we want to minimize the error. We'll see more of this too. What happens is we have a cost function based on the parameters of the network; this is all done for us, and what it does is find a way to get toward whatever the minimum is for that curve, or that surface. If it's in multiple dimensions, we can't see it.
It uses gradient descent. If you haven't heard of that before, it's essentially partial derivatives: it computes partial derivatives using the chain rule, multiplies the numbers together, and it involves something called a learning rate, which is also a hyperparameter and controls the rate at which you move. So what happens, basically, is you compute this number and then update the weights; they could increase, they could decrease, the change could be zero. Then you do that with the next layer back, until you get all the way to the beginning, and you've modified the network. Then you do it with the next row. For example, something like MNIST has 60,000 rows for training, and each time you go through the whole set of rows, that's called an epoch. It's actually quite common to go through the data set 20 times, which means you've gone forward and back 1.2 million times, so you'd expect it to have produced something of value when you're done. And when you're done, you test with the test data, which is about 10,000 rows, and if the difference in percentage accuracy is significant, it's probably a case of overfitting, which is not unique to deep learning; it happens with machine learning and other systems too. Has anyone not heard of overfitting? OK, I don't have to explain it. Good. (There's a minimal sketch of this forward-and-back loop just below.)

With that in mind, here's what I'm going to do. There are basically three large categories of algorithms. One type is clustering: k-means, k-nearest neighbors in case you've heard of that, and there's also something called mean shift, an alternative where you don't have to specify the number of clusters. We're not going to do that tonight. We are going to look at classifiers, which try to figure out which object or thing you have at the end of this whole process, from a list of things you already know. For example, you might have dog, cat, fish, bird; you've got some images and you want to figure out what's in them - that's a classifier. There are also ones with only two outputs: true/false, spam/not spam, will the stock price go up or down - that's binary. And in the case of MNIST you have ten different digits, so the classifier tries to figure out which of those ten digits it is. The other category is called regression, and those are the ones with basically continuous values: instead of asking whether the stock price will go up or down, what will the stock price be, what will the temperature be, barometric pressure, heart rate, that sort of stuff.
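Before the simple example, here is a minimal sketch of that forward-and-back training loop - entirely my own toy illustration, with made-up sizes, data, and learning rate, not anything shown in the talk: forward propagation through two weight matrices, a mean-squared-error cost, gradients via the chain rule, and learning-rate-scaled weight updates repeated over epochs.

```python
import numpy as np

# Toy regression network: 3 input features -> 4 hidden units -> 1 output.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 "rows" of 3 features each
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0       # synthetic "house prices"

W1 = rng.normal(size=(3, 4))                   # initial weights ~ N(0, 1)
W2 = rng.normal(size=(4, 1))
lr = 0.01                                      # learning rate (a hyperparameter)

for epoch in range(20):                        # 20 epochs = 20 passes over the data
    # forward propagation: input vector times each weight matrix in turn
    h = np.tanh(X @ W1)                        # hidden layer (tanh activation)
    pred = h @ W2                              # single output node
    err = pred - y.reshape(-1, 1)
    cost = np.mean(err ** 2)                   # mean squared error

    # backward propagation: chain rule gives the gradient for each weight matrix
    grad_W2 = h.T @ err * (2 / len(X))
    grad_h = err @ W2.T * (1 - h ** 2)         # tanh'(x) = 1 - tanh(x)^2
    grad_W1 = X.T @ grad_h * (2 / len(X))

    # update the weights: some go up, some go down, scaled by the learning rate
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
    print(epoch, round(cost, 4))               # cost should trend downward
```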
So with that in mind, let's take a look at a very simple example. You may not have seen this before, and you won't have to remember it, because you'll be doing this kind of stuff with frameworks, but it gives you an idea of what's happening; I think you'll be impressed later with what the frameworks do for you. Here it is - pardon my lame artwork. We've got red dots in the upper half of the plane and blue dots in the lower half, so the dividing line is y = 0. Notice there's a unit vector (everyone's familiar with vectors? OK, good) pointing in the direction of the data we want, because the task is to determine whether a random point is a red dot or not. How do we convert the diagram into a network? There it is, again by manual artwork. You notice the value (0, 1). Where does that come from? 0 for x, 1 for y; that's going to be the weight. What we do, and what the systems do, is take the inner product of whatever values of x and y are supplied - a point in the plane - with (0, 1). What does that give us? x times 0 is always 0, plus y times 1, so we're comparing y with 0, which is the threshold value. When will the neuron fire? When y is greater than or equal to 0, which is exactly the region we wanted. Pretty simple, straightforward, trivial. OK, now let's do it four times.

Take a look at this one. At the bottom, that's still the same thing, and now we've got B, C, and D: four lines. They're actually going to be half-planes, and their intersection is going to be the square. Each one has an inward-pointing normal vector, perpendicular to its line. If we just take those numbers, we'll still have x and y as input, but now we'll have four nodes, four neurons, each one corresponding to a line. Everyone with me? OK. Here's what it looks like. Remember, A was (0, 1); I just moved that up to the top, and then B, C, and D are the numbers on the left side. Don't worry about the numbers in the middle, the threshold values, for the moment, but we've got 1s coming out. When all four threshold values are met or exceeded, each node emits a 1, each of the four is multiplied by 1, and the sum total is 4; then we know that the point is inside that rectangle - actually the square. Does that make sense? The only part that might be a little tricky is those threshold values, and I'll just tell you the rule for the first quadrant (it's a little different for the other ones): for horizontal lines you take the negative of the y-intercept, and for vertical lines the negative of the x-intercept. Going counterclockwise from A, the intercepts are 0, 1, 1, and 0, so the negatives are 0, -1, -1, 0. Looking back at the network, you see from top to bottom 0, -1, -1, 0. Let's test this. Take the origin; we're including the boundary, the perimeter, as red dots. If we supply x and y of 0 and 0, anything times 0 is 0, so the column of numbers going into the middle nodes is all zeros. Do they all equal or exceed the threshold values? Yes, so each of the four nodes emits a 1, we get 4, and it's a red dot. Let's try one more: (1, 1). When we put in 1 and 1 for x and y, the inner product is just the sum of the two weights from x and y into a particular node, so we get 1, -1, -1, and 1, and in all four cases that number equals or exceeds the threshold value, so 1s are emitted and we get a 4. It works. Everyone convinced? You can try more values if you want, but I know it works, so I'll leave it at that.
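As a small sketch of that four-neuron square test in code - the inward normals and the 0, -1, -1, 0 thresholds follow the description above, but the ordering of the four edges and the test points are my own:

```python
import numpy as np

# Inward-pointing unit normals for the four sides of the unit square
# (bottom, right, top, left), one row per hidden "neuron".
W = np.array([[ 0,  1],    # A: bottom edge y = 0
              [-1,  0],    # B: right edge x = 1
              [ 0, -1],    # C: top edge y = 1
              [ 1,  0]])   # D: left edge x = 0

# Thresholds: negative of the y-intercept (horizontal lines) or
# the x-intercept (vertical lines), as described above.
thresholds = np.array([0, -1, -1, 0])

def inside_square(x, y):
    """Return True if (x, y) lies in the closed unit square."""
    activations = W @ np.array([x, y])      # one inner product per edge
    fired = (activations >= thresholds)     # step-function neurons emit 1 or 0
    return fired.sum() == 4                 # the final node needs all four 1s

print(inside_square(0, 0))      # True  (origin, on the boundary)
print(inside_square(1, 1))      # True  (opposite corner)
print(inside_square(0.5, 0.5))  # True  (interior)
print(inside_square(2, 0.5))    # False (outside)
```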
Before we go further with this one: what if we had a triangle? Then we'd have three nodes, three 1s, and a 3. What about a pentagon? Five nodes, five 1s, and a 5; now you have to figure out the threshold values and the weights, because each weight has to be the perpendicular vector pointing inward into the shape - a little bit of work if you want to do it. What if we had an n-gon? Then it's n nodes, n 1s, and an n. What if we had two rectangles, one where the current one is and one up in, say, the upper-right corner someplace? Well, it turns out that vectors are invariant under translation, so those four vectors will work for the other rectangle too. That means we replicate those four nodes - the weights are the same, the 1s are the same - and there's another node with a 4, but you have to get the threshold values based on what I was saying before. Makes sense? Sort of.

What if you have a really weird shape that isn't a polygon? One thing you could do is take the left and right extremes, take the difference, divide it into a partition of a hundred segments, and construct line segments for the top and the bottom; now you have a polygon with 200 sides, which we've already talked about. What if someone says, "I have a polygon that's not convex"? No problem, because every closed polygon in the plane can be decomposed into a set of closed convex polygons, if you need that. So basically we're done; you've now sort of completed exercise set number one. This is also kind of interesting, because the logical operators are basically the corners of the square, and the one at the bottom is very interesting: XOR. It turns out that Marvin Minsky, from that original gang, and Seymour Papert - the guy who invented Logo; I always forget his name - wrote a paper in the late 60s proving that XOR is not linearly separable with one layer, and that may have been part of the reason for the AI winter. Remember, Marvin was part of that original group, so it wasn't like he had an axe to grind, but I can imagine the conversation: really interesting theoretical stuff, but kind of useless, because it can't even do this. Fortunately things have changed - we've got algorithms and a lot of other things - otherwise we wouldn't be here tonight. If you want an exercise that I haven't done myself but that might be worth doing manually for practice, try this one; it's a confidence builder, doing it in 3D. I have yet to do it too, but I'll get around to it. Instead of an inward-pointing vector for a line, you'll have an inward-pointing vector for a plane, and there are six planes, four vertices at the top and four at the bottom, so you have those to deal with, plus the intercepts and all that other stuff.

That's what frameworks will do for you, so you don't have to sit down and do it manually. Imagine a polygon with 10,000 sides - think of the work you would have to do. Not even Dustin Hoffman in Rain Man could do this kind of stuff a thousand times a day, so these frameworks really spare us a tremendous amount of work. Now, a couple of things to keep in mind: this network doesn't learn. There's no backpropagation, no cost function, nothing, and it's because of that all-or-nothing activation. Remember, I kept saying it's either going to emit a 1, and if it doesn't fire, it's a 0; zero-or-one is binary. The interesting thing is we need something else: the output looks like a flat segment at 0 and a flat segment at 1 (or the other way around), and if we connect those two pieces smoothly, approximating them, what shape do we get? Kind of an S shape. What function does that bring to mind? Sigmoid. What's interesting is that the sigmoid function gives us intermediate values instead of all or nothing, and that's what you need when your network is going to learn, because you're going to be tweaking those numbers - and when I say you, I mean you or the framework. Because of those continuous values, it's really like an analog device; maybe a little counterintuitive, but that's what we need.
So let's take a look at something that's kind of the opposite. Instead of separating things into one group and another, we're going to take a cluster of numbers and try to fit something to it - not separate, but approximate. Linear regression has been around for about 200 years; I think Carl Gauss started it. This is the simple case - we're not looking at the ones where it could be quadratic or cubic - just a nice little cluster of numbers. And this is not curve fitting; it has nothing to do with that. The ideal line might intersect all of the points, most, some, or none. What we want is a line that is the least far away from the points, based on the vertical distance of those points from the line. You take the difference between the y-coordinate of each point and the line, square it (so there are no negative values and no cancellation), add them up, and divide by the number of points. What does that give you? A quadratic function. If you look at this, it kind of looks like the best-fitting line, because if you move it up or down you're changing the value of b, and if you rotate it you're increasing or decreasing m. Those two variables are independent, so they form a plane, and whatever combination of m and b you pick produces an error value. Obviously the error is not going to be 0 - otherwise all the points would have to be on the line - so the error for anything other than the optimal line is bigger than 0, and it's quadratic. I'll spare you the suspense: that's what it looks like, a convex surface, so it has a global minimum. You don't have to worry about saddle points (that'll come up later), or local minima and maxima, that kind of stuff. The point at the bottom gives you the values of m and b for the best-fitting line. Imagine two perpendicular planes, parallel to the axes, intersecting the surface: you get two parabolas, and they intersect at that point. Why look at the parabolas? Because you can take the partial derivative - everybody remember derivatives, the slope of the tangent to a curve? It will be zero at that point in both the m and the b directions, so you take the partials with respect to m and b, set them equal to zero, and there's a closed-form solution. Basically done. But let's pretend we didn't know that, and we had values of m and b that put us somewhere on that surface. How do we get from wherever that (m, b) puts us down to the minimum? By something called gradient descent. Imagine a tiny little sphere at that point and you release it: what path does it take? Whichever way descends fastest. Think of yourself on the side of a hill in a mountain range, wanting to go downward: which way do you go? Where it's steepest. It's a greedy algorithm - that's really all there is to it in essence. Of course, in practice there are things that come up, so just make a mental note of this point, because it'll come back a little bit later.
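Here is a minimal sketch of that idea - gradient descent on the mean-squared-error surface for a line y = m*x + b - with made-up data, learning rate, and step count of my own, rather than the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)                # e.g. square feet (scaled down)
y = 3.0 * x + 7.0 + rng.normal(0, 2, size=50)  # noisy "prices"

m, b = 0.0, 0.0
lr = 0.01                                      # learning rate

for step in range(2000):
    pred = m * x + b
    err = pred - y
    cost = np.mean(err ** 2)                   # the convex bowl-shaped surface in (m, b)
    # partial derivatives of the cost with respect to m and b
    dm = 2 * np.mean(err * x)
    db = 2 * np.mean(err)
    m -= lr * dm                               # step "downhill" along the gradient
    b -= lr * db

print(m, b)   # should land near the slope 3 and intercept 7 used to make the data
```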
As an example, I mentioned real estate. Say the horizontal axis is the number of square feet and the vertical axis is the cost of a house - a very coarse-grained approximation. What we really want is something with more features; I came up with 6, and as I mentioned, there are data sets with 30 of them. Those are the numbers in the spreadsheet: for each row, the feature values for a house, and the rightmost number is the actual cost. Remember, you feed in all the values for a given row, they go through this network, and then you compare the result with the cost in the spreadsheet; we touched on that before. The equations just show how it generalizes: instead of y = mx + b, we have x1 through xn, and b is the bias, the intercept.

Going back: when we take those numbers all the way through to the end, there's one thing I conveniently neglected to tell you - this is a linear system. By analogy, if you take 2 times 3 times 4 times 5, what's that? 120. If you're writing a program, are you going to put 2*3*4*5 everywhere you need it, or just use 120? Obviously the latter. The analogy I'm making is that when we take that first matrix, we could immediately multiply it by the next matrix, and the next, all the way down, and produce one matrix, which collapses the whole system into a single step between input and output. We want to prevent that from happening, so we need to introduce non-linearity, and that's done by activation functions. One is sigmoid, another is tanh, another is ReLU; there are others - ELU, the exponential one, and ReLU6, which cuts off at six and is specific to TensorFlow, though all these systems have the standard ones. So what happens is you have the vector coming in, multiplied by the first matrix; then each value of that new vector is passed through the activation function to get a new vector; then you multiply by the next matrix. Does that make sense? By analogy: driving on an empty highway, you can go at a constant speed; in a parking lot with speed bumps, the flow is interrupted, you slow down, you speed up, you can't go straight through - or tollbooths, whatever analogy helps. We can't just go straight through to the end; that's what the activation function prevents, and it also keeps the computations working with smaller numbers so we can adjust the weights. It's all about the weights - that's what counts - and there's no a priori way of knowing which weights are best; that's why you build these systems and experiment with the number of layers and so forth.

When we get to the end, which we're assuming is just one node, we have a cost function, like the mean squared error we saw, and then you take the partial derivatives. It's called the gradient, because the derivative - the slope of a tangent - only applies in two dimensions; in multiple dimensions you have different axes, a partial derivative for each axis, and together they form a vector of values; that's the direction you move. When you see diagrams online of descending to the minimum, the path zigzags; that's why - there's rarely a straight line down unless the numbers just work out that way. So now we've got the ingredients: a cost function; a gradient descent method (there are five or six of them); and the learning rate, another hyperparameter. You must have those three things to do backpropagation. We also need an activation function that prevents the collapse, so it has to be nonlinear. That's the minimum.
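Here is a short sketch of that collapse argument - my own illustration, with made-up sizes: two linear layers with no activation reduce to one matrix, the same way 2*3*4*5 reduces to 120, and inserting a nonlinearity (the "speed bump") stops that.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=3)
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))

two_linear_layers = (x @ W1) @ W2
one_collapsed_layer = x @ (W1 @ W2)            # exactly the same thing
print(np.allclose(two_linear_layers, one_collapsed_layer))   # True

# Put an activation between the layers and the collapse no longer happens:
relu = lambda v: np.maximum(0, v)
with_activation = relu(x @ W1) @ W2
print(np.allclose(with_activation, one_collapsed_layer))     # False (in general)
```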
If you have at least two hidden layers, that's deep learning; if you have at least ten hidden layers, that's called very deep learning. Seriously - I thought the cutoff was going to be something like 500, and it's such a small number. By the way, the state of the art with neural networks in 2011 was a six-hidden-layer network; then things kind of blew up. There was the competition in 2012 - I think that was AlexNet - then networks with 150 hidden layers, then Microsoft with a thousand layers, and now these massive networks. I have no idea how they come up with them, how they test them, how long it takes. Jeff Dean, the head of Google Brain - usually "legendary" is the word that precedes his name, and he really is wicked smart - I was at a conference last year (it's coming up again next weekend, I think), and he mentioned you probably want to avoid training neural networks that take more than three or four days. So that's another factor, and then you get to TPUs, the tensor processing units, with TensorFlow, and on and on. But this is the basic fundamental idea: forward propagation, backpropagation, go through an epoch, multiple epochs, shuffle the data around, this, that, and the other, and get the best number you can. How do you come up with that? Go to Kaggle, go to GitHub, borrow what other people have done, start with one layer, experiment. Work with Python if you prefer, or Java, or Scala, and you can use Keras - I recommend Keras, it's a lot more intuitive, as you'll see in a few minutes - and then when you really want the horsepower you can use TensorFlow, or PyTorch, which is another one that's popular with some people. These are just the equations; I'm skipping ahead.

OK, so, Euler's number - does anyone remember Euler's constant, or who doesn't remember, I should say? When you studied math there was log, base 10, and then there was ln, base e; the number is 2.718-something. What's interesting is that e^x is the only nonzero differentiable function that equals its own derivative, and it has a lot of applications in a lot of systems. There's sigmoid: if you multiply numerator and denominator by e^x, it's e^x divided by (e^x + 1), so you can see the denominator is just a little bit bigger, which means it's between 0 and 1 and monotonically increasing. You'll hear the term "squashing": you can take any set of numbers, pass them through, and they'll look like probabilities, because they'll be between 0 and 1. The softmax function is similar, except that the numbers that come out will also be between 0 and 1 but will sum to 1, and that's important, especially for CNNs, which we'll see later. Here's tanh. And this one is the darling of the day - of the year, I guess - ReLU: very simple to compute; there's a point at 0 where it's continuous but not differentiable, but not to worry, it all works. This slide is not completely correct - I spotted it a while ago and have an update - but the softmax is essentially this: instead of x1 over (x1 + ... + xn), and x2 over the same sum, you raise everything so each term is e to that power. If you drop the exponentials, it's just the proportional weight of each number within the set; of course x1 + ... + xn could be 0, but that won't happen once each term is an exponential, because e to any power is positive, so the sum can't be zero.
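As a quick sketch of these functions in plain NumPy - my own illustration rather than the slide - matching the e^x / (e^x + 1) sigmoid and the "exponentiate, then normalize" softmax just described:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes any number into (0, 1)

def tanh(x):
    return np.tanh(x)                      # NumPy already provides this one; range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # max(0, x); cheap to compute

def softmax(x):
    e = np.exp(x - np.max(x))              # subtract the max for numerical stability
    return e / e.sum()                     # each output in (0, 1), and they sum to 1

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))                     # roughly [0.66, 0.24, 0.10] -- like probabilities
```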
Just real quick, look at this in Python. Look at the middle one: what's the tanh activation function? It just calls tanh - very nice and convenient. The first one is 1 over (1 + e to the negative power); that's essentially the sigmoid we just saw. And ReLU is the max of 0 and the dot product. Kind of simple and straightforward, and there are other ones as well; you can check them out online. As I mentioned, ReLU is the popular one now.

What about the cost functions? We saw the simplest one before. Here's another one that has a saddle point, because in one direction it's a minimum and in the other it's a maximum, and there are techniques for getting away from those. Remember, in three dimensions it's easy to see; in a hundred dimensions you obviously can't draw it. There's something called momentum, and there's Nesterov momentum, which is built into TensorFlow, and you can specify a value. The way I think of it - the way it first made sense to me - is you're in an airplane, there's turbulence, you're wondering how long you can take this before you vomit, and then the pilot switches on that extra power and you get out of there and relax. That's kind of what momentum does to get you out of a flat spot. But the thing is, how do you know it's a saddle point and not really the global minimum? So you compare: you give it momentum, you move out, but wait - the cost is increasing when it should be decreasing, so you go back. It's kind of a game, and there are five or six of these optimizers, each better than the one before: Adam, RMSprop, AdaGrad - I forget all the names, but they're all built in. Another hyperparameter. And here's another cost function, cross-entropy, which is not really intuitive, but it's a measure of the extent to which two probability distributions differ. I know that doesn't make a lot of sense yet, but it works, and you can treat it as a black box until it eventually starts making sense; that's kind of how I did it. There's going to be a lot of that - plugging things in left and right and trying them - so it feels like seat-of-the-pants programming. Deep learning is about heuristics: you try something, it does or doesn't work, you get an idea, somebody says "why would you ever do that?", you try it, it works really well, you run it against the big standard data sets and get better performance than they do, you write a paper and put it on arXiv, and everybody says, great, we've got a new technique. That's basically how it works, so there's not a lot of documentation, and those papers can be difficult to read. For selecting a cost function, there are some general rules: mean squared error for regression, binary cross-entropy or categorical cross-entropy for classification. That may not make a whole lot of sense right now, but there are guidelines.

Now, something else I wanted to tell you about: the data. Generally you try to keep things normalized - it just works better. For example, with CNNs, pixel values are between 0 and 255; you divide by 255 so they're between 0 and 1.
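A small sketch of that pixel scaling, plus the mean-0, standard-deviation-1 standardization that comes up next - the numbers here are made up for illustration:

```python
import numpy as np

# Pixel values 0-255 scaled into [0, 1]
pixels = np.array([[0, 128, 255], [64, 200, 32]], dtype=float)
pixels_scaled = pixels / 255.0

# Standardizing one feature column (e.g. square feet) to mean 0, std 1
square_feet = np.array([850.0, 1200.0, 2400.0, 3100.0])
standardized = (square_feet - square_feet.mean()) / square_feet.std()
print(standardized.mean().round(6), standardized.std().round(6))   # ~0.0 and 1.0
```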
One of the big time sinks with machine learning is feature extraction: figuring out which features matter more than others. You might have a hundred features, and five of them are really important, another five are kind of, sort of, and then the other 90 - that long tail - might be almost negligible, so a lot of time is spent figuring that out. Then there's cleaning the data: no duplicates, no incorrect data, no missing data. If you have data that's incorrect, what's the correct value? What do you do? Sometimes you replace it with the average, sometimes you put zero, sometimes you drop the row. What if it represents an outlier - is the outlier significant? If it's the stock market, you bet it is. So you have to have a good, solid understanding of the data, or work with someone who has domain expertise. Can you drop a column, or add a column? It's not obvious. You go through all of that, and then with deep learning, for each of those features you normalize - it's actually standardize, but you'll see "normalize" used even though it means something slightly different - you transform the data so it's N(0,1), a Gaussian distribution with mean 0 and standard deviation 1, so it's all a level playing field. Then how do you figure out which features are more important than the others if everything is N(0,1)? This is my favorite part. The answer: you do nothing, because deep learning does the feature extraction for us. That's the beauty of deep learning, and it's why deep learning thrives on data - there's no such thing as too much data for deep learning. However, if you don't have enough data - maybe you have more columns than rows, for example with image recognition - what do you do? There are standard machine learning algorithms you can use: something like k-means, or an SVM, a support vector machine. Knowing what to do, when, and how means acquiring a certain amount of knowledge of deep learning as well as machine learning, because some of the things that happen are not intuitive. For example, I didn't go into it, but there's something called the dropout rate, from Geoffrey Hinton. I was talking about overfitting, which means some of the noise is being treated as though it were signal. How do you fix that? One technique: you just drop nodes - 20 percent, 30, 40. It seems a little crazy, but it works. The things to do don't necessarily align with your intuition, and that's why experience across different situations is what guides you, and gathering that experience is what's time-consuming; there's no one place that has it all, and if you find one, please tell me. Here's the dropout-rate slide and some other settings; they aren't that important right now, but you may want to come back to them later. How many hidden nodes - we kind of went through that.

CNNs versus RNNs: I'm not going to go into a lot of detail on RNNs; they're more complicated, not intuitive, more difficult to train, and more difficult to describe. The main thing right now with RNNs is LSTMs - long short-term memory - which gives you the ability to keep history. RNNs are stateful entities; CNNs are stateless, if you want the analogy. For example, with a self-driving car you've got images coming in, and each one is processed by the CNN, the convolutional neural network. But if you want to make sure you don't collide with anything, you've got to keep track of the history of where something is moving, and that's where the LSTM comes in.
So you have the image processing, you identify what's there, the LSTM gives you the history, and then you manage to avoid collisions and other things - in theory. That doesn't always work, because there was a car about a month ago, driving at 55 or 65 miles an hour, that hit a stationary fire truck - does anybody remember that? When a self-driving vehicle is following another vehicle and that vehicle moves out of the way, it sees the stationary object, but it's like those signs up over the expressway: you can ignore them, they're not moving. That was essentially the logic, from what I understand, and apparently all of these systems have that flaw. Elon Musk is convinced it can all be done in software; some people say it needs more hardware, more sensors. I guess time will tell how that works out.

So, CNNs. As I mentioned, they're mainly for image processing, but also for audio, and about 60% of all neural networks in use are CNNs, so this is probably worth your while to learn. I'll give you the basic, minimalistic scenario; the variations are the more interesting ones, the ones you'd actually use when solving a problem. What happens is there's this filter process: a convolution, followed by a ReLU, followed by max pooling. The filter process is kind of interesting. You don't have to come up with the numbers: typically you have an image and a 3x3 filter, and the system generates the filters; usually you request, say, eight 3x3s - usually a power of two, eight or sixteen or whatever. So you have nine numbers, and you match them up with the top-left corner of the image; it's like an inner product of two 3x3 arrays: nine products, eight sums, one number out. You move that filter across and generate these numbers to populate another array. If you move over one position at a time, that's the stride; the stride can be one or more, and horizontal and vertical are independent. Since the output is going to be smaller, you can also pad with zeros, and that's done before the filtering, by the way - a little detail. You end up with something called a feature map. The idea is actually based on the way our eyes work: different parts of the eye recognize different shapes - a small vertical line, a horizontal one, maybe an oval shape. It's emulating that - not the neuron part, your actual eye - that's how it was modeled. These feature maps aren't images, though you could treat them as images and you'd see something. However, because the filter numbers are generally small integers between -2 and 2, you can end up with negative values, and that's where ReLU comes in: negatives are replaced by 0. Here's an example: you see that green square, and just the top row; the inner product there gives the 42, so the result is 42. It doesn't start there - that's partway through the process - but that's what I was trying to explain before. That's a very simple filter, probably kind of useless; you need more. Here are some examples: this one sharpens your image; this one blurs - the blur happens because the values are all the same, so as the filter moves across it takes a neighborhood of each point into account, which smooths the peaks but also makes the image a little duller; if that's what you need, that's what you use. And here's one for detecting edges, and emboss.
These filters are in Photoshop and other tools; these, and others like them, are the filters they're using. What would happen if you had just a 1 in the middle? That would be the identity filter, because it just picks up whatever's in that particular cell and replicates it. What if you had a 1 in the middle and a -1 on the left? When two adjacent pixels are the same color, the sum is zero, so as you go across you get 0, 0, 0, 0 - it's all one consistent color, nothing there - and where the value changes, you've hit a boundary. So even that simple little filter can help you detect edges, vertical or horizontal. And it's cumulative: the edges detected feed the next layer of the neural network, which figures out, oh, there are these polygons or ellipses, and then it starts putting features together until it actually recognizes that's a head, that's an arm, and finally it's a man sitting at a table with a cat on the table, or whatever it is.

After the convolution and ReLU comes max pooling - again, the simplest scenario: a 2x2 subdivision, take the largest number, and that gives you something half as wide and half as tall. You're throwing away 75% of the values. Why does this work? I'll give you an analogy with compression algorithms for binary files: there are two types, lossless and lossy. What is JPG? It's lossy, but it works. That's kind of the idea. However, put a big asterisk next to this, because Geoffrey Hinton, who was involved in coming up with this, said - I don't have the exact quote, but it's pretty close - that the success of max pooling has been a disaster for convolutional neural networks. He's one of those soft-spoken, brilliant contrarians, and he's been right so many times that when he says something, he's probably onto something. His alternative is something called capsule networks; we'll look at that a little later. So: we do the filters, we get the feature maps, do the ReLU and the max pooling, and then do it all again. The filters are for extracting features; then we have to do something for the classification, and that's the fully connected layer. Because of the processing, those feature maps have to be stretched out into one-dimensional vectors, all strung together; each of those points is a neuron, and each of those neurons is connected to the output. Here it happens to be four outputs, but in something like MNIST it would be the digits 0 to 9. These are like buckets, and that's where the softmax comes in: the whole thing connected, softmax to the output. I'm skipping details: there's also a modified version of the backpropagation we described earlier; it's a little different because max pooling isn't a differentiable function, so the framework does some internal bookkeeping to keep track, and it all works. Again, it's applying all these images and updating the values of the filters to get better feature maps and a better fully connected layer. At the end there's a set of numbers between 0 and 1 whose sum is 1; take the maximum, and it's a dog, or it's a three. Remember, these are only approximations: it doesn't come out as a 100% probability, it might be 80%, but it works - even though it's not close to 100%, on average, in the aggregate, there's a high percentage of success.
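Here is a hand-rolled sketch of one convolution + ReLU + 2x2 max-pooling step as just described - the 4x4 "image" and the sharpening filter are made up for illustration:

```python
import numpy as np

image = np.array([[1, 2, 0, 1],
                  [0, 1, 3, 1],
                  [2, 1, 0, 0],
                  [1, 0, 1, 2]], dtype=float)
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],          # a simple sharpening filter
                   [ 0, -1,  0]], dtype=float)

h = image.shape[0] - kernel.shape[0] + 1   # output height (stride 1, no padding)
w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        patch = image[i:i+3, j:j+3]
        feature_map[i, j] = np.sum(patch * kernel)   # "inner product" of 9 numbers

feature_map = np.maximum(0, feature_map)   # ReLU: negatives become 0

# 2x2 max pooling: keep the largest value in each 2x2 block
pooled = feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(feature_map)
print(pooled)                              # half as wide and half as tall
```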
So the idea is coming up with convolutional networks that are better. Two years ago somebody won a competition - remember I mentioned trying things - and what these guys did was the max pooling immediately, no processing first: they threw away 75% of the image, and they won the contest. That was one of the things they did, among others. It's these kinds of simple, not necessarily intuitive combinations of things that work, based on your data set, and there's no real a priori way of knowing what will be best; that's where the creative work comes in, and of course all the processing time involved, so it can be very time-consuming. So you get a team: one person does infrastructure, one does algorithms, one does modeling, maybe a fourth person, and you pool your knowledge; that usually works better than flying solo in those competitions. That gives you a view of the first part of convolutional neural networks.

At this point I just want to pause to say one thing. You might be thinking, you know, this deep learning stuff - maybe they just got lucky a few times, but most of it is just fluff. That's a fair assessment, and I have two things to say. First, there's something called the universal approximation theorem, which states that any continuous function in the plane can be represented arbitrarily closely by a neural network. For those of you who remember Taylor series - a polynomial expansion of a continuous (well, differentiable) function - and Fourier series - a combination of sines and cosines, used for partial differential equations, those boundary value problems I really liked - now we have this. I was actually quite surprised, but it really does work. And there are a lot of continuous functions in the plane: remember, the differentiable functions are a subset of the continuous ones, and there are plenty of those. In fact, there's an uncountably infinite number of continuous functions in the plane, each of which can be represented arbitrarily closely by some neural network, which tells you that the expressive power of neural networks is immense. If that doesn't convince you, that's OK. Last summer, I think it was, a startup created a barcode scanner for blind people. The state of the art until then was a $1,300 device; theirs was $20, using deep learning. They trained it, and apparently it would scan the barcodes, and after a while this little barcode scanner learned how to read the dietary information - the ingredients, the percentages. Nobody trained it for that; maybe it was planned for version two of the product, I don't know. No one could explain it. My answer is: the power of deep learning. It's a nice little feel-good story. Some of you might be thinking, first we've got the barcode scanner and then it's Skynet; I'm not worried. Anyway, that's just a little anecdotal background to suggest that maybe there is something to this stuff.

I mentioned capsule networks. They're an alternative to CNNs, meaning without the max pooling: you take that out, and instead of individual hidden layers, the layers are grouped together into containers, or capsules, and there's a routing mechanism - a voting algorithm, rather.
There's code on GitHub for this. The purpose is to try to capture the relationship between the whole and the part. For example, take a face: two eyes, a nose, a mouth. Then there's something I call the Picasso face, where the nose is where the mouth should be and the mouth is up by the eye. A standard CNN is translation invariant, so it goes, oh yeah, there's a mouth and a nose and two ears, so it's a face. Capsule networks are not as prone to being deceived by that; they'll detect that it's not a face. Why am I bringing this up? Because of adversarial examples - you may have heard of generative adversarial networks. There are ways to take an image and modify the pixels in a way that's imperceptible to the human eye and yet defeats any neural network. There are lots of techniques to defend against that; all of them have been defeated. Why does that matter? Well, if you're in your self-driving vehicle, and there's a stop sign, and it thinks it's a speed-limit sign, that could be catastrophic. And it gets worse: a few months ago somebody put a paper on arXiv with an algorithm that describes how to mess up the image by modifying one pixel. So it's obviously very important, and capsule networks are more resistant to that sort of deception, if you will. They can also do some other things better than other networks; however, they're difficult to train, they're slower, more complicated, not perfect - there are flaws and so on. Geoffrey Hinton has been working on this since 2011, so he's been very tenacious about it, and you can find material online. As for generative adversarial networks themselves, the interesting thing is that originally the idea came from Ian Goodfellow, four years ago, to generate synthetic data: if you don't have enough data, generate some and mix it in. The adversarial side was kind of the concomitant effect, if you will; nobody expected that, and I'm not sure who came up with it, but there you have it.

So, Keras. Keras is written by someone working at Google, and it's a layer on top of TensorFlow. As I mentioned before, it's more intuitive and a lot easier, really, when you're first starting, because with its APIs you don't have to understand what's going on with the graph underneath. It sits on top of TensorFlow, and also Theano, and another one - CNTK, there we go. I'll just show you; we've been talking about models for the last 45 minutes. For Keras, we import Sequential - that's just a container (there's another type, the functional model; not to worry about that one) - and from layers, Dense and Activation. We talked about activation; dense means fully connected: between two layers, every node is connected to every node in the next layer. Look what we have: Sequential, Dense with an input shape. When you have an image that's 28 by 28 - remember I said you have to stretch it out into a one-dimensional vector - 28 squared is 784, so that's the input: the pixel values, numbers between 0 and 255. Activation ReLU, another Dense layer, activation softmax. You already kind of know what this is doing. And then the model summary. By the way, you put this in, say, abc.py, type python abc.py, and run it, and this is what you get for the summary: it lists each Dense layer and each Activation, and gives you the parameter counts.
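Here is a hedged reconstruction of the kind of Keras script being described - not the exact slide; in particular the hidden-layer size of 32 is my own assumption:

```python
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([
    Dense(32, input_shape=(784,)),   # 28x28 image flattened to a 784-vector
    Activation('relu'),
    Dense(10),                       # one output "bucket" per digit 0-9
    Activation('softmax'),
])
model.summary()                      # prints each layer and its parameter count
```

With 32 hidden units (again, an assumption), this comes out to roughly 25,000 parameters, consistent with the count quoted next.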
Even that little neural network has about 25,000 parameters, and it's not unusual to have 500,000 parameters, or 10 million. That, together with the activation function and the cost function and all the rest, is where the work is being done; it's really number crunching. And here's another one - we talked about CNNs. There's Sequential at the top; Dense, which I just mentioned; Dropout, which we talked about; Flatten, for the one-dimensional vector; Activation; and the convolution import, Conv2D - the convolution I was telling you about, the filters - plus MaxPooling, and Adadelta, the optimizer, the gradient descent technique. The input shape says we're going to have a 32 by 32 image with three channels; I didn't go into it before, but a color image is separated into R, G, and B. The model is Sequential, and look what we're doing: add Conv2D with 32 filters, 3x3, padding "same", input shape as above; add Activation ReLU. You know all of this already.

So let's take a look at TensorFlow. It's a deferred computation graph, is what it is; if you've looked at ASTs, abstract syntax trees, it's like those on steroids. It involves tensors, which are multi-dimensional arrays, and what happens is a little bit non-intuitive. Here's more about what it can do - the typical stuff you'd expect - and I'll show you a little of TensorBoard in a couple of minutes. The use cases are pretty much standard, no surprises; I'm going a little quickly here. As I mentioned, there's the notion of the graph: edges, nodes, operations, lazy execution, and a session. In order to actually make something happen, you have to invoke a session and its run method, and then stuff happens; that's why it's deferred. There's also eager execution, which makes it look more Pythonesque. That's more recent and isn't in the standard download (pip install tensorflow); if you want eager execution, pip install tf-nightly - that came in around 1.4, and the latest version of TensorFlow right now is, I think, 1.6 - and if you want it for the GPU, pip install tf-nightly-gpu. So, as I said, you have a tf.Session object and you invoke its run method; the next slide just summarizes the different orders of tensors. Generally you won't go past four-dimensional tensors, and those are actually what you use with CNNs; there are people working with really large systems who use five-dimensional tensors, but it's unlikely you'll be doing that. There are three tensor types: constant, placeholder, and variable. Try not to use constants, because they get saved as part of the graph and can bloat it; you tend to prefer variables, and variables can also be shared. Little details - not to worry, you don't have to memorize them. So look what we have: we import tensorflow as tf - tf is the standard alias - we have a constant, tf.constant; notice sess = tf.Session(), and then we print sess.run() of that constant we defined, which is a zero-dimensional tensor, so the result is 3, and then you have to close the session. There's a slightly simpler way that saves you one line: with tf.Session() as sess, then print inside. It doesn't save a lot; it'll be better when we get to eager execution. A little bit about arithmetic: the operators are full English words instead of symbolic operators, and you get the results you'd expect when you perform them.
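Here is a small sketch of that deferred style, using the old TensorFlow 1.x Session API that the talk is describing (it assumes a 1.x install; the particular values are made up):

```python
import tensorflow as tf

a = tf.constant(3.0)          # a 0-dimensional (scalar) tensor
b = tf.constant(4.0)
total = tf.add(a, b)          # operators are full English words: add, multiply, div...

# Nothing has run yet -- we've only built the graph. A Session makes it happen.
with tf.Session() as sess:    # the "with" form closes the session for you
    print(sess.run(a))        # 3.0
    print(sess.run(total))    # 7.0
```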
We can skip over this. A little bit about some other calculations: notice we have some built-in functions, and we're approximating pi with that value. With our sess defined, let's go from the bottom up. First we do tf.div of tf.sin(pi/4) and tf.cos(pi/4); sine over cosine is tangent, and the tangent of pi/4 radians (45 degrees) is 1. What's the cosine of pi radians? Negative 1. The sine of pi radians? 0. So the results should be 1, -1, 0. Not quite: the last two are correct, but for the first you get an approximate value - and we're also using an approximation for pi - so if you need highly precise calculations, keep that in mind. Why does the tangent still come out essentially right even with approximate inputs? Because sine and cosine of 45 degrees are equal, so the top is the correct value plus some error e and the bottom is the correct value plus some error e1; it's coincidental serendipity, or something like that.

Here we have the part where we can feed in numbers: placeholders. If you've written C programs, it's like writing int x; and then later x = 3; - you declare it, then you define the value. So we have a feed dictionary, and look at the bottom: sess.run of c, and we pass in the feed dictionary, because c is the product of a and b, so we have to pass in values for a and b. That's how it works. Based on that idea, this next one probably makes more sense too. What are we defining here? W, x, and b. What does that suggest? W times x plus b: linear regression. We define W times x, called Wx, and then y is Wx + b. This is all deferred; nothing's executed yet. See the feed dictionary: for Wx, W is fixed but x doesn't have a value, so we pass in a value for x and get that result. If we want y, which is Wx + b, we have to pass in something for x and something for b; that's what the middle line in green is doing. So you can see the start of how you could do linear regression here - pretty much anything you want - but you have to build up the pieces yourself. Compare that with Keras: two slides ago you saw what a convolutional neural network looks like, and here we're just doing a line. I'm trying to be impartial, because part of the purpose of a presentation like this is to show you the different things that are available, so you can make a more informed decision about what to do given the constraints you're under.

Here's an example. The line in the middle - I didn't mention global_variables_initializer; that's another method that has to be invoked to initialize all the things that have been declared at the beginning but not yet initialized. You have to invoke it. If you run this line in the middle, the FileWriter, it goes into that directory, puts in a file, and saves the session graph; then when you go into your browser - I think I have it here - TensorBoard shows the graph. It's pretty trivial here, but it shows it to you, you can highlight things, expand them, and it gives you the information on the left; very nice when you're trying to do some debugging. We didn't give these nodes names, but you can do that, and the thing is, if you have 50 nodes it's going to get a little messy, because the graph can be quite complex - or a hundred, or 500.
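Here is a sketch combining the pieces just described - placeholders fed through feed_dict, y = W*x + b, variable initialization, and writing the graph out for TensorBoard. It's TensorFlow 1.x style, and the values and directory name are my own assumptions:

```python
import tensorflow as tf

W = tf.Variable([3.0], name="W")
b = tf.Variable([7.0], name="b")
x = tf.placeholder(tf.float32, shape=[1], name="x")   # like "int x;" in C -- declared, defined later

Wx = tf.multiply(W, x, name="Wx")
y = tf.add(Wx, b, name="y")                            # y = W*x + b, still deferred

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())        # initialize the declared variables
    print(sess.run(Wx, feed_dict={x: [2.0]}))          # [6.]
    print(sess.run(y,  feed_dict={x: [2.0]}))          # [13.]
    # write the graph so `tensorboard --logdir ./graphs` can display it
    writer = tf.summary.FileWriter("./graphs", sess.graph)
    writer.close()
```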
these things are inside of this one component, and then all the others in another component, for example. There's another benefit to doing that beyond just the cleaner graph, which is not intuitive, but when you hear it it'll make sense: when you separate things into separate components, TensorFlow can execute those components on different CPUs and GPUs in parallel, and you get that for free, which is nice, another incentive for doing that, in case you need to be concerned about that. So just going back to this, let's see what we have: eager execution. As I mentioned before, you have to get the specific download when you install it, there it is right there, I already told you, and you need Python 3.x, and I already kind of told you what it does. So you have that line, tfe.enable_eager_execution(), and now we have x defined as a small array tensor, multiplied by itself we get 4, which is pretty much what we wanted in the first place. So which do you prefer, eager execution or the regular style? Obviously this is going to be better. Now as far as performance, I don't know, if you have a really large massive system, one with traditional TensorFlow and the other with eager execution, what the difference is in execution time, but that's something you can try. So here's a little bit of TensorFlow and a convolutional neural network: we have a Python function where we're passing in all this stuff and it will construct the neural network; this is kind of like the decorator pattern, if you're familiar with that, yeah, I think it is a decorator. So what we do here, notice tf.layers, the other stuff is of course not shown, and with layers.conv2d notice we have the input layer, and this means that it's 28 by 28 and there's one channel; this negative one just tells the reshape to infer that dimension, the batch size, let's not worry about that. Filters 32, kernel size 5 by 5, padding, well, you know. One thing I didn't mention before about the variations with the kernel size or the filter size: 3 by 3 is kind of standard, but you could do 5 by 5 obviously, and you can do 1 by 1; those guys that won the contest with the ReLU, with doing the max pooling first, they also did 3 by 3, 5 by 5, 1 by 1 and then they merged it all together. Is that intuitively obvious? I don't think I would have thought of that, but that shows you kind of the idea, trying different sorts of things and seeing how they work. Generally they are odd sizes so that there will be one center point, that's a tiny little detail. So now we have the pooling layer at the bottom: after the conv we do pool size 2x2, strides 2, so it's a 2x2 window but it moves over by two, vertically and horizontally, not so bad. And then a little bit more: there's another convolutional layer and then there's another pool, just like we did before, so this code is actually not bad, and there's more stuff; if you want to see the full code it's right here. And actually what I did do is I did have GANs, I just had it in a different place. So we have a panda on the left, the weird thing in the middle, and a panda on the right, but look at the confidence there, it's a gibbon with 99% confidence. Do you see any difference between the left and the right? I don't, but look at all that stuff, and remember you can do a one-pixel modification and defeat the neural network. So I think I've actually already shown you all of this, and up until recently the focus was on static images; you can also create your own if you want, there's a GitHub link to get the code, and also with MNIST
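Two small sketches of what this part of the walkthrough appears to show; they are separate, independent snippets, and the exact values and layer counts are assumptions. First, eager execution in the 1.x nightly builds:

    import tensorflow as tf
    import tensorflow.contrib.eager as tfe

    tfe.enable_eager_execution()      # call once, at program start
    x = [[2.0]]
    m = tf.matmul(x, x)
    print(m)                          # tf.Tensor([[4.]], ...) -- no session required

And second, the tf.layers-style convolution and pooling for a 28x28 single-channel input (the 64-filter second layer is an assumption):

    import tensorflow as tf

    def cnn_layers(features):
        # -1 lets reshape infer the batch dimension; 28x28 image, 1 channel
        input_layer = tf.reshape(features, [-1, 28, 28, 1])
        conv1 = tf.layers.conv2d(inputs=input_layer, filters=32,
                                 kernel_size=[5, 5], padding='same',
                                 activation=tf.nn.relu)
        pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
        conv2 = tf.layers.conv2d(inputs=pool1, filters=64,
                                 kernel_size=[5, 5], padding='same',
                                 activation=tf.nn.relu)
        pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
        return pool2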
and here's the part I wanted to show you, it's very hard to see. What people have done recently is applying GANs to audio, so you can corrupt a sound file in the sense that it'll say something else than what you expected; I think it gives new meaning to fake news. So in my kind of warped mind I'm thinking we're gonna get to the day of fake images and fake news where we wake up one morning and go, what's my name? I don't know, it's fake, everything's fake. Sorry, off on a tangent here. There's also a nice little link over there; if you go there, I have no affiliation with anything, I just found this, you upload two images and it does the convolution, and there's a lot of public stuff there that people have done that's really nice, and you can upload your own. I won't go into the ones that I did, but I'm not an artist, so I took my SVG code and some JavaScript, generated some images, and then I took some celebrities and kind of merged them together; some of them were nice, some of them are a little lame, but you can try it on your own and see. And there's a ton of stuff you can learn; if this felt like a fire hose, it's just a trickle, that's what somebody told me actually about my presentation about three months ago. There's lots of stuff that you can do, and you know the T model for learning, where you do something in depth and then you go horizontal; I call it the pyramid model, where you've got a pile of sand and if you want to add another foot or more vertically you've got to pour a ton of sand because it spreads out, so you're learning horizontally and vertically, and that's very time-consuming, especially if you're doing it on your own to find stuff. So I recommend: there's Udacity stuff, Udemy videos, there are videos on YouTube, Kaggle, blog posts, a little of this, a little of that; go to meetups, talk to people, share the knowledge and get some reinforcement, because it really is a lot about reinforcement and repetition as well as the technical details; the familiarity is a very significant part of that. And just about done, last two slides, just a few of the books I've written; the regex book is coming out I think in May, and I do some training, and that is basically it. Hope you got something out of it, thanks for your attention. Thank you so much, so if folks have any questions I can bring you the microphone, raise your hand, we'll start in the back. Only questions that I know the answer to, please, that's the requirement. Okay, sweet, I'll bring you the microphone. Thank you for the nice presentation, can you speak a little bit more about the one-pixel attack? I have not read the algorithm, and I'm sort of worried that when I read it I'll get scared that it is so simple anybody can do it; that's sort of half facetious, but if you read the paper online, I don't know the details, but apparently they have succeeded in constructing such an algorithm. I don't know the details, but someday I will make myself read it. Awesome, any other questions, if you raise your hand? Gentleman in the back, oh yes, so also I will post the slides on SlideShare within like the next two weeks, and this was recorded, thank you Mark, and we'll post it on YouTube as well within the next two weeks. Yep, anything else? All right, so actually let me check here as well, I think we had two more questions, let me see. Somebody asked how to test a neural network; in the software world it's either true or false. Very good question, I should know the answer, I do not; there are a few blog posts you can find online that address that
specifically, and so yeah, I'll figure that one out too. All right, and the final question: is the neural network training process similar to methods like the Newton-Raphson method to compute square roots, where you iterate and converge toward the value? Yes, you can use Newton's method, that is actually one technique that's used. Here it's all gradients that are computed, these are first-order derivatives, but you can use a quadratic kind of approach, a second-order iteration, I forgot the exact name, but yes, there are those techniques. Awesome, okay, any more questions? All right, well thank you so much, thank you, and thank you Bri for coming, all right, appreciate it. [Applause]
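As a side note on that last question, here is a tiny sketch contrasting Newton-Raphson for square roots, a second-order-style iteration, with a plain first-order gradient-descent step on the same problem; this is a general illustration, not anything from the slides:

    # Newton-Raphson for sqrt(a): converges in a handful of steps
    def newton_sqrt(a, x=1.0, steps=10):
        for _ in range(steps):
            x = 0.5 * (x + a / x)          # x_{n+1} = (x_n + a/x_n) / 2
        return x

    # Gradient descent on f(x) = (x*x - a)**2: first-order, like neural-network training
    def gd_sqrt(a, x=1.0, lr=0.01, steps=1000):
        for _ in range(steps):
            grad = 4 * x * (x * x - a)     # df/dx
            x -= lr * grad
        return x

    print(newton_sqrt(2.0))   # ~1.41421356
    print(gd_sqrt(2.0))       # also approaches ~1.4142, but needs many more steps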
Info
Channel: H2O.ai
Views: 16,279
Id: URERdVb-lpg
Length: 83min 46sec (5026 seconds)
Published: Mon Mar 19 2018