A friendly introduction to Deep Learning and Neural Networks

Captions
Hi, and welcome to a friendly introduction to deep learning. My name is Luis Serrano, and I work at Udacity. Let's start with a question: what is machine learning? That's the question we're going to answer, so let's begin with an example. We have a human and a cake, and our goal is to tell the human to get the cake. We can do this very easily with one instruction: go get the cake. The human realizes what to do, goes, and gets the cake.

Now let's say we have the same problem, but we want to solve it with a robot. For a robot it's not that easy; we can't just say "go get the cake." We have to give it a full set of instructions: first, turn right; then go ten steps; turn left; go four steps; and get the cake. That's an OK solution, but first of all it's a little complicated, and second of all it's not general. If the robot is in a different position, we need a completely different set of instructions: turn right, go three steps, turn left, go five steps, grab the cake. We want a general solution; just as with the human, we want to give the computer a single instruction that finds the cake no matter where the robot starts.

So this is what we do. We say: calculate the distance to the cake. The robot calculates it; it's five. Then we tell the robot to move around itself and pick a direction in which the distance decreases. The robot steps one way and sees that the distance is 4.5; it tries a different direction and gets 5.3; another direction, 5.8. So it realizes that the best move was to the right, the one that gave 4.5, and it takes a step in that direction. The next instruction is: repeat what you just did until you find the cake. The computer keeps looking around itself, taking steps in the direction that decreases the distance the most, until it gets to the cake.
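The robot's procedure (measure the distance, probe each direction, step where the distance shrinks the most, repeat) can be sketched in a few lines. This is a minimal illustration of my own, not code from the video; the step size, the four directions, and the goal position are arbitrary choices.

```python
import math

def distance(pos, goal):
    # Straight-line distance from the current position to the cake.
    return math.hypot(goal[0] - pos[0], goal[1] - pos[1])

def walk_to_cake(start, cake, step=1.0, max_steps=100):
    pos = start
    moves = [(step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)]
    for _ in range(max_steps):
        if distance(pos, cake) < step:
            break  # close enough: we found the cake
        # Look around yourself and pick the neighboring position
        # that decreases the distance the most.
        pos = min(((pos[0] + dx, pos[1] + dy) for dx, dy in moves),
                  key=lambda p: distance(p, cake))
    return pos

print(walk_to_cake((0.0, 0.0), (3.0, 4.0)))
```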
This, in a nutshell, is what machine learning is.

Let's look at another example, maybe a little more complicated. We're on top of a mountain, let's call it Mount Everest, and we want to descend it, but it's very foggy, so we can only see what's right around us; we can't see the whole mountain. What do we do? We look at our surroundings and try to figure out in which direction we can descend the most, and we take a step in that direction. Then we look at our surroundings again, find the direction in which we descend the most, and take a step in that direction, and we keep doing this until we successfully descend the mountain. This algorithm is called gradient descent. If you like math, what we really did was take the derivative, or gradient, of our height function; the negative of the gradient points in the direction of steepest decrease, and we take a step that way. In the earlier cake example, we took the derivative of our distance to the cake, which points toward the direction of greatest decrease, and we kept doing that. That's why it's called gradient descent.

So let's look at the problems we just solved. The first problem was to get the cake, and the second was to descend the mountain; now think of any machine learning problem as "solve some problem." How did we solve the cake problem? We minimized the distance to the cake. How did we solve the mountain problem? We minimized the height. So to solve any machine learning problem, we want to find a metric that tells us how far we are from the solution, just like the distance to the cake or the height above the ground: we want to define the error, and we want to minimize it.
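Gradient descent itself can be sketched the same way. Here a one-dimensional "height" function stands in for the mountain, and the slope is estimated numerically; the example function, learning rate, and step count are illustrative choices of mine, not from the video.

```python
# height: the function to minimize; x0: starting point on the "mountain".
def gradient_descent(height, x0, lr=0.1, steps=200, eps=1e-6):
    x = x0
    for _ in range(steps):
        # Numerical derivative: the local slope of the height function.
        slope = (height(x + eps) - height(x - eps)) / (2 * eps)
        x -= lr * slope  # step in the direction that decreases the height
    return x

# Example: (x - 2)^2 has its minimum at x = 2.
x_min = gradient_descent(lambda x: (x - 2) ** 2, x0=10.0)
print(round(x_min, 3))
```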
That's the key to most machine learning problems: you take your problem, you define the error, and you minimize it using gradient descent, looking around yourself and figuring out in which direction the error decreases the most, by calculating some derivative. This is how we solve most problems in machine learning. For example, we can teach a computer how to play Go, or how to play Jeopardy, or, a very cool application, we can teach a car how to drive itself, and many more: we can teach a computer how to detect spam email, or how to recognize faces in a picture, plus finance applications, medical applications, many applications in general. At the core of all these applications is the concept I'm going to teach you today: neural networks, which are the basis of deep learning.

So what is the first thing that comes to your mind when you think "neural network"? I don't know about you, but when I first heard "neural network," this is the first thing that came to my mind: some robot somewhere with claws, a brain, some liquid, something scary and futuristic. But then I started seeing actual neural networks, and I realized that they are a lot scarier than that. This is an actual neural network: there are these little nodes all over the place, edges between them, and things called layers and hidden layers. Very ugly. But I've been working with neural networks and trying to understand them for a while, and I finally see them in a different way. When I think of a neural network, I think of a kid playing in the sand, with some red shells and some blue shells, and you tell the kid: can you draw a line to separate the red shells from the blue shells? That's exactly the goal of a neural network: you have some data in the form of red points and blue points, and you tell the computer, hey, can you split this data? The computer says: well, yes, that's what I did, I drew a line.
Or maybe it finds some more complicated solution, but the idea is to teach a computer how to split data. And how do we do this? In the same way that we found the cake or descended the mountain.

Let's look at a small example. This is our data: three blue points and three red points, and we're going to teach the computer how to split them. As in the cake problem, we start at some random position; that is, we draw a line somewhere, and I'll randomly declare that the points on top are predicted to be blue and the points on the bottom are predicted to be red. Now, how good is that solution? We can see it's not ideal: it correctly classifies four of the points, two blue and two red, but it gets two of the others wrong. The point on the very left it thinks is blue when it's red, and the point on the very right it thinks is red when it's actually blue. So we need an error function; we need to tell the computer how badly it's doing, just as with the distance to the cake or the height on the mountain. One sensible-looking error function is the number of points I'm misclassifying. How many errors do I have here? Two. So I can do gradient descent, moving the line around to see where I can decrease the error the most; maybe I manage to decrease the error by one by correctly classifying one point, and then do it again and correctly classify the other point, and now I've found my solution: zero errors.

This is almost the correct approach, but there's a very small subtlety that makes it actually impossible to solve the problem with that error function, and I'm going to show you what the subtlety is. Here are our problems again: the get-the-cake problem, the descend-the-mountain problem, and our third problem, split the data. Remember: how did we get the cake?
By calculating the distance to the cake and minimizing it. How did we descend the mountain? By calculating the height and minimizing it. And now we think we can solve the split-the-data problem by minimizing the number of errors. But here's the problem: the distance to the cake is a continuous function. If I move around a little bit, in tiny steps, the distance decreases in very, very tiny amounts; in fact, that's exactly why I can take the derivative. The same goes for descending the mountain: if I move in tiny steps, the height decreases in tiny steps, so I can actually see where I can decrease it the most, or calculate the gradient. It's not the same with the number of errors; that's actually a discrete function. I can have two errors, move the line around a little bit, and still have two errors; move it a little more, still two errors; and then all of a sudden it jumps all the way to one, or to three, or to zero. That's what's called a discrete function: it's not continuous, and in particular I can't take its derivative, and if you remember, the basis of gradient descent is taking a derivative. In pictures, what I'm trying to do is use gradient descent to descend a flight of stairs instead of a mountain: I look around myself, and since it's foggy and I can only see my immediate proximity, it almost feels like I'm standing on flat land; I can't figure out in which direction I descend the most, so I get stuck.

So we need an error function that is continuous; let's try to build one. By the way, the algorithm this leads to is called logistic regression. The way we're going to build this function is the following: we give every point a penalty. The misclassified points, that red point on the left and that blue point on the right, get a big penalty, because they're wrong.
The points that are correctly classified also get a penalty, except a much smaller one. You can see the penalty here as the size of the point. Then I add these penalties together, and that's my error function. Now I can move my line around, which is the equivalent of looking around me and seeing where I can descend the most. Say we move the line to this position: every penalty has changed, and some have slightly increased, but the big ones have decreased by a lot, so when I add them up I get a much, much smaller error. The point now is to minimize this error, and we can use gradient descent, because the function is continuous.

Let's look at a more visual representation of this. On the left I'm descending the mountain; on the right I'm trying to split my data. My error function on the right is the sum of the penalties of the points, and on the left it's the height, and I've drawn them so that they add up to the same thing. Now I move my line around, which is the equivalent of looking around me on the mountain; I pick the direction in which I descend the most, I move in that direction, and I've managed to decrease my error. Again I look at all the directions I can go, take a step in the one that decreases the error, or the height, the most, and eventually I've decreased my function to a minimum, thus finding my best possible solution.

But I left a detail behind: I didn't tell you how to build this error function, so now I'm going to tell you. Let's remember what we're trying to do: find a line that splits the data. Actually, I want more than a line; I want a whole probability function. I want to be able to color every single point in the plane, so that the points at the top right are very, very blue, and they get less and less blue as they approach the line.
On the line itself is the 50/50 region, where points are neither blue nor red; below the line is a red area; and at the very bottom right the area is very, very red. The color is the likelihood of the point you're going to find there: if I place a point somewhere near the top, it's very likely to be blue; if I place it somewhere in the bottom right, it's very likely to be red; and if I place it close to the 50/50 line, then, as the name suggests, the probability is 50% that it's blue and 50% that it's red. So that's a probability function.

And how do we build a probability function? First, let's mark every point in the plane by its blueness. Take our favorite line and draw it in the middle as the zero line, then take some spacing and translate this line by 1, by 2, by 3, and by minus 1, minus 2, minus 3. Now we've given every point in the plane some number from minus infinity to infinity, and that number is its amount of blueness. A very blue point gets a large number, like 27 or 28.5; a very, very red point gets something like minus 100; and the points that are sort of neutral get something close to zero. But this is not a probability function, because probabilities go between 0 and 1, and these numbers go between minus infinity and infinity. What I really want is something like the plane on the right: I want my zero line to be my 0.5 line, my 50/50 line; as I go up into the blue area, I want the numbers to get closer and closer to 1; and as I go down into the red area, I want the numbers to get closer and closer to 0, because in the red area you're very unlikely to be blue, whereas in the blue area you're very likely to be blue. To turn the graph on the left into the graph on the right, we're going to use an activation function: a function that takes the entire number line into the interval (0, 1).
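A standard activation function with exactly this property is the sigmoid, 1 / (1 + e^(-x)), which is the one described next. As a quick numerical sketch (my own illustration, not code from the video):

```python
import math

def sigmoid(x):
    # 1 / (1 + e^(-x)): maps the whole number line into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # 0.5: the 50/50 line
print(sigmoid(5))    # close to 1: deep in the blue area
print(sigmoid(-10))  # close to 0: deep in the red area
```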
On top you see the number line, with 0 in the middle, then 1, 2, 3 to the right and minus 1, minus 2, minus 3 and so on to the left, and every point gets mapped to some point in the interval (0, 1). For example, 0 gets mapped to 0.5; 5 gets mapped pretty close to 1, around 0.99; and minus 10 gets mapped very close to 0. In general, huge positive numbers get mapped to something very close to 1, huge negative numbers get mapped to something very close to 0, and the numbers around zero get mapped to numbers around 0.5. I like to picture this function in my head as a magnifying glass, because if you look at the entire number line through a magnifying glass, the points to the far right and far left get crunched into the sides, while the points in the middle stay roughly the same. In mathematical terms, the formula for this activation function, which is called the sigmoid, is 1 / (1 + e^(-x)). You can see that if x is large, the function is very close to 1; if x is a large negative number, it's very close to 0; and at exactly 0, it gives 0.5.

So what we're going to do is take our numbering of the points on the left, apply the activation function, and get a whole probability function: every point in the plane gets assigned a number from 0 to 1, and that number is the probability that a point at that position is blue. In particular, the points that are 50/50 blue or red lie on the line in the middle.

Now let's look at these four points and try to calculate some probabilities. Take a random solution, obviously not a very good one: a random line that tries to split the points and gives us a whole probability distribution, and let's figure out how likely it is that these four points have the colors this probability function says they have.
Look at the point at the very top left. That point is red, but it lies in a pretty blue area, so the probability of it having its actual color, red, is only 0.1. Then there's the 0.6 point at the top right: it's blue, and it lies in the blue area but close to the line, so it's somewhat more likely to be blue than red. Then there's the 0.7 point at the bottom left: that point is red and in the red area, so it's correctly classified; its probability of being red is more than 0.5, and since it's fairly well into the red area, let's say 0.7. And at the bottom right we have the 0.2 point, because it's a blue point living in a red area, so it's very unlikely to be blue. Now, if we treat these as independent events, the probability of all four of them happening is the product of the probabilities. So with these four points, this line, and this probability function, the probability that the points have those colors is the product of the four probabilities, which is 0.0084. Not very likely.

Now let's move the line around and look at this arrangement, which seems a little better. What are the probabilities that these points have those colors? The two points on the left are red and well into the red area, so let's say their probabilities are 0.8 and 0.6, and the two points on the right are blue and well into the blue area, so they probably deserve 0.7 and 0.9. The probability that these four points have those colors is the product of the four probabilities, which gives 0.3024. That's a lot more likely than the first arrangement, so we can say this solution is more likely than the previous one. What we just did is called maximum likelihood, and that's how we're going to calculate our error function. Now, I promised you an error function: a penalty that I can assign to each point and then add up.
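Under the independence assumption, the two likelihoods can be checked directly. This small sketch of my own multiplies the per-point probabilities from each arrangement:

```python
from functools import reduce

def likelihood(probs):
    # Probability that every point has its observed color:
    # the product of the individual probabilities (independence assumed).
    return reduce(lambda a, b: a * b, probs, 1.0)

bad_line = likelihood([0.1, 0.6, 0.7, 0.2])
good_line = likelihood([0.8, 0.6, 0.7, 0.9])
print(bad_line)   # ~0.0084
print(good_line)  # ~0.3024
```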
So far, all I've done is multiply probabilities, so let me deliver on my promise. Here are our two arrangements and their two probabilities: the probability of the first arrangement happening and the probability of the second. Now, a question: how do we turn a product of things into a sum of things? The answer is: we take the logarithm. I take the logarithm, and I also multiply by minus 1 so that I get positive numbers, and I do the same thing on both sides. On the left I get minus the logarithm of 0.0084, which is around 4.8, and on the right I get minus the logarithm of 0.3024, which is around 1.2. In general, the higher the probability, the smaller its negative logarithm. So now we have a genuine error function: to each point we assign the negative logarithm of its probability, and that's the penalty for that point; adding the penalties gives the total error. For example, look at this point: it's pretty badly misrepresented, because it's a red point well into the blue area, so its penalty is 2.3, the negative logarithm of 0.1. The same point on the right is well represented, because now it's a red point in the red area, so its penalty is minus the logarithm of 0.8, which is about 0.2. So I've finally delivered on my promise: an error function, a penalty on every point that tells me how badly misclassified that point is, and when I add them up I get my total error. The algorithm of defining the error function this way and solving the problem using gradient descent is called logistic regression.
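The negative-logarithm trick can also be checked numerically: the product of probabilities becomes a sum of penalties (this quantity is commonly called the cross-entropy). A small sketch of my own, using the numbers from the two arrangements:

```python
import math

def cross_entropy(probs):
    # -log turns a product of probabilities into a sum of penalties:
    # confident correct predictions cost little, confident mistakes a lot.
    return sum(-math.log(p) for p in probs)

print(round(cross_entropy([0.1, 0.6, 0.7, 0.2]), 1))  # 4.8
print(round(cross_entropy([0.8, 0.6, 0.7, 0.9]), 1))  # 1.2
```

Note that minimizing this sum is exactly the same as maximizing the product of the probabilities, which is why maximum likelihood and minimizing this error function agree.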
Now, why are these things called neural networks? We'll get there, but here's a little visual representation. Look at this line that splits the red points from the blue points; its equation is, say, 2x + 7y = 4. The blue points are the points in the plane where 2x + 7y is bigger than 4, and the red points are those where 2x + 7y is smaller than 4. We're going to represent this as a little node, where the inputs are x and y, the two coordinates, and the 2, the 7, and the 4 sit on the two edges and in the central node. You can see that 2 times x is the 2 on the edge times the x in the input node, and the same for 7y, and the constant is the 4 in the middle node. This is called a neural network because the node kind of resembles a neuron. On the left you have the two entry points, which are x and y, and what the node does is check whether the point (x, y) is in the blue area or the red area, the positive side or the negative side: it calculates the probability that the point is blue, and then it just outputs that number, say 0.9. In the brain, a neuron takes a bunch of nervous impulses as input through the dendrites, and sends out another nervous impulse through the axon; that's roughly the analogy happening here. And just as in the brain, where the output of one neuron becomes the input of another, in a neural network we're going to have a bunch of little nodes like the one on the left, where the output of a certain node comes in as the input of another node. But we'll see that more carefully in a minute.
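The single node just described can be written as a tiny function: weights 2 and 7, bias minus 4 (moving the 4 to the other side of the equation), and a sigmoid output. This is my own sketch of the idea, not code from the video:

```python
import math

def neuron(x, y, w1=2.0, w2=7.0, b=-4.0):
    # Score the point with the line 2x + 7y - 4, then squash with the
    # sigmoid to get the probability that (x, y) is blue.
    return 1.0 / (1.0 + math.exp(-(w1 * x + w2 * y + b)))

print(neuron(1.0, 1.0))  # 2 + 7 - 4 = 5: well into the blue area
print(neuron(0.0, 0.0))  # score -4: well into the red area
```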
Now let's make things a little more complicated. Say our data looks like this, red points and blue points, and we want to split it, but now we can't really use a line, because this data is not linearly separable. So what's the next thing after a line? We can use a curve like this one, which correctly splits the data. But then, how do we get curves? Well, let's use what we have. We know how to construct lines, so what we're going to do is take, for example, the line on the top, which is an OK solution, and the line on the bottom, which is also an OK solution, and somehow merge them to get our perfect solution. Looking at the merging of these two lines, the top right region is blue in both, and it kind of resembles the curve we want. So we're going to learn how to take the two regions on the left and turn them into the region on the right. In other words, think of it like arithmetic: you take two regions and you add them to get a third region. In this case we take the two linear regions and add them to get the curved region on the right.

So let's see how we can add two regions. Pick a random point, say this one. According to the model on top, this point has a probability of being blue of 0.7. Now take the same point on the bottom model, and let's say its probability of being blue is 0.8. What would be a way to add these two? We can think of it as just adding the probabilities, so in our sum the value at this point is 0.7 plus 0.8, which is 1.5. But now we run into the same problem as before: 1.5 is not a probability, because probabilities have to be between 0 and 1. If we remember correctly, the last time we had to deal with this, we used the activation function, so that's all we do: we apply the activation function and turn this 1.5 into 0.82. In summary, to add the two regions on the left and obtain the one on the right, we take the probability the top region gives at every point, add it to the probability at the same point in the bottom region, and apply the activation function to that sum.

Now, what happens if I want to combine two regions, but I want the one on top to have more of a say? I want a weighted combination where the area on top has a higher weight, and the area on the bottom is just less powerful.
Then the sum of the areas looks more like this, which resembles the top region much more than the bottom one. The way to do this is to say: I'm going to take my favorite number, 7, and instead of considering the sum of the two areas, I'll take 7 times the first area plus the second area, and that looks a lot more like the first area than the second. I can even choose whatever weights I want; I can say 7 times the first area plus 5 times the second area. The way I do this is, again, by calculating the probabilities at each point, and for the resulting point I take the linear combination: 7 times the top probability plus 5 times the bottom probability. And since linear combinations can also have constants, let me throw in a random one: I subtract the number 6, and here I get 2.9. So I can form any linear combination of the two areas, including a constant, and the way to get the combined area is, again, by applying the activation function. Now I'm going to encode this linear combination of the two areas, with coefficients 7, 5, and minus 6, as a node where the edges carry the 7 and the 5 and the node carries the minus 6.

It now seems we can build our neural network in the following way. Here I have one area where the equation of the line is, for example, 5x - 2y = -8, and I encode it as a neuron in the way I showed you before. Then I have my second area, where the equation of the line is 7x - 3y = 1, and I encode that as another neuron. Let me draw them here again, and now, if I combine these two areas with coefficients 7, 5, and minus 6, I get this area on the right. And here the magic happens: I put all of this together and I get my neural network. I can clean it up a bit, and now, when I look at the neural network on the left, with inputs x and y and all these numbers as coefficients on the edges and inside the nodes, what I'm really thinking is that on the right I am taking one area and another area and combining them to get that curved region.
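The 7, 5, minus 6 combination can be sketched directly: take each region's probability at a point, form the weighted sum, and squash it with the activation function. The helper names are my own, not from the video:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def combine(p_top, p_bottom, w_top=7.0, w_bottom=5.0, bias=-6.0):
    # Weighted combination of the two regions' probabilities,
    # squashed back into (0, 1) by the activation function.
    return sigmoid(w_top * p_top + w_bottom * p_bottom + bias)

# With the probabilities 0.7 and 0.8 from the example:
# 7*0.7 + 5*0.8 - 6 = 2.9, and sigmoid(2.9) is about 0.95.
print(round(combine(0.7, 0.8), 2))
```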
So if you're going to remember one thing from today's video, remember this: whenever you see a neural network, like the one on the left, think of it as a combination of regions, like the one on the right. This is the mental picture I have when I see a neural network. My input layer is the x and the y, the coordinates; my hidden layer is the lines I'm using to cut up my final region; and my output layer is the final region I obtain as a combination of those linear regions.

Now I can play with these input, hidden, and output layers as much as I want. For example, I can make the hidden layer bigger: say my target area looks like the one on the right, so now I need it to be a combination of three lines instead of two. No problem: my hidden layer now has three elements, and by taking linear combinations of these three lines I again get my sort of triangle on the right. What happens if my input layer is larger? If instead of two inputs I have three, that means that instead of being in the plane with x and y coordinates, I am in space with x, y, and z coordinates, and instead of lines that cut the plane in two, I have planes that cut 3-D space in two; and in the same way as before, I can combine these regions to get a region like the one in the output layer. And I can also do what is probably the most important thing: I can add more than one hidden layer. In this example I have two hidden layers: my first hidden layer draws a bunch of lines through my data; my second hidden layer combines these lines to make some more complicated regions; and my third layer, the output layer, combines the elements from the last hidden layer into a much more complicated region. So you can imagine that if I have many hidden layers, I can create the most complicated regions I want, and this is the definition of a deep neural network: a neural network with many hidden layers.
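The layered picture above can be sketched as a forward pass through a tiny network: inputs x and y, two hidden layers, and one output. The weights here are random placeholders (my own illustration, not a trained model), so the output is just some probability between 0 and 1:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    # Each node: sigmoid of a linear combination of the inputs.
    return [sigmoid(sum(w * v for w, v in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def random_layer(n_in, n_out):
    # Placeholder weights; a real network would learn these.
    return ([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [random.uniform(-1, 1) for _ in range(n_out)])

random.seed(0)
w1, b1 = random_layer(2, 3)  # first hidden layer: three "lines"
w2, b2 = random_layer(3, 2)  # second hidden layer: combines the lines
w3, b3 = random_layer(2, 1)  # output layer: the final region

h1 = layer([1.0, 2.0], w1, b1)
h2 = layer(h1, w2, b2)
out = layer(h2, w3, b3)
print(out)  # one number in (0, 1)
```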
Here's an example: this is the deep neural network used to play Go. The input is the board, it has five hidden layers, and the output tells you what to do in the game. Here's another example, a neural network for a self-driving car. The input is given by the pixels in an image, which then goes into a hidden layer (the real network has many more hidden layers; this is just a simple example), and the output comes out as an answer: turn left, turn right, go forward, or reverse. And that's it; that's the definition of a neural network and of deep learning.

Thank you for sticking with me all the way to the end; I hope you enjoyed it. If you liked it, feel free to subscribe, like, share, or comment. I'm very happy to see comments, and suggestions on what other topics you would like to see, or how I could make this better, or any questions you have. You can also email me, or find me on LinkedIn or on Twitter. And if you liked what you saw, you can see more of these classes at Udacity, where I teach. Thank you very much. Bye-bye!
Info
Channel: Luis Serrano
Keywords: machine learning, deep learning, neural networks, artificial intelligence, logistic regression, probability, math, computer science, statistics
Id: BR9h47Jtqyw
Length: 33min 20sec (2000 seconds)
Published: Mon Dec 26 2016