Backpropagation in 5 Minutes (tutorial)

Captions
Hello world, it's Siraj, and today we're going to learn how backpropagation works in five minutes.

Gradient descent is a popular optimization technique that can be used in many different types of machine learning models; it's used to optimize, or improve, the accuracy of our model's predictions. One implementation of it that is particularly popular is for neural networks. A neural network is a learning model that can make predictions. We give it some input data: the X values represent the input, and the Y values represent the expected output labels. Our network's job is to learn this mapping so that, given some arbitrary input, it can correctly predict its output label.

Our three-layer neural network will first have an input layer, where each neuron represents a different row from our input data. Then it has a hidden layer. Data will flow in one direction, from our input layer to our output, and the way it does this is by having weights that connect each neuron in one layer to every neuron in the next layer. We can initialize these weights as matrices with random values to start off with. We'll multiply each row of the input by each column of our weight matrix, and the resulting values from this operation become our hidden neuron values. We'll take each of those values and convert it to a value between 0 and 1 by applying an activation function; the type we'll use in this example is a sigmoid. So each neuron receives a set of inputs, performs a dot product, and then applies an activation function to it. We repeat the same process to calculate the output prediction: we compute the dot product of the hidden layer neurons and the next weight matrix, the one between the hidden layer and the output layer, and once again apply our activation function. The resulting value is our prediction, and the process we just completed is called forward propagation.

If we compare this to our expected output, we'll see that our prediction is incorrect. We want to find the best weight values, so that given any input they would help calculate the correct output. To do this, we first want to calculate an error value, and we want to minimize this error. If we were to graph the error value versus some random weight from our network, it would look like this: too small a weight value and our error is high, but too big and our error becomes high again. We want an optimal value for each weight in our network, where our error is smallest. Starting at some random weight value, we want to take a step in the direction of the minimum error. This direction is opposite to the gradient. If we take many steps, descending down the gradient, eventually the weight will find the minimum of the error. We call this process gradient descent.
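As a concrete illustration of the forward pass just described, here is a minimal numpy sketch of a three-layer network. The toy dataset, the choice of four hidden neurons, and the random seed are assumptions made for the example, not details taken from the video.

# Minimal sketch of the forward pass: dot product, then sigmoid, layer by layer.
import numpy as np

def sigmoid(z):
    # Squash values into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset (assumed): 4 examples with 3 input features and 1 output label each.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

np.random.seed(1)
W0 = np.random.randn(3, 4)      # weights: input layer  -> hidden layer (4 hidden neurons)
W1 = np.random.randn(4, 1)      # weights: hidden layer -> output layer

hidden = sigmoid(X @ W0)        # hidden neuron values
output = sigmoid(hidden @ W1)   # the network's prediction

error = y - output              # how far the prediction is from the expected output
print(error)

With randomly initialized weights the printed error is large, which is exactly the situation gradient descent is meant to fix.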
So how do we do this? Well, we'll need to use calculus. Let's do a little refresher on three terms from calculus we'll need to know. The derivative is the slope of the tangent line to a curve at a specific point; it measures the rate of change of a function. The derivative of a function f(x) gives you another function that returns the slope of f(x) at a point x. For example, the derivative of x squared is 2x, so at x = 2 the slope is 4. A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant. And the chain rule is the process we can use to compute derivatives of composite functions. A composite function is a function of other functions; that is, we might have one function that is composed of multiple inner or nested functions. So if you have some function f(x) and another function g(x), you can use them to form the composite function f(g(x)). The chain rule states that the derivative of f(g(x)) is equal to the derivative of f, evaluated at g(x), times the derivative of g(x).

A neural network is essentially a massive nested composite function: each layer of a feed-forward neural network can be represented as a single function whose inputs are a weight vector and the outputs of the previous layer. The purpose of backpropagation is to figure out the partial derivatives of our error function with respect to each individual weight in the network, so we can use those in gradient descent. It gives us a way of computing the error for every layer and then relating those errors to the quantity of real interest: the partial derivative of the error with respect to any weight in the network. We can use the chain rule to compute those partial derivatives, that is, the gradient of the error with respect to each weight. Backpropagation, at its core, simply consists of repeatedly applying the chain rule through all the possible paths in our network.

Our ultimate goal in training a neural network is to find the gradient of the error with respect to each weight, so that we can update the weights incrementally using gradient descent, reusing values as we compute the updates for weights that appear earlier and earlier in the network. After we have the error for the output layer, we calculate an error for each neuron in the hidden layers, going backwards layer by layer. The error for a neuron in a hidden layer is the sum of the products between the errors of the neurons in the next layer and the weights of the connections to those neurons, multiplied by the derivative of the activation function. We use those errors to calculate the variation of the weights given the current input pattern and the ideal outputs. The variation, or delta, of a weight is the product of the input neuron's output value with the error of the output neuron for that connection. This process is repeated for all the input patterns and the deltas are accumulated; at the end of a learning iteration we change the actual weights by the deltas accumulated over all the training patterns, multiplied by a learning rate, which states how fast the network converges to a result. When we run our code, we can see this process in action as the prediction gradually increases in accuracy.

Please subscribe, and for now I've got to go derive the meaning of life, so thanks for watching.
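Below is a minimal sketch of the backward pass and weight update the narration describes, using the same assumed toy network as the forward-pass snippet above. The learning rate and the number of iterations are illustrative choices, not values from the video.

# Sketch of backpropagation and gradient-descent weight updates for the toy network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(a):
    # Derivative of the sigmoid, written in terms of its output a = sigmoid(z).
    return a * (1.0 - a)

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

np.random.seed(1)
W0 = np.random.randn(3, 4)
W1 = np.random.randn(4, 1)
learning_rate = 1.0             # how fast the network converges (assumed value)

for step in range(10000):
    # Forward propagation.
    hidden = sigmoid(X @ W0)
    output = sigmoid(hidden @ W1)

    # Error at the output layer, scaled by the activation's derivative.
    output_error = y - output
    output_delta = output_error * sigmoid_derivative(output)

    # Hidden-layer error: output deltas pushed back through W1 (chain rule),
    # times the derivative of the activation at the hidden layer.
    hidden_error = output_delta @ W1.T
    hidden_delta = hidden_error * sigmoid_derivative(hidden)

    # Accumulate the deltas over all input patterns and update the weights.
    W1 += learning_rate * hidden.T @ output_delta
    W0 += learning_rate * X.T @ hidden_delta

print(output)                   # predictions gradually approach the labels

Running this loop shows the behaviour described at the end of the video: the predictions start out near random and gradually move toward the expected output labels.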
Info
Channel: Siraj Raval
Views: 218,558
Rating: 4.542779 out of 5
Keywords: backpropagation, back propagation, backpropagation example, back propagation neural network, backpropagation in neural networks, backpropagation algorithm, back propagation algorithm in neural network, neural network backpropagation, backpropagation explained, back propagation algorithm
Id: q555kfIFUCM
Length: 5min 29sec (329 seconds)
Published: Sun Apr 02 2017