Deep Learning Crash Course for Beginners

Captions
You've probably read in the news that deep learning is the secret recipe behind many exciting developments, and has made many of our world's dreams, and perhaps also nightmares, come true. Who would have thought that DeepMind's AlphaGo could beat Lee Sedol at a board game which boasts more possible moves than there are atoms in the entire universe? A lot of people, including me, never saw it coming. It seemed impossible, but it's here now. Deep learning is everywhere: it's beating physicians at diagnosing cancer, it's translating web pages in a matter of mere seconds, and it's driving the autonomous vehicles built by Waymo and Tesla.

Hi, my name is Jason, and welcome to this course on deep learning, where you'll learn everything you need to get started with deep learning in Python: how to build remarkable algorithms capable of solving complex problems that weren't possible just a few decades ago. We'll talk about what deep learning is and the difference between artificial intelligence and machine learning. I'll introduce neural networks, what they are, and just how essential they are to deep learning. You're going to learn how deep learning models train and learn, and the various types of learning associated with them: supervised, unsupervised, and reinforcement learning. We're also going to talk about loss functions, optimizers, the gradient descent algorithm, the different types of neural network architectures, and the various steps involved in deep learning.

This entire course is centered on the notion of deep learning, but what is it? Deep learning is a subset of machine learning, which in turn is a subset of artificial intelligence; unlike more traditional methods, it learns representations directly from data. Machine learning involves teaching computers to recognize patterns in data in the same way our brains do. As humans it's easy for us to distinguish between a cat and a dog, but it's much more difficult to teach a machine to do this, and we'll talk more about that later in the course.

Before I do that, I want to give you a sense of the amazing successes of deep learning in the past. In 1997 Garry Kasparov, the most successful champion in the history of chess, lost to IBM's Deep Blue, one of the first computer systems of its kind; it was the first defeat of a reigning world chess champion by a computer. In 2011 IBM's Watson competed on the game show Jeopardy! against its champions Brad Rutter and Ken Jennings and won the first prize of a million dollars. In 2015 AlphaGo, a deep learning computer program created by Google's DeepMind division, defeated Lee Sedol, an 18-time world champion, at Go, a game a googol times more complex than chess. But deep learning can do more than just beat us at board games: it finds applications anywhere from self-driving vehicles to fake news detection to even predicting earthquakes. These were astonishing moments, not only because machines beat humans at their own games, but because of the endless possibilities that they opened up. What followed such events has been a series of striking breakthroughs in artificial intelligence, machine learning and, yes, deep learning.

To put it simply, deep learning is a machine learning technique that learns features and tasks directly from data by running inputs through a biologically inspired neural network architecture. These neural networks contain a number of hidden layers through which data is processed, allowing the machine to go "deep" in its learning, making connections and weighing input for the best results. We'll go over neural networks in the next video.
So why deep learning? The problem with traditional machine learning algorithms is that no matter how complex they get, they'll always be machine-like. They need a lot of domain expertise and human intervention, and they're only capable of what they're designed for. For example, if I show you the image of a face, you will automatically recognize it's a face, but how would a computer know what this is? If we followed traditional machine learning, we'd have to manually and painstakingly define for a computer what a face is: for example, it has eyes, ears and a mouth. But then how do you define an eye or a mouth to a computer? Well, if you look at an eye, the corners are at some angle; they're definitely not 90 degrees, they're definitely not zero degrees, there's some angle in between. So we could work with that and train our classifier to recognize these kinds of lines in certain orientations. This is complicated, both for AI practitioners and for the rest of the world. That's where deep learning holds a bit of promise. The key idea in deep learning is that you can learn these features just from raw data. I can feed a bunch of images of faces to my deep learning algorithm, and it's going to develop some kind of hierarchical representation: detecting lines and edges, then using these lines and edges to detect eyes and a mouth, and composing them together to ultimately detect the face.

As it turns out, the underlying algorithms for training these models have existed for quite a long time, so why has deep learning gained popularity many decades later? For one, data has become much more pervasive. We're living in the age of big data, and these algorithms require massive amounts of data to be implemented effectively. Second, we have hardware and architectures that are capable of handling the vast amounts of data and computational power these algorithms require, hardware that simply wasn't available a few decades ago. Third, building and deploying these algorithms, or models as I'll call them, is extremely streamlined thanks to the increasing popularity of open source software like TensorFlow and PyTorch.

Deep learning models refer to the training of things called neural networks. Neural networks form the basis of deep learning, a sub-field of machine learning where the algorithms are inspired by the structure of the human brain. Just like neurons make up the brain, the fundamental building block of a neural network is also a neuron. Neural networks take in data, train themselves to recognize patterns in this data, and predict outputs for a new set of similar data. In a neural network, information propagates through three central components that form the basis of every neural network architecture: the input layer, the output layer, and several hidden layers between the two. In the next video we'll go over the learning process of a neural network.

The learning process of a neural network can be broken into two main processes: forward propagation and backpropagation. Forward propagation is the propagation of information from the input layer to the output layer. We can define our input layer as several neurons x1 through xn. These neurons connect to the neurons of the next layer through channels, and each channel is assigned a numerical value called a weight. The inputs are multiplied by the weights and their sum is sent as input to the neurons in the hidden layer, where each neuron in turn is associated with a numerical value called the bias, which is added to the input sum. This weighted sum is then passed through a non-linear function called the activation function, which essentially decides whether that particular neuron can contribute to the next layer.
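To make that concrete, here's a minimal NumPy sketch of a single forward pass. The layer sizes, the random weights and the use of a sigmoid activation are all made up for illustration; other activation functions are covered later in the course.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# hypothetical sizes: 3 inputs, 4 hidden neurons, 2 output neurons
rng = np.random.default_rng(0)
x  = np.array([0.5, -1.2, 3.0])   # input layer x1..xn
W1 = rng.normal(size=(4, 3))      # weights on the input -> hidden channels
b1 = rng.normal(size=4)           # one bias per hidden neuron
W2 = rng.normal(size=(2, 4))      # hidden -> output weights
b2 = rng.normal(size=2)

h = sigmoid(W1 @ x + b1)          # weighted sum plus bias, then activation
y = sigmoid(W2 @ h + b2)          # output layer: the highest value "wins"
print(y, y.argmax())
```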
In the output layer, which is basically a form of probability, the neuron with the highest value determines what the output finally is. So let's go over a few terms. The weight of a neuron tells us how important that neuron is: the higher the value, the more important it is in the relationship. The bias is like the neuron having an opinion in the relationship; it serves to shift the activation function to the right or to the left. If you've had some experience with high school math, you'll know that adding a scalar value to a function shifts its graph either to the left or to the right, and that is exactly what the bias does.

Backpropagation is almost like forward propagation, except in the reverse direction: information here is passed from the output layer towards the hidden layers, not from the input layer. But what information gets passed on from the output layer? Isn't the output layer supposed to be the final layer where we get the final output? Well, yes and no. Backpropagation is the reason why neural networks are so powerful; it is the reason why neural networks can learn by themselves. In the last step of forward propagation, a neural network spits out a prediction, and this prediction can be either right or wrong. In backpropagation, the neural network evaluates its own performance and checks whether the prediction is right or wrong. If it is wrong, the network uses something called a loss function to quantify the deviation from the expected output, and it is this information that's sent back to the hidden layers so the weights and biases can be adjusted and the network's accuracy increases.

Let's visualize the training process with a real example. Suppose we have a dataset that gives us the weight of a vehicle and the number of goods carried by that vehicle, and also tells us whether each vehicle is a car or a truck. We want to use this data to train a neural network to predict car or truck based on weight and goods. To start off, we initialize the neural network with random weights and biases; these can be anything, we really don't care what these values are as long as they're there. In the first entry of the dataset we have vehicle weight equal to 15 and goods equal to 2, and according to the label it's a car. We now start moving these inputs through the neural network: we take both inputs, multiply them by their weights and add a bias, and, this is where the magic happens, we run this weighted sum through an activation function. Now let's say that the output of this activation function is 0.001. This again is multiplied by a weight and added to a bias, and finally, in the output layer, we have a guess. According to this neural network, a vehicle with weight 15 and goods 2 has a greater probability of being a truck. Of course this is not true, and the neural network knows it, so we use backpropagation: we quantify the difference between the expected result and the predicted output using a loss function, then go backwards and adjust our initial weights and biases. Remember that during initialization we chose completely random weights and biases; during backpropagation these values are adjusted to better fit the prediction model. Okay, so that was one iteration through the first entry of the dataset. In the second entry we have vehicle weight 34 and goods 67.
We're going to use the same process as before: multiply the inputs by the weights and add a bias, pass the result into an activation function, and repeat until the output layer; then check the error and employ backpropagation to adjust the weights and biases. Your neural network will continue this repeated process of forward propagation, calculating the error, and then backpropagation for as many entries as there are in the dataset. The more data you give the neural network, the better it will be at predicting the right output, but there's a trade-off: train it for too long on the same data and you can end up with a problem like overfitting, which I'll discuss later in this course. But that's essentially how a neural network works: you feed it input, the network initializes with random weights and biases that are adjusted each time during backpropagation, and once the network has gone through all your data it is able to make predictions.

This learning algorithm can be summarized as follows. First, we initialize the network with random values for the network's parameters, the weights and the biases. We take a set of input data and pass it through the network. We compare the predictions obtained with the values of the expected labels and calculate the loss using a loss function. We perform backpropagation in order to propagate this loss back to each and every weight and bias. We use this propagated information to update the weights and biases of the neural network with the gradient descent algorithm, in such a way that the total loss is reduced and a better model is obtained. The last step is to continue iterating over the previous steps until we consider that we have a good enough model.
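As a rough sketch of that whole loop, here's how the car/truck example might be wired up in Keras (TensorFlow is one of the libraries mentioned earlier). The four-row dataset is hypothetical, and fit() handles the forward pass, loss calculation, backpropagation and gradient descent updates internally.

```python
import numpy as np
import tensorflow as tf

# hypothetical toy data: [vehicle weight, number of goods] -> 0 = car, 1 = truck
X = np.array([[15, 2], [34, 67], [12, 1], [40, 80]], dtype="float32")
y = np.array([0, 1, 0, 1], dtype="float32")

# a small fully connected network: weights and biases start out random,
# and fit() repeats forward propagation, loss, backpropagation and updates
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, verbose=0)

print(model.predict(np.array([[20, 3]], dtype="float32")))  # predicted probability of "truck"
```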
In this section we're going to talk about the most common terminologies used in deep learning today. Let's start off with the activation function. The activation function serves to introduce something called non-linearity into the network, and it also decides whether a particular neuron can contribute to the next layer. But how do you decide whether a neuron can fire, or activate? We had a couple of ideas, and they led to the creation of different activation functions.

The first idea is: activate the neuron if its value is above a certain threshold, and don't activate it otherwise. The activation a is "activated" (1) if y is greater than some threshold, and not activated (0) otherwise. This is essentially a step function. Great, so this gives us an activation function for a neuron, no confusion, life is perfect. Except there are some drawbacks. To understand them, think about the following case: you want to classify inputs into multiple classes, say class 1, class 2, class 3 and so on, with one such neuron per class. What happens if more than one neuron is activated? All of these neurons output a 1, so how do you decide which class the input belongs to? It's complicated, right? You'd want the network to activate only one neuron and the others to be zero; only then could you say it was classified properly. In real practice, however, it's harder to train and converge this way. It would be better if the activation were not binary and instead gave some graded value, like 75 percent activated or 16 percent activated, as in "there's a 75 percent chance it belongs to class 2", and so on. Then, if more than one neuron activates, you can pick whichever fires with the highest probability.

Okay, so maybe you'll say: I want something that gives me a more analog value rather than just "activated" or "not activated", something other than binary. Maybe you'd think of a linear function, a straight-line function where the activation is proportional to the input by a value called the slope of the line. This gives us a range of activations, so it isn't binary, and we can definitely connect a few neurons together; if more than one fires, we can take the maximum value and decide based on that. So what's the problem with this? If you're familiar with gradient descent, which I'll come to in just a bit, you'll notice that the derivative of a linear function is a constant, which makes sense because the slope isn't changing at any point: for a function f(x) = mx + c, the derivative is m. This means the gradient has no relationship whatsoever with x, and it also means that during backpropagation the adjustments made to the weights and biases don't depend on x at all, and that is not a good thing. Additionally, think about connected layers: no matter how many layers you have, if all of them are linear in nature, the activation of the final layer is nothing but a linear function of the input of the first layer. Pause for a bit and think about it. This means that an entire neural network of dozens of layers can be replaced by a single layer; remember, a combination of linear functions combined in a linear manner is still just another linear function. And this is terrible, because we've lost the ability to stack layers: no matter how much we stack, the whole network is still equivalent to a single layer with a single activation.

Next we have the sigmoid function, and if you've ever watched a video on activation functions, this is the kind of function used in the examples. The sigmoid function is defined as a(x) = 1 / (1 + e^(-x)). It looks smooth, kind of like a softened step function. What are its benefits? First things first, it is non-linear in nature, and combinations of this function are also non-linear, so now we can stack layers. What about non-binary activations? Yes, that too: this function outputs an analog activation, unlike the step function, and it also has a smooth gradient. Another advantage is that, unlike the linear function, the output of the sigmoid is bound to the range 0 to 1, compared to the negative infinity to infinity range of the linear function.
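Written out in NumPy, the two activations described so far look like this; the sample inputs are arbitrary.

```python
import numpy as np

def step(x, threshold=0.0):
    # binary activation: "activated" (1) above the threshold, 0 otherwise
    return np.where(x > threshold, 1.0, 0.0)

def sigmoid(x):
    # smooth, non-linear, outputs an analog value between 0 and 1
    return 1 / (1 + np.exp(-x))

z = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])
print(step(z))     # [0. 0. 0. 1. 1.]
print(sigmoid(z))  # roughly [0.018 0.378 0.5 0.622 0.982]
```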
So we have activations bound in a range, which won't blow up the activations, and that's great; sigmoid functions are among the most widely used activation functions today. But life isn't always rosy, and sigmoids too have their share of disadvantages. If you look closely, between x = -2 and x = 2 the curve is very steep: any small change in x in that region causes y to change drastically. Towards either end of the function, on the other hand, the y values respond very little to changes in x; the gradient in those regions is really, really small, almost zero. This gives rise to the vanishing gradient problem: if the input to the activation function is very large or very small, the sigmoid squishes it down to a value between 0 and 1, and the gradient of the function becomes tiny. You'll see why this is a huge problem when we talk about gradient descent.

Another activation function that is used is the tanh function. This looks very similar to the sigmoid; in fact, mathematically it is a scaled and shifted sigmoid. Like the sigmoid, it has the characteristics we discussed above: it is non-linear in nature, so we can stack layers, and it is bound to the range -1 to 1, so there's no worrying about activations blowing up. The derivative of the tanh function, however, is steeper than that of the sigmoid, so deciding between sigmoid and tanh will really depend on your requirement for gradient strength. Like the sigmoid, tanh is a very popular and widely used activation function, and yes, like the sigmoid, tanh also suffers from the vanishing gradient problem.

The rectified linear unit, or ReLU, function is defined as a(x) = max(0, x). At first glance this looks like a linear function, right? The graph is linear in the positive axis. But let me tell you, ReLU is in fact non-linear in nature, and combinations of ReLU are also non-linear, so we can stack layers. However, unlike the previous two functions we discussed, it is not bounded: the range of the ReLU is 0 to infinity, which means there is a chance of blowing up the activation. Another point I'd like to discuss here is sparsity of activation. Imagine a big neural network with lots of neurons: using a sigmoid or a tanh will cause almost all the neurons to fire in an analog way, which means almost all activations will be processed to describe the network's output. In other words, the activation is dense, and that is costly. Ideally we want only a few neurons in the network to activate, making the activations sparse and efficient. Here's where the ReLU comes in. Imagine a network with randomly initialized weights where almost 50 percent of the network yields zero activation because of the characteristic of ReLU: it outputs zero for negative values of x. This means that only about 50 percent of the neurons fire, a sparse activation, making the network lighter. But when life gives you an apple, it comes with a little worm inside. Because of that horizontal line in ReLU for negative values of x, the gradient is zero in that region, which means that during backpropagation the weights will not get adjusted during descent. Neurons that go into that state stop responding to variations in the error, simply because the gradient is zero and nothing changes. This is called the dying ReLU problem.
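And here are tanh and ReLU sketched the same way, with arbitrary sample values; the comments point at where their gradients cause trouble.

```python
import numpy as np

def tanh(x):
    # like a rescaled sigmoid, bound to the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # max(0, x): zero for negative inputs, linear for positive ones
    return np.maximum(0.0, x)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tanh(z))  # saturates near -1 and +1 for large |x| -> tiny gradients there (vanishing gradient)
print(relu(z))  # [0. 0. 0. 1. 5.] -> negative inputs give exactly 0 (sparse, but neurons can "die")
```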
This problem can cause several neurons to just die and stop responding, making a substantial part of the network passive rather than doing what we want it to do. There are workarounds for this. One way is to simply turn the horizontal line into a non-horizontal one by adding a small slope for negative values of x; usually the slope is a small value such as 0.001. This new version of the ReLU is called leaky ReLU, and the main idea is that the gradient should never be zero. One major advantage of the ReLU is that it's less computationally expensive than functions like tanh and sigmoid, because it involves simpler mathematical operations, and that's a really good point to consider when you're designing your own deep neural networks.

Great, so now the question is: which activation function should you use? Because of the advantages that ReLU offers, does that mean you should use ReLU for everything, or could you consider sigmoid and tanh? Well, both. When you know that the function you're trying to approximate has certain characteristics, you should choose an activation function that will approximate it faster, leading to a faster training process. For example, a sigmoid works well for binary classification problems, because approximating a classifier function as a combination of sigmoids is easier than doing it with, say, ReLU; this leads to faster training and faster convergence. You can use your own custom functions too. If you don't know the nature of the function you're trying to learn, I'd suggest you start with ReLU and work backwards from there.

Before we move on to the next section, I want to talk about why we use non-linear activation functions as opposed to linear ones. If you recall, in my definition of activation functions I mentioned that they serve to introduce something called non-linearity into the network. For all intents and purposes, introducing non-linearity simply means that your activation function must be non-linear, that is, not a straight line. Mathematically, linear functions are polynomials of degree one which, when graphed in the x-y plane, are straight lines inclined to the x-axis at a certain value we call the slope of the line. Non-linear functions are polynomials of degree greater than one, and when graphed they don't form straight lines; they are curved. If we use linear activation functions to model our data, then no matter how many hidden layers our network has, it will always be equivalent to a single-layer network, and in deep learning we want to be able to model every type of data without being restricted, as would be the case if we used linear functions.

We discussed previously, in the learning process of neural networks, that we start with random weights and biases, the neural network makes a prediction, this prediction is compared against the expected output, and the weights and biases are adjusted accordingly. Loss functions are the reason we're able to calculate that difference. Really simply, a loss function is a way to quantify the deviation of the output predicted by the neural network from the expected output; it's as simple as that, nothing more, nothing less. There are plenty of loss functions out there. For regression we have squared error loss, absolute error loss and Huber loss; for binary classification we have binary cross-entropy and hinge loss; for multi-class classification problems we have multi-class cross-entropy and the Kullback-Leibler divergence loss, and so on. The choice of loss function really depends on what kind of project you're working on; different projects require different loss functions.
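Two of the losses just listed, written out in NumPy to show exactly what they measure; the example labels and predictions are arbitrary.

```python
import numpy as np

def mse(y_true, y_pred):
    # squared error loss, commonly used for regression
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # loss for binary classification; y_pred are predicted probabilities in (0, 1)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(mse(y_true, y_pred))                   # 0.07
print(binary_cross_entropy(y_true, y_pred))  # ~0.28
```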
Now, I don't want to go any further into loss functions right now; we'll do that in the optimization section, because that's really where loss functions are utilized.

In the previous section we dealt with loss functions, which are mathematical ways of measuring how wrong the predictions made by a neural network are. During the training process we tweak and change the parameters, the weights, of the model to try and minimize that loss function and make our predictions as correct and optimized as possible. But how exactly do you do that? How do you change the parameters of your model, by how much, and when? We have the ingredients, so how do we make the cake? This is where optimizers come in. They tie together the loss function and the model parameters, the weights and biases, by updating the network in response to the output of the loss function. In simpler terms, optimizers shape and mold your model into a more accurate one by adjusting the weights and biases, and the loss function is their guide: it tells the optimizer whether it's moving in the right or the wrong direction.

To understand this better, imagine that you have just scaled Mount Everest and now you decide to descend the mountain blindfolded. It's impossible to know which direction to go in: you could either go up, which is away from your goal, or go down, which is towards your goal. But as you start taking steps, using your feet you'll be able to gauge whether you're going up or down. In this analogy, you resemble the neural network, going down is your goal of minimizing the error, and your feet resemble the loss function: they measure whether you're going the right way or the wrong way. Similarly, it's impossible to know what your model's weights should be right from the start, but with some trial and error based on the loss function you can get there eventually.

We now come to gradient descent, often called the granddaddy of optimizers. Gradient descent is an iterative algorithm that starts at a random point on the loss function and travels down its slope in steps until it reaches the lowest point, or minimum, of the function. It is the most popular optimizer we use nowadays; it's fast, robust and flexible. Here's how it works. First, we calculate what a small change in each individual weight would do to the loss function. Then we adjust each individual weight based on its gradient, that is, we take a small step in the determined direction. The last step is to repeat the first two steps until the loss function gets as low as possible.

I want to dwell on this notion of a gradient. The gradient of a function is the vector of its partial derivatives with respect to all of its independent variables, and it always points in the direction of the steepest increase of the function. Suppose we have a graph with the loss on the y-axis and the value of a weight on the x-axis, and a little data point that corresponds to the randomly initialized weight. To minimize the loss, that is, to get this data point to the minimum of the function, we move along the negative gradient, since we want the steepest decrease in the function. This process happens iteratively until the loss is as small as possible, and that's gradient descent in a nutshell.

When dealing with high-dimensional datasets, that is, a lot of variables, it's possible you'll find yourself in an area where it seems like you've reached the lowest possible value for your loss function, but in reality it's just a local minimum.
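Here's a tiny self-contained sketch of those steps on a made-up one-dimensional loss, L(w) = (w - 3)^2, so you can watch the weight walk down the slope; the step size used here is the learning rate, which is discussed next.

```python
# gradient descent on a toy loss L(w) = (w - 3)^2, whose minimum is at w = 3
def loss(w):
    return (w - 3) ** 2

def gradient(w):
    # derivative dL/dw
    return 2 * (w - 3)

w = -5.0               # randomly chosen starting weight
learning_rate = 0.1    # scales each step
for step in range(50):
    w -= learning_rate * gradient(w)   # step in the direction of the negative gradient
print(w)               # converges towards 3.0
```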
To avoid getting stuck in a local minimum, we make sure we use the proper learning rate. Changing our weights too fast by adding or subtracting too much, that is, taking steps that are too large, can hinder our ability to minimize the loss function: we don't want to make a jump so large that we skip over the optimal value for a given weight. To make sure this doesn't happen, we use a variable called the learning rate. This is usually just a small number, like 0.001, that we multiply the gradients by to scale them, which ensures that any changes we make to our weights are pretty small. In math talk, taking steps that are too large can mean that the algorithm will never converge to an optimum. At the same time, we don't want to take steps that are too small, because then we might never end up with the right values for our weights; in math talk, steps that are too small might lead to our optimizer converging on a local minimum of the loss function but never the absolute minimum. For a simple summary, just remember that the learning rate ensures that we change our weights at the right pace, not making any changes that are too big or too small.

Instead of calculating the gradients over all your training examples on every pass of gradient descent, it's sometimes more efficient to use only a subset of the training examples each time. Stochastic gradient descent is an implementation that uses either batches of examples at a time or random examples on each pass. Stochastic gradient descent also uses the concept of momentum, which accumulates the gradients of past steps to dictate what might happen in the next steps. And because we don't include the entire training set, SGD is less computationally expensive. It's difficult to overstate how popular gradient descent really is; backpropagation is basically gradient descent implemented on a network.

There are other optimizers based on gradient descent that are used today. AdaGrad adapts the learning rate to individual features, which means that some of the weights in your model will have different learning rates than others. This works really well for sparse datasets where a lot of input examples are missing. AdaGrad has a major issue, though: the adaptive learning rate tends to get really, really small over time. RMSProp is a special version of AdaGrad developed by Professor Geoffrey Hinton; instead of letting all the gradients accumulate, it only accumulates gradients in a fixed window. RMSProp is similar in spirit to other adaptive optimizers, such as AdaDelta, that seek to solve some of the issues AdaGrad leaves open. Adam stands for adaptive moment estimation and is another way of using past gradients to calculate the current gradient. Adam also utilizes the concept of momentum, which is basically our way of telling the neural network whether we want past changes to affect the new change, by adding fractions of the previous gradients to the current one. This optimizer has become pretty widespread and is practically the accepted default for training neural networks. It's easy to get lost in the complexity of some of these newer optimizers; just remember that they all have the same goal, minimizing the loss function, and trial and error will get you there.
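For reference, this is roughly how those optimizers are instantiated in Keras (TensorFlow); the learning rates shown are just common defaults, not recommendations from the video.

```python
import tensorflow as tf

# the optimizers discussed above, as they appear in Keras
sgd     = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # SGD with momentum
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)            # per-feature adaptive learning rates
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)           # accumulates gradients in a moving window
adam    = tf.keras.optimizers.Adam(learning_rate=0.001)              # adaptive moment estimation

# any of these can be passed to model.compile(optimizer=..., loss=...)
```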
You may have heard me use the word "parameters" quite a bit, and often this word is confused with the term "hyperparameters". In this video I'm going to outline the basic difference between the two. A model parameter is a variable that is internal to the neural network and whose value can be estimated from the data itself. Parameters are required by the model when making predictions, their values define the skill of the model on your problem, they are estimated directly from the data, and they are usually not set manually by the practitioner; when you save your model, you are essentially saving its parameters. Parameters are key to machine learning algorithms, and examples include the weights and the biases. A hyperparameter, on the other hand, is a configuration that is external to the model and whose value cannot be estimated from the data. There's no formula for finding the best value of a model hyperparameter for a given problem; we may use rules of thumb, copy values used on other problems, or search for the best value by trial and error. When a machine learning algorithm is tuned for a specific problem, such as when you're using grid search or random search, you are in fact tuning the hyperparameters of the model in order to discover the parameters that result in more skillful predictions. Model hyperparameters are often referred to as parameters, which can make things confusing, so a good rule of thumb is this: if you have to specify it manually, it is probably a hyperparameter; parameters are inherent to the model itself. Some examples of hyperparameters include the learning rate for training a neural network, the C and sigma hyperparameters for support vector machines, and the k in k-nearest neighbors.

We need terminologies like epochs, batch size and iterations only when the data is too big, which happens all the time in machine learning, and we can't pass all of it to the computer at once. To overcome this problem we divide the dataset into smaller chunks, give them to our computer one by one, and update the weights of the neural network at the end of every step to fit it to the data given. One epoch is when the entire dataset is passed forward and backward through the network once. In the majority of deep learning models we use more than one epoch. I know it doesn't make sense at first: why do we need to pass the entire dataset through the same neural network many times? Passing the entire dataset through the network only once is like reading the lyrics of a song once; you won't be able to remember the whole song immediately, you have to reread the lyrics a few more times before you can say you know it by heart. The same is true of a neural network: we pass the dataset through it multiple times so that it is able to generalize better. Gradient descent is an iterative process, and updating the parameters with backpropagation in a single pass, one epoch, is not enough. As the number of epochs increases, the parameters are adjusted more and more, leading to a better-performing model. But too many epochs can spell disaster and lead to something called overfitting, where the model has essentially memorized the patterns in the training data and performs terribly on data it has never seen before. So what is the right number of epochs? Unfortunately there is no right answer; the answer is different for different datasets.

Sometimes your dataset can include millions of examples, and passing the entire dataset at once becomes extremely difficult, so what we do instead is divide the dataset into a number of batches rather than passing it all in at once. The total number of training examples present in a single batch is called the batch size. Iterations are the number of batches needed to complete one epoch; note that the number of batches is equal to the number of iterations for one epoch. Let's say we have a dataset of 34,000 training examples: if we divide the dataset into batches of 500, it will take 68 iterations to complete one epoch.
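That arithmetic, spelled out in a couple of lines of Python:

```python
# the example above: 34,000 training examples split into batches of 500
num_examples = 34_000
batch_size   = 500

iterations_per_epoch = num_examples // batch_size
print(iterations_per_epoch)            # 68 iterations to complete one epoch

epochs = 10
print(iterations_per_epoch * epochs)   # 680 weight updates in total
```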
Well, I hope that gives you some sense of the very basic terminologies used in deep learning. Before we move on, I do want to mention this, because you will see it a lot in deep learning: you'll often have a bunch of different choices to make, like how many hidden layers should I choose, or which activation function should I use, and where. To be honest, there are no clear-cut guidelines as to what your choice should always be, and that's the fun part about deep learning. It's extremely difficult to know at the beginning what the right combination is for your project, and what works for me might not work for you. My suggestion is to dabble along with the material shown, try various combinations, and see what works best for you; ultimately, that's a learning process, pun intended. Throughout this course I'll give you quite a bit of intuition as to what's popular, so that when it comes to building a deep learning project you won't find yourself lost.

In this section we're going to talk about the different types of learning. These are machine learning concepts, but they extend to deep learning as well. We'll go over supervised learning, unsupervised learning and reinforcement learning.

Supervised learning is the most common sub-branch of machine learning today; typically, if you're new to machine learning, your journey will begin with supervised learning algorithms. Let's explore what these are. Supervised machine learning algorithms are designed to learn by example. The name "supervised learning" originates from the idea that training this type of algorithm is almost like having a human supervise the whole process. In supervised learning we train our models on well-labeled data: each example is a pair consisting of an input object, typically a vector, and a desired output value, also called a supervisory signal. During training, a supervised learning algorithm searches for patterns in the data that correlate with the desired outputs. After training, it takes in new, unseen inputs and determines which label the new inputs should be classified as, based on the prior training data. The objective of a supervised learning model is to predict the correct label for newly presented input data. At its most basic, a supervised learning algorithm can simply be written as y = f(x), where y is the predicted output determined by a mapping function f that assigns a class to an input value x. The function used to connect input features to a predicted output is created by the machine learning model during training.

Supervised learning can be split into two subcategories: classification and regression. During training, a classification algorithm is given data points with an assigned category; its job is then to take an input value and assign it to the class or category it fits into, based on the training data provided. The most common example of classification is determining whether an email is spam or not. With two classes to choose from, spam or not spam, this is called a binary classification problem. The algorithm is given training data with emails that are both spam and not spam, and the model finds the features within the data that correlate with either class and creates a mapping function; then, when provided with an unseen email, the model uses this function to determine whether or not the email is spam. Another example of a classification problem is the MNIST handwritten digits dataset, where the inputs are images of handwritten digits (pixel data) and the output is the class label for the digit the image represents, that is, the numbers zero to nine. There are numerous algorithms to solve classification problems, and the right one depends on the data and the situation. Here are a few popular classification algorithms: linear classifiers, support vector machines, decision trees, k-nearest neighbors and random forest.
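As a concrete taste of one of those algorithms, here's a k-nearest neighbors classifier in scikit-learn; the library choice and the four tiny training examples are my own illustration, not from the video.

```python
from sklearn.neighbors import KNeighborsClassifier

# hypothetical 2-feature examples with binary labels (say, spam = 1, not spam = 0)
X_train = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3]]
y_train = [1, 1, 0, 0]

clf = KNeighborsClassifier(n_neighbors=3)   # k is a hyperparameter, set manually
clf.fit(X_train, y_train)
print(clf.predict([[0.15, 0.85]]))          # -> [1], its nearest neighbours are mostly spam
```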
Regression is a predictive statistical process where the model attempts to find the important relationship between dependent and independent variables. The goal of a regression algorithm is to predict a continuous number, such as sales, income or test scores. The equation for a basic multiple linear regression can be written as y = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b, where the x_i represent the features of the data and the w_i and b are parameters developed during training. For simple linear regression models with only one feature, the formula reduces to y = wx + b, where w is the slope, x is the single feature and b is the y-intercept; looks familiar, right? For simple regression problems such as this, the model's predictions are represented by the line of best fit. For models using two features a plane is used, and for models with more than two features a hyperplane is used.

Imagine we want to determine a student's test grade based on how many hours they studied the week of the test. Say the plotted data with the line of best fit looks like this: there's a clear positive correlation between hours studied, the independent variable, and the student's final test score, the dependent variable. A line of best fit can be drawn through the data points to show the model's prediction when given a new input. Say we wanted to know how well a student would do with five hours of studying; we can use the line of best fit to predict their test score based on other students' performances. Another example of a regression problem is the Boston house prices dataset, where the inputs are variables that describe a neighborhood and the output is the house price in dollars. There are many different types of regression algorithms; the three most common are linear regression, lasso regression and multivariate regression.
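Here's a minimal sketch of that hours-studied example using NumPy's least-squares polynomial fit to get the line of best fit y = wx + b; the handful of data points is invented purely for illustration.

```python
import numpy as np

# hypothetical data: hours studied vs. test score
hours  = np.array([1, 2, 3, 4, 6, 8], dtype=float)
scores = np.array([50, 55, 63, 68, 80, 90], dtype=float)

# fit the simple linear model y = w * x + b (degree-1 polynomial = line of best fit)
w, b = np.polyfit(hours, scores, deg=1)
print(w, b)           # slope and y-intercept learned from the data

# predict the score for a student who studies 5 hours
print(w * 5 + b)
```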
Supervised learning finds applications in classification and regression problems like biometrics, such as fingerprint, iris and face recognition in smartphones, object recognition, spam detection and speech recognition.

Unsupervised learning is a branch of machine learning used to manifest underlying patterns in data, and it is often used in exploratory data analysis. Unlike supervised learning, unsupervised learning does not use labeled data but instead focuses on the data's features. Labeled training data has a corresponding output for each input; the goal of an unsupervised learning algorithm is instead to analyze the data and find important features in it. Unsupervised learning will often find subgroups or hidden patterns within the dataset that a human observer might not pick up on, and this is extremely useful, as we'll soon find out.

Unsupervised learning can be of two types: clustering and association. Clustering is the simplest and among the most common applications of unsupervised learning. It is the process of grouping the given data into different clusters or groups. Clusters will contain data points that are as similar as possible to each other, and as dissimilar as possible to data points in other clusters. Clustering helps find underlying patterns within the data that might not be noticeable to a human observer, and it can be broken down into partitional clustering and hierarchical clustering. Partitional clustering refers to a set of clustering algorithms where each data point in the dataset can belong to only one cluster. Hierarchical clustering finds clusters through a system of hierarchies, where every data point can belong to multiple clusters; some clusters will contain smaller clusters within them, and this hierarchy can be organized as a tree diagram. Some of the more commonly used clustering algorithms are k-means, expectation maximization, and hierarchical cluster analysis, or HCA.

Association, on the other hand, attempts to find relationships between different entities. The classic example of association rules is market basket analysis: using a database of supermarket transactions to find items that are frequently bought together. For example, a person who buys potatoes and burgers usually buys beer, or a person who buys tomatoes and pizza cheese might also want pizza bread, and so on.

Unsupervised learning finds applications almost everywhere. For example, Airbnb, which helps people find stays and experiences all over the world, uses unsupervised learning algorithms: a potential client queries their requirements, and Airbnb learns these patterns and recommends stays and experiences that fall under the same group or cluster; a person looking for houses in San Francisco, for instance, might not be interested in finding houses in Boston. Amazon also uses unsupervised learning to learn a customer's purchases and recommend products that are frequently bought together, which is an example of association rule mining. Credit card fraud detection is another application: an unsupervised learning algorithm learns the various usage patterns of a user's credit card, and if the card is used in ways that don't match that behavior, an alarm is generated and the activity may be flagged as fraud; in some cases your bank might call you to confirm whether it was you using the card or not.
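A quick sketch of the clustering idea from above, using scikit-learn's k-means on a tiny made-up set of unlabeled points; again, the library choice and the data are illustrative assumptions, not from the video.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical unlabeled data with two obvious groups
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# ask k-means to partition the data into 2 clusters (no labels involved)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)                   # e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the centre of each discovered cluster
```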
Reinforcement learning is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. Like supervised learning, it uses a mapping between input and output, but unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior. Compared with unsupervised learning, reinforcement learning differs in its goals: while the goal in unsupervised learning is to find similarities and differences between data points, in reinforcement learning the goal is to find a suitable action model that maximizes the agent's total cumulative reward. Reinforcement learning refers to goal-oriented algorithms which learn how to attain a complex objective, or how to maximize along a particular dimension over many steps; for example, they can maximize the points won in a game over many moves. Reinforcement learning algorithms can start from a blank slate and, under the right conditions, achieve superhuman performance. Like a pet incentivized by scolding and treats, these algorithms are penalized when they make the wrong decisions and rewarded when they make the right ones; that is the reinforcement.

Reinforcement learning is usually modeled as a Markov decision process, although other frameworks like Q-learning are used. Some key terms that describe the elements of a reinforcement learning problem are: the environment, the physical world in which the agent operates; the state, which represents the current situation of the agent; the reward, the feedback received from the environment; the policy, the method that maps the agent's state to the agent's actions; and finally the value, the future reward an agent will receive by taking an action in a particular state. A reinforcement learning problem is best explained through games. Take the game of Pac-Man, where the goal of the agent, Pac-Man, is to eat the food in the grid while avoiding the ghosts on its way. The grid world is the interactive environment for the agent; Pac-Man receives a reward for eating food and a punishment if it gets killed by a ghost, that is, it loses the game; the states are the locations of Pac-Man in the grid world; and the total cumulative reward is Pac-Man winning the game. Reinforcement learning finds applications in robotics, business strategy planning, traffic light control, web system configuration, and aircraft and robot motion control.

A central problem in deep learning is how to make an algorithm that will perform well not just on the training data but also on new inputs. One of the most common challenges you'll face when training models is the problem of overfitting, a situation where your model performs exceptionally well on training data but not on testing data. Say I have a dataset graphed on the x-y plane, and I want to construct a model that best fits it. One thing I could do is draw a line with some random slope and intercept. Evidently this isn't the best model; in fact this is called underfitting, because the line doesn't fit the data well, it underestimates the dataset. Instead, what we could do is draw a curve that looks something like this, winding through almost every training point. Now this really fits our data the best, but this is overfitting. Remember that while training we show our network the training data, and once that's done we'd expect it to be almost close to perfect. The problem with this curve is that, although it is probably the best fit for this graph, it is the best fit only if you're considering your training data. What your network has done here is memorize the patterns in the training data, so it won't give accurate predictions at all on data it's never seen before. And this makes sense: instead of learning patterns generally, so as to perform well on both training and new testing data, our network has memorized the patterns only in the training data, so it obviously won't perform well on new data it's never seen before. This is the problem of overfitting: it fitted too much. By the way, a curve in between the two would be the more accurate kind of fit; it's not perfect, but it'll do well on both training and new testing data with sizeable accuracy.

There are a couple of ways to tackle overfitting. The most interesting type of regularization is dropout. It produces very good results and is consequently the most frequently used regularization technique in the field of deep learning. To understand dropout, say we have a neural network with two hidden layers. What dropout does is that at every iteration it randomly selects some nodes and removes them, along with all of their incoming and outgoing connections. So each iteration uses a different set of nodes, and this results in a different set of outputs.
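In Keras, dropout is just a layer placed between the layers it should act on; here is a minimal sketch, with arbitrary layer sizes and an arbitrary drop rate of 0.5.

```python
import tensorflow as tf

# a small network with dropout between the hidden layers;
# rate=0.5 means each hidden neuron is dropped with probability 0.5 on every
# training iteration (dropout is automatically disabled at prediction time)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```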
So why do these models perform better? They usually perform better than a single model because they capture more randomness, memorize less of the training data, and are hence forced to generalize better and build a more robust predictive model.

Sometimes the best way to make a deep learning model generalize better is to train it on more data. In practice, the amount of data we have is limited, and one way to get around this problem is to create fake data and add it to the training set. For some deep learning tasks it is reasonably straightforward to create new fake data. This approach is easiest for classification: a classifier needs to take a complicated, high-dimensional input x and summarize it with a category identity y, which means the main task facing a classifier is to be invariant to a wide variety of transformations. We can generate new (x, y) pairs easily just by applying transformations to the x inputs in our training set. Dataset augmentation has been a particularly effective technique for one specific classification problem: object recognition. Images are high-dimensional and include an enormous range of factors of variation, many of which can easily be simulated. Operations like translating the training images a few pixels in each direction can often greatly improve generalization, and many other operations, such as rotating or scaling the image, have also proved quite effective. You must be careful, though, not to apply transformations that would change the correct class. For example, in optical character recognition tasks that require recognizing the difference between a "b" and a "d", or between a six and a nine, horizontal flips and 180-degree rotations are not appropriate ways of augmenting the dataset.

When training large models with sufficient representational capacity to overfit the task, we often observe that the training error decreases steadily over time, but the error on the validation set begins to rise again. This means we can obtain a model with better validation set error, and thus hopefully better test set error, by stopping training at the point where the error on the validation set starts to increase. This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning today, and its popularity is due to both its effectiveness and its simplicity.
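A sketch of early stopping as it's commonly wired up in Keras; the monitored metric, the patience value and the commented-out fit() call are illustrative choices, not from the video.

```python
import tensorflow as tf

# stop training as soon as the validation loss stops improving,
# keeping the weights from the best epoch seen so far
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",        # watch the error on the validation set
    patience=3,                # allow 3 epochs without improvement before stopping
    restore_best_weights=True,
)

# hypothetical usage:
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])

# image dataset augmentation can be sketched similarly, e.g. with
# tf.keras.layers.RandomFlip("horizontal") or RandomRotation(...) layers
```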
In this section I'm going to introduce the three most common types of neural network architectures used today: fully connected feedforward neural networks, recurrent neural networks, and convolutional neural networks.

The first type of neural network architecture we're going to discuss is the fully connected feedforward neural network. By fully connected, I mean that each neuron in the preceding layer is connected to every neuron in the subsequent layer, without any connections backwards; there are no cycles or loops in the connections of the network. As I mentioned previously, each neuron in a neural network contains an activation function that changes the output of the neuron given its input, and there are several types of activation functions that can change this input-to-output relationship to make a neuron behave in a variety of ways. Some of the most well-known activation functions are the linear function, a straight line that essentially multiplies the input by a constant value; the sigmoid function, which ranges from zero to one; the hyperbolic tangent, or tanh, function, which ranges from negative one to positive one; and the rectified linear unit, or ReLU, function, which is a piecewise function that outputs zero if the input is less than a certain value, and a linear multiple of the input if it is greater than that value. Each type of activation function has its pros and cons, so we use them in different layers of a deep neural network based on the problem each is designed to solve. The last three activation functions are referred to as non-linear functions, because the output is not a linear multiple of the input, and non-linearity is what allows deep neural networks to model complex functions.

Using everything we've learned so far, we can create a wide variety of fully connected feedforward neural networks: networks with various inputs, various outputs, various numbers of hidden layers and neurons per hidden layer, and a variety of activation functions. These numerous combinations allow us to create a variety of powerful deep neural networks that can solve a wide array of problems. The more neurons we add to each hidden layer, the wider the network becomes, and the more hidden layers we add, the deeper the network becomes. However, each neuron we add increases the complexity, and thus the computational resources necessary to train the network. This increase in complexity isn't linear in the number of neurons we add, so it leads to an explosion in complexity and training time for large neural networks; that's the trade-off you need to consider when you're building deep neural networks.

All the neural networks we've discussed so far are known as feedforward neural networks: they take in a fixed-size input and give you a fixed-size output, and that's all they do. That's what we expect neural networks to do, take in an input and give a sensible output, but as it turns out these plain vanilla neural networks aren't able to model every single problem we have today. To better understand this, consider an analogy. Suppose I show you the picture of a ball, a round spherical ball, that was moving in space in some direction; I've taken a snapshot of the ball at some time t. Now I want you to predict the position of the ball in, say, two or three seconds. You're probably not going to give me an accurate answer. Now let's look at another example. Suppose I walk up to you and say the word "dog". You'll never understand my statement, because, well, it doesn't make sense: there are a trillion combinations of sentences using the word dog, and among these trillion combinations I'm expecting you to guess what I'm trying to tell you. What these two examples have in common is that they don't make sense on their own. In the first case I'm expecting you to predict the next position in time, and in the second I'm expecting you to understand what I mean by "dog"; neither can be understood and interpreted unless some information about the past is supplied. Now, in the first example, if I give you the previous position states of the ball and then ask you to predict its future trajectory, you'll be able to do this accurately. And in the second case, if I give you a full sentence, "I have a dog", it makes sense, because now you understand that out of the trillion possible combinations involving a dog, my original intent was for you to understand that I have a dog. Why did I give you these examples, and how does this apply to neural networks? In the introduction I said vanilla neural networks can't model every single situation or problem that we have, and the biggest problem, it turns out, is that plain vanilla feedforward neural networks cannot model sequential data.
Sequential data is data in a sequence: for example, a sentence is a sequence of words, and a ball moving in space is a sequence of all of its position states. In the sentence I showed you, you understood each word based on your understanding of the previous words. This is called sequential memory: you were able to understand a data point in the sequence through your memory of the previous data points in that sequence. Traditional neural networks can't do this, and it seems like a major shortcoming.

One of the disadvantages of modeling sequences with traditional neural networks is that they don't share parameters across time. Take, for example, these two sentences: "On Tuesday it was raining" and "It was raining on Tuesday." They mean the same thing, although the details appear in different parts of the sequence. When we feed these sentences into a feedforward neural network for a prediction task, the model will assign different weights to "on Tuesday" and "it was raining" at each moment in time, so things the model learns about the sequence won't transfer if they appear at different points in the sequence. Sharing parameters gives the network the ability to look for a given feature everywhere in the sequence, rather than just in a certain area. So to model sequences, we need a specific learning framework that is able to deal with variable-length sequences, maintain sequence order, keep track of long-term dependencies rather than cutting the input data too short, and share parameters across the sequence so as to not re-learn things. And that's where recurrent neural networks come in.

RNNs are a type of neural network architecture that uses something called a feedback loop in the hidden layer. Unlike feedforward neural networks, the recurrent neural network, or RNN, can operate effectively on sequences of data with variable input length. This is how an RNN is usually represented: the little loop here is called the feedback loop. Sometimes you'll find RNNs depicted unrolled over time, like this. The first pass represents the network at the first time step: the hidden node h1 uses the input x1 to produce the output y1, which is exactly what we've seen with basic feedforward neural networks. At the second time step, however, the hidden node at the current time step, h2, uses both the new input x2 and the state from the previous time step, h1, as input to make its new prediction. This means that a recurrent neural network uses knowledge of its previous state as input for its current prediction, and we can repeat this process for an arbitrary number of steps, allowing the network to propagate information through its hidden state across time. This is almost like giving a neural network a short-term memory. RNNs have this abstract concept of sequential memory, and because of it we're able to model sequential data that standalone neural networks aren't able to model. Recurrent neural networks remember their past, and their decisions are influenced by what they have learned from the past. Basic feedforward networks "remember" things too, but only things they learned during training; for example, an image classifier learns what a "3" looks like during training and then uses that knowledge to classify things in production.

So how do we train an RNN? It's almost the same as training a basic fully connected feedforward network, except that the backpropagation algorithm is applied for every data point in the sequence, rather than once for the whole sequence. This algorithm is sometimes called backpropagation through time, or the BPTT algorithm.
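Here's a bare-bones NumPy sketch of that recurrence: one RNN cell whose weights are shared across every time step, with the hidden state carrying information forward; the sizes and inputs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8

# shared parameters, reused at every time step (parameter sharing across time)
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b   = np.zeros(hidden_size)

def rnn_forward(sequence):
    # the hidden state h carries information from previous time steps forward
    h = np.zeros(hidden_size)
    for x_t in sequence:                       # one element of the sequence at a time
        h = np.tanh(W_x @ x_t + W_h @ h + b)   # new state depends on the input AND the old state
    return h                                   # a summary of the whole sequence

sequence = [rng.normal(size=input_size) for _ in range(5)]   # hypothetical 5-step input
print(rnn_forward(sequence))
```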
To really understand how this works, imagine we're creating a recurrent neural network to predict the next letter a person is likely to type, based on the previous letters they've already typed. The letter the user just typed is quite important to predicting the new letter, but all the previous letters are very important to this prediction as well. At the first time step, say the user types the letter f, so the network might predict that the next letter is an e, based on all of the previous training examples that contained "fe". At the next time step the user types the letter r, so our network uses both the new letter r plus the state of the first hidden neuron to compute the next prediction, l; the network predicts this because of the high frequency of occurrences of "fel" in our training data set. Adding the letter a might predict the letter t, and adding an n would predict the letter k, which would match the word the user intended to type: frank.

There is, however, an issue with RNNs known as short-term memory. Short-term memory is caused by the infamous vanishing and exploding gradient problems. As the RNN processes more words, it has trouble retaining information from previous steps, kind of like our memory: if you're given a long sequence of numbers, like the digits of pi, and you try reading them out, you're probably going to forget the initial few digits, right?

Short-term memory and the vanishing gradient are due to the nature of backpropagation, the algorithm used to train and optimize neural networks. After forward propagation, or the forward pass, the network compares its prediction to the ground truth using a loss function, which outputs an error value, an estimate of how poorly the network is performing. The network uses that error value to perform backpropagation, which calculates the gradients for each node in the network. The gradient is the value used to adjust the network's internal weights, allowing the network to learn: the bigger the gradient, the bigger the adjustments, and vice versa. Here's where the problem lies. When performing backpropagation, each node in a layer calculates its gradient with respect to the gradients in the layer before it in the backward pass, so if the adjustments to the layer before it are small, then the adjustments to the current layer will be even smaller. This causes gradients to shrink exponentially as they are backpropagated down, and the earlier layers fail to do any learning because their internal weights are barely being adjusted due to extremely small gradients. That is the vanishing gradient problem.

Let's see how this applies to recurrent neural networks. You can think of each time step in a recurrent neural network as a layer; to train a recurrent neural network you use an application of backpropagation called backpropagation through time. The gradient values shrink exponentially as they are backpropagated through each time step. Again, the gradient is used to make adjustments to the neural network's weights, thus allowing it to learn; small gradients mean small adjustments, and this causes the early layers not to learn. Because of the vanishing gradient, the RNN doesn't learn long-range dependencies across time steps. This means that in the sequence "it was raining on Tuesday", there is a possibility that the words "it" and "was" are not considered when trying to predict the user's intention. The network then has to make its best guess with "on Tuesday", and that's pretty ambiguous and would be difficult even for a human. So not being able to learn at earlier time steps causes the network to have a short-term memory.
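To see the vanishing gradient numerically, here is a tiny illustration I've added (not from the course): backpropagation through time repeatedly multiplies the gradient by the same recurrent factor at every step, so any factor with magnitude below one shrinks the signal exponentially. The scalar weight and derivative values below are made up purely for the demonstration.

```python
# Toy illustration of the vanishing gradient in backpropagation through time,
# assuming a single scalar recurrent weight and a tanh-style activation.
w_hh = 0.5          # recurrent weight (magnitude < 1)
grad = 1.0          # gradient arriving from the loss at the last time step

for t in range(20, 0, -1):
    tanh_derivative = 0.8            # stand-in for 1 - tanh(h_t)**2, always <= 1
    grad *= w_hh * tanh_derivative   # chain rule applied once per time step
    if t % 5 == 0:
        print(f"time step {t:2d}: gradient magnitude = {abs(grad):.2e}")

# With |w_hh * tanh'| = 0.4, after 20 steps the gradient is roughly 0.4**20, about 1e-8:
# the earliest time steps receive essentially no learning signal.
```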
We can combat the short-term memory of an RNN by using two variants of recurrent neural networks: gated RNNs and long short-term memory RNNs, also known as LSTMs. Both of these variants function just like RNNs, but they're capable of learning long-term dependencies using mechanisms called gates. These gates are different tensor operations that learn what information to add to or remove from the hidden state, the feedback loop. The main difference between a gated RNN and an LSTM is that the gated RNN has two gates to control its memory, an update gate and a reset gate, while an LSTM has three gates: an input gate, an output gate, and a forget gate. RNNs work well for applications that involve sequences of data that change over time. These applications include natural language processing, sentiment classification, DNA sequence classification, speech recognition, and language translation.

A convolutional neural network, or CNN for short, is a type of deep neural network architecture designed for specific tasks like image classification. CNNs were inspired by the organization of neurons in the visual cortex of the animal brain. As a result, they provide some very interesting features that are useful for processing certain types of data like images, audio, and video. Like a fully connected neural network, a CNN is composed of an input layer, an output layer, and several hidden layers between the two. CNNs derive their name from the type of hidden layers they consist of: the hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers, and normalization layers. This means that instead of the traditional activation functions we use in feed-forward neural networks, convolution and pooling functions are used. The input of a CNN is typically a two-dimensional array of neurons, which correspond to the pixels of an image; if you're doing image classification, the output layer is typically one-dimensional.

Convolution is a technique that allows us to extract visual features from a 2D array in small chunks. Each neuron in a convolution layer is responsible for a small cluster of neurons in the preceding layer. The bounding box that determines the cluster of neurons is called a filter, also called a kernel. Conceptually, you can think of it as a filter moving across the image and performing a mathematical operation on individual regions of the image; it then sends its result to the corresponding neuron in the convolution layer. Mathematically, the convolution of two functions f and g is a sliding dot product: the kernel is multiplied element-wise with each local region of the input, and the results are summed into a single value.

Pooling, also known as subsampling or downsampling, is the next step in a convolutional neural network. Its objective is to further reduce the number of neurons necessary in subsequent layers of the network while still retaining the most important information. There are two different types of pooling that can be performed: max pooling and min pooling. As the names suggest, max pooling picks the maximum value from the selected region, and min pooling picks the minimum value from that region. When we put all these techniques together, we get an architecture for a deep neural network quite different from a fully connected neural network.
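Here is a small NumPy sketch of those two operations, added for illustration: a convolution implemented as a sliding dot product, followed by 2x2 max pooling. The 8x8 image and the 3x3 kernel values are arbitrary stand-ins.

```python
import numpy as np

# Illustrative sketch of convolution (a sliding dot product) and 2x2 max
# pooling on a single-channel image. The kernel values are arbitrary.

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)   # dot product of kernel and region
    return output

def max_pool2d(feature_map, size=2):
    out_h, out_w = feature_map.shape[0] // size, feature_map.shape[1] // size
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i * size:(i + 1) * size, j * size:(j + 1) * size]
            pooled[i, j] = region.max()             # keep only the strongest activation
    return pooled

image = np.random.default_rng(1).random((8, 8))     # stand-in for an 8x8 grayscale image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                     # a simple vertical-edge filter

features = convolve2d(image, kernel)                # 6x6 feature map
pooled = max_pool2d(features)                       # 3x3 after pooling
print(features.shape, pooled.shape)
```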
For image classification, where CNNs are used heavily, we first take an input image, which is a two-dimensional matrix of pixels, typically with three color channels: red, green, and blue. Next, we use a convolution layer with multiple filters to create a two-dimensional feature matrix as the output for each filter. We then pool the results to produce a downsampled feature matrix for each filter in the convolution layer. Next, we typically repeat the convolution and pooling steps multiple times, using the previous features as input. Then we add a few fully connected hidden layers to help classify the image, and finally we produce a classification prediction in the output layer. Convolutional neural networks are used heavily in the field of computer vision and work well for a variety of tasks, including image recognition, image processing, image segmentation, video analysis, and natural language processing.

In this section I'm going to discuss the five steps that are common to every deep learning project you build. These can be extended to include various other aspects, but at their core they are fundamentally five steps. Data is at the core of what deep learning is all about: your model will only be as powerful as the data you bring, which brings me to the first step, gathering your data. The choice of data and how much data you require entirely depends on the problem you're trying to solve. Picking the right data is key, and I can't stress how important this part is: bad data implies a bad model. A good rule of thumb is to make assumptions about the data you require and be careful to record these assumptions so that you can test them later if needed.

Data comes in a variety of sizes. For example, the Iris flower data set contains about 150 examples in total, Gmail Smart Reply has around 238 million examples in its training sets, and Google Translate reportedly has trillions of data points. When you're choosing a data set there's no one size fits all, but the general rule of thumb is that the amount of data you need for a well-performing model should be 10 times the number of parameters in that model. However, this may differ depending on the type of model you're building: for example, in regression analysis you should use around 10 examples per predictor variable, while for image classification the minimum you should have is around a thousand images per class you're trying to classify.

While quantity of data matters, quality matters too; there's no use having a lot of data if it's bad data. There are certain aspects of quality that tend to correspond to well-performing models. One aspect is reliability, which refers to the degree to which you can trust your data: a model trained on a reliable data set is more likely to yield useful predictions than a model trained on unreliable data. How common are label errors? If your data is labeled by humans, sometimes there may be mistakes. Are your features noisy? Is the data completely accurate? Some noise is all right; you'll never be able to purge your data of all the noise. There are many other factors that determine quality, but for the purpose of this video I'm not going to talk about the rest, although if you're interested I'll leave them in the show notes below.

Lucky for us, there are plenty of resources on the web that offer good data sets for free. Here are a few sites where you can begin your data set search. The UCI Machine Learning Repository maintains around 500 extremely well-maintained data sets that you can use in your deep learning projects. Kaggle is another one: you'll love how detailed the data sets are, as they give you info on the features, data types, number of records, and so on. You can use a kernel too, and you won't have to download the data set.
Google's Dataset Search is still in beta, but it's one of the most amazing sites you can find today. Reddit, too, is a great place to request data sets you want, but again there's a chance they won't be properly organized. Creating your own data set will work too: you can use web scrapers like Beautiful Soup to get the data you need.

After you have selected your data set, you now need to think about how you're going to use this data. There are some common pre-processing steps that you should follow. First, splitting the data set into subsets. In general we usually split a data set into three parts: training, testing, and validation sets. We train our models with the training set, evaluate them on the validation set, and finally, once the model is ready to use, test it one last time on the testing data set. Now it is reasonable to ask the following question: why not have two sets, training and testing? That way the process would be much simpler: just train the model on the training data and test it on the testing data. The answer is that developing a model involves tuning its configuration, in other words, choosing certain values for the hyperparameters rather than the weights and biases. This tuning is done with the feedback received from the validation set, and is in essence a form of learning.

It turns out we can't just split the data set haphazardly; do that and you'll get unreliable results. There has to be some kind of logic to how you split the data set: essentially what you want is for all three sets, training, testing, and validation, to be very similar to each other, and to eliminate skew as much as possible. How you split mainly depends on two things: first, the total number of samples in your data, and second, the actual model you're trying to train. Models with very few hyperparameters will be very easy to validate and tune, so you can probably reduce the size of your validation set, but if your model has many hyperparameters you would want a large validation set, as well as to consider cross-validation. Also, if you happen to have a model with no hyperparameters whatsoever, or ones that cannot be easily tuned, you probably don't need a validation set. All in all, like many other things in machine learning and deep learning, the train-test-validation split ratio is quite specific to your use case, and it gets easier to make this judgment as you train and build more and more models.

So here's a quick note on cross-validation. Usually you'd want to split your data set into two, the train and the test set. After this, you keep aside the test set and randomly choose some percentage of the training set to be the actual train set and the remainder to be the validation set. The model is then iteratively trained and validated on these different sets. There are multiple ways to do this, and this is commonly known as cross-validation: basically, you use your training set to generate multiple splits of the train and validation sets. Cross-validation helps avoid overfitting to any single validation split and is getting more and more popular, with k-fold cross-validation being the most popular method. Additionally, if you're working on time-series data, a frequent technique is to split the data by time. For example, if you have a data set with 40 days of data, you can train your model on the data from days 1 to 39 and evaluate it on the data from day 40.
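As a minimal sketch of these splitting strategies, here is an illustration using scikit-learn on made-up data. The 60/20/20 ratio, the five folds, and the "day" column are illustrative assumptions, not prescriptions from the course.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy data: 1000 samples, 5 features, binary labels, plus the day each sample
# was collected on (used for the time-based split at the end).
rng = np.random.default_rng(42)
X = rng.random((1000, 5))
y = rng.integers(0, 2, size=1000)
day = rng.integers(1, 41, size=1000)

# Hold out 20% as the final test set, then carve a validation set out of the rest.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 600 200 200

# k-fold cross-validation: reuse the training data as several train/validation splits.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_trainval)):
    # fit the model on X_trainval[train_idx], validate on X_trainval[val_idx]
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation samples")

# Time-based split: train on days 1 to 39, evaluate on day 40.
train_mask = day <= 39
X_train_time, X_eval_time = X[train_mask], X[~train_mask]
```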
For systems like this, the training data is older than the serving data, so this technique ensures your validation set mirrors the lag between training and serving. However, keep in mind that time-based splits work best with very large data sets, such as those with tens of millions of examples.

The second method we have in pre-processing is formatting. The data set you've picked might not be in the format you'd like: for example, the data might be in the form of a database, but you'd like it as a CSV file, or vice versa. There are a couple of ways to do this, and you can Google them if you'd like.

Dealing with missing data is one of the most challenging steps in gathering data for your deep learning projects. Unless you're extremely lucky and land the perfect data set, which is quite rare, dealing with missing data will probably take a significant chunk of your time. It is quite common in real-world problems to be missing some values in our data samples. This may be due to errors in data collection, blank spaces on surveys, measurements that are not applicable, and so on. Missing values are typically represented with NaN or null indicators. The problem with this is that most algorithms can't handle these kinds of missing values, so we need to take care of them before feeding data to our models. There are a couple of ways to deal with them. One is eliminating the samples or features with missing values; the downside, of course, is that you risk deleting relevant information. The second is to impute the missing values: a common way is to set a missing value to the mean value of the rest of the samples, but of course there are other ways to deal with specific data sets. Be smart, as handling missing data in the wrong way can spell disaster.

Sometimes you may have more data than you require. More data can result in larger computational and memory requirements, and in cases like this it's best practice to use a small sample of the data set: it will be faster, and it ultimately gives you more time to explore and prototype solutions.

In most real-world data sets you're going to come across imbalanced data, that is, classification data with skewed class proportions, leading to a minority class and a majority class. If we train a model on data like this, the model will spend most of its time learning about the majority class and a lot less time on the minority class, and hence the model will ultimately be biased toward the majority class. In cases like this we usually use a process called downsampling and upweighting, which essentially means reducing the majority class by some factor and adding example weights of that factor to the downsampled class. For example, if we downsample the majority class by a factor of 10, then the example weight we add to that class should be 10.
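Here is an illustrative pandas sketch of two of these pre-processing steps on a made-up DataFrame: mean imputation of missing values, then downsampling the majority class by a factor of 10 while attaching an example weight of 10 to the rows that remain. The column names and class sizes are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy imbalanced data set: 1000 majority-class rows, 50 minority-class rows.
rng = np.random.default_rng(0)
n_major, n_minor = 1000, 50
df = pd.DataFrame({
    "feature": rng.normal(size=n_major + n_minor),
    "label": [0] * n_major + [1] * n_minor,
})

# Sprinkle in some missing values, then impute them with the column mean.
df.loc[df.sample(frac=0.05, random_state=0).index, "feature"] = np.nan
df["feature"] = df["feature"].fillna(df["feature"].mean())

# Downsample the majority class by a factor of 10 and upweight the survivors.
factor = 10
majority = df[df["label"] == 0].sample(frac=1 / factor, random_state=0)  # ~100 rows kept
minority = df[df["label"] == 1]
balanced = pd.concat([majority, minority])
balanced["example_weight"] = balanced["label"].map({0: factor, 1: 1})
print(balanced["label"].value_counts())
```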
It may seem odd to add example weights after downsampling. What is the purpose? Well, there are a couple of reasons. First, faster convergence: during training we see the minority class more often, which helps the model converge faster. Second, disk space: by consolidating the majority class into fewer examples with larger weights, we spend less disk space storing them. Finally, upweighting ensures the model is still calibrated: we add upweighting after downsampling so as to keep the data set in similar proportions. These processes essentially help the model see more of the minority class rather than just the majority class, and this helps our model perform better in real-world situations.

Feature scaling is a crucial step in the pre-processing phase, as the majority of deep learning algorithms perform much better when dealing with features that are on the same scale. The most common techniques are normalization, which refers to rescaling features to a range between 0 and 1 and is in fact a special case of min-max scaling (to normalize the data we apply min-max scaling to each feature column), and standardization, which consists of centering each feature at mean 0 with standard deviation 1, so that the feature columns have the same parameters as a standard normal distribution, that is, zero mean and unit variance. This makes it much easier for the learning algorithm to learn the weights, and in addition it keeps useful information about outliers while making the algorithm less sensitive to them.

Once our data has been prepared, we feed it into our network to train. We've discussed the learning process of a neural network in the previous module, so if you are unsure I'd advise you to watch that module first, but essentially, once the data has been fed in, forward propagation occurs, the prediction is compared against the ground truth by the loss function, and the parameters are adjusted based on the loss incurred. Again, nothing too different from what we discussed previously.

Your model has successfully trained: congratulations. Now we need to test how good our model is, using the validation set we set aside earlier. The evaluation process allows us to test a model against data it has never seen before, and this is meant to be representative of how the model might perform in the real world.

After the evaluation process, there's a high chance your model could be optimized further. Remember, we started with random weights and biases, and these were fine-tuned during backpropagation; well, in quite a few cases backpropagation won't get it right the first time, and that's okay. There are a few ways to optimize your model further. Tuning hyperparameters is a good way of improving your model's performance. One way to do this is by showing the model the entire data set multiple times, that is, by increasing the number of epochs, which has sometimes been shown to improve accuracy. Another way is by adjusting the learning rate. We talked about what the learning rate is in the previous module, so if you don't know, I do advise you to check out that module, but essentially the learning rate defines how far we shift our weights during each step of backpropagation, based on information from the previous training step. These values all play a role in how accurate a model can become and how long the training takes. For complex models, initial conditions can play a significant role in determining the outcome of training. There are many considerations at this phase of training, and it's important that you define what makes a model good enough; otherwise you might find yourself tweaking parameters for a long, long time.
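To show where these knobs typically live in code, here is a minimal sketch assuming TensorFlow/Keras and scikit-learn are available. The data, layer sizes, learning rate, and number of epochs are all made-up illustrative values; the point is that the features are scaled first and that learning rate and epochs are the hyperparameters you might revisit after evaluation.

```python
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data on an arbitrary scale.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=12.0, size=(500, 4))
y = rng.integers(0, 2, size=500)

X_minmax = MinMaxScaler().fit_transform(X)        # normalization: rescale to [0, 1]
X_standard = StandardScaler().fit_transform(X)    # standardization: zero mean, unit variance

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# learning_rate and epochs are the tuning knobs discussed above.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_standard, y, validation_split=0.2, epochs=20, batch_size=32, verbose=0)
```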
The adjustment of these hyperparameters remains a bit of an art and is more of an experimental process that heavily depends on the specifics of your data set, model, and training process. You will develop this skill as you go deeper into deep learning, so don't worry too much about it for now.

One of the more common problems you will encounter is when your model performs well on training data but performs terribly on data it's never seen before. This is the problem of overfitting, and it happens when the model learns patterns specific to the training data set that aren't relevant to other, unseen data. There are two ways to avoid overfitting: getting more data and regularization. Getting more data is usually the best solution; a model trained on more data will naturally generalize better. Reducing the model size, by reducing the number of learnable parameters in the model and with it its learning capacity, is another way. By lowering the capacity of the network you force it to learn the patterns that matter, those that minimize the loss. On the other hand, reducing the network's capacity too much will lead to underfitting: the model will not be able to learn the relevant patterns in the training data. Unfortunately, there are no magical formulas to determine this balance; it must be tested and evaluated by trying different numbers of parameters and observing the model's performance.

The second method for addressing overfitting is applying weight regularization to the model. A common way to achieve this is to constrain the complexity of the network by forcing its weights to take only small values, which regularizes the distribution of weight values. This is done by adding to the loss function of the network a cost associated with having large weights, and this cost comes in two flavors. L1 regularization adds a cost proportional to the absolute value of the weight coefficients, the L1 norm of the weights. L2 regularization adds a cost proportional to the squared value of the weight coefficients, that is, the L2 norm of the weights.

Another way of reducing overfitting is by augmenting data. For a model to perform well, or at least satisfactorily, we need a lot of data, as we've discussed already, but typically, if you're working with images, there's always a chance your model won't perform as well as you'd like, no matter how much data you have. In cases like this, when you have limited data sets, data augmentation is a good way of increasing your data set without really increasing it: we artificially augment our data, in this case images, so that we get more data from already existing data. So what kind of augmentations are we talking about? Anything from flipping the image over the y-axis, flipping it over the x-axis, or applying blur, to zooming in on the image. What this does is show your model more than what meets the eye: it exposes your model to more variations of the existing data, so that in testing it will perform better because it has seen images represented in almost every form.

Finally, the last method we're going to talk about is dropout. Dropout is a technique used in deep learning that randomly drops out units, or neurons, in the network. Simply put, dropout refers to ignoring a randomly chosen set of neurons during the training phase; by ignoring, I mean that these units are not considered during a particular forward or backward pass. So why do we need dropout at all? Why do we need to shut down parts of a neural network? A fully connected layer occupies most of the parameters, and hence neurons develop a co-dependency amongst each other during training, which curbs the individual power of each neuron and ultimately leads to overfitting of the training data. So dropout is a good way of reducing overfitting.
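As a final sketch for this section, here is an illustrative Keras model that combines the overfitting countermeasures just discussed: L2 weight regularization, simple image augmentation layers, and dropout. The 32x32x3 input shape, layer sizes, and coefficients are arbitrary choices for the example, assuming a reasonably recent TensorFlow/Keras version that provides the RandomFlip and RandomZoom layers.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Data augmentation: random flips and zooms, applied only during training.
    tf.keras.layers.RandomFlip("horizontal_and_vertical", input_shape=(32, 32, 3)),
    tf.keras.layers.RandomZoom(0.1),

    tf.keras.layers.Conv2D(16, 3, activation="relu",
                           kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 penalty on weights
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.5),    # randomly ignore half the units on each training pass
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```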
I hope that this introductory course has helped you develop a good intuition of deep learning as a whole. Of course, we've only just scratched the surface; there's a whole new world out there. If you liked this course, please consider liking and subscribing, it really helps me make courses like this. I have a couple of videos on computer vision with OpenCV that I will be releasing in a couple of weeks, so stay tuned for that. In the meantime, good luck.
Info
Channel: freeCodeCamp.org
Views: 153,002
Rating: 4.936574 out of 5
Keywords: deep, learning, deep learning tutorial, python, machine learning tutorial, deep learning, deeplearning, what is deep learning, deep learning tutorial for beginners, deep learning tensorflow, deep learning full course, deep learning introduction, intro to machine learning, basics of deep learning, deep learning crash course
Id: VyWAvY2CF9c
Length: 85min 38sec (5138 seconds)
Published: Thu Jul 30 2020