Antonio Marquina - Machine Learning for Gravitational Wave Astronomy: concepts & terminology

Captions
I have to talk about a difficult topic, because it is very broad, and because people working in gravitational-wave detection are really interested in it. I want to give some idea of how the two tutorials will be organized. Tutorial one, which is today, covers general concepts and terminology; I will make some references to gravitational-wave machine learning, but very few, so if you already know a lot of machine learning you can skip this tutorial. The second one will be more focused on applications: I will talk about the mathematical background of dictionary learning, applications to the recovery of wave signals, and we will go step by step through a gravitational-wave recovery from actual data.

Today I start with very basic material. This is the outline for the first part: basic concepts and terminology, and then a brief introduction to neural networks. Within that: what machine learning is, why we should use it, the types of machine learning strategies (batch versus online learning, instance-based versus model-based learning), and the machine learning process. There are several concepts here which are important. Sometimes, in Physical Review papers concerning machine learning for gravitational waves, I do not see much attention paid to these concepts when, for instance, neural networks are used to do the job, so I think this community needs more knowledge about machine learning.

Machine learning is the science of programming computers so that they can learn from data. A more engineering-oriented definition, which I like, is by Tom Mitchell: a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

To identify all these letters in a simple example, consider the spam filter. A spam filter is a machine learning program that can learn to flag spam, given examples of spam emails flagged by users and examples of regular, non-spam emails. In this case we have a training set: the collection of given examples that the system uses to learn. Each training example is called a training instance, or sample. The task T is to flag spam for new emails, the experience E consists of the training data, and the performance P needs to be defined; that is the hard part of specifying the algorithm. For example, you can use the ratio of correctly classified emails over the total. This particular performance measure is called accuracy and is used very often in classification tasks.

Another example: imagine you are playing Scrabble against a computer running a machine learning program. You might beat the computer every time in the beginning; after lots of games the computer starts beating you, until finally you never win. The conclusion is that either you are getting worse, or the machine learning program is learning how to win at Scrabble. Machine learning algorithms learn from experience, so once the program has learned to beat us it can go online and use the learned strategy against other players it has never seen, and win against them too. This machine learning strategy is called reinforcement learning.
Now a remark about traditional programming versus machine learning techniques. A spam filter built with traditional programming would work like this: you look at what spam typically looks like, and you might notice that some expressions, like "for you", "your password", "free", "credit card", "amazing", tend to come up a lot in the subject line. You write a detection algorithm for each of these patterns that you notice, and your program flags an email as spam if a number of these patterns are detected. You then test your program and repeat these steps until it is good enough. Since the problem is not trivial, the code will likely become a long list of complex rules, pretty hard to maintain. On the other hand, a spam filter based on machine learning techniques automatically learns which expressions are good spam predictors by detecting unusually frequent patterns of words in the spam examples; the code is much shorter, easier to maintain, and more accurate. Two rules are used: every email we flag as spam is used as data to train the machine learning algorithm, and every email that we restore to the inbox, because it has been incorrectly flagged as spam by the filter, is also used as training data. Then, as we keep doing these operations, the machine learning program gets smarter.

Machine learning is a great advance in computing, in particular for gravitational-wave data: for problems in which existing solutions require a lot of hand-tuning of long lists of rules, a machine learning algorithm can simplify the code and perform better; for complex problems for which there is no good solution using a traditional approach; for fluctuating environments, since a machine learning system can adapt to new data; and for getting insight into complex problems and large amounts of data. This last one is the usual scenario we have with the gravitational-wave data generated by the LIGO and Virgo detectors.

There are many types, or strategies, of machine learning algorithms, but we can classify them into general categories based on: (1) whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised and reinforcement learning); (2) whether or not they can learn incrementally, on the fly (online versus batch learning); and (3) whether they work by comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, as scientists do (instance-based versus model-based learning).

In supervised learning the training data includes the desired solutions, called labels. A typical supervised learning task is classification. Another typical task is to predict a target numerical value, such as the price of a car, given a set of features (mileage, age, brand), called predictors; this sort of task is called regression. The spam filter is a good example of classification: it is trained with many example emails along with their class, and it must learn how to classify new emails. To train a machine learning system in these cases we need to provide many examples, with their predictors and their labels (in the case of the cars, their prices). (How do I go to the next slide? Ah, yes, okay, thank you.)

Now I name some of the most important supervised learning algorithms, not all covered in this tutorial of course: k-nearest neighbors, linear regression, logistic regression, support vector machines, decision trees and random forests, and some neural networks. I say "some" because this list refers to supervised learning algorithms, and some neural networks are unsupervised, for instance autoencoders and restricted Boltzmann machines.
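As a toy illustration of one of the supervised algorithms just listed, here is a minimal k-nearest-neighbors classifier written with NumPy (the library recommended later in the talk). The synthetic data, the choice k = 3, and all variable names are made up for this sketch and are not taken from the lecture.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Euclidean distance from x_new to every training instance
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny synthetic example: two clusters labeled 0 and 1
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
                     rng.normal(3.0, 0.5, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)

print(knn_predict(X_train, y_train, np.array([2.8, 3.1])))  # expected: 1
```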
(Let me just adjust the touchscreen; thank you, this is all new for me.)

Unsupervised learning. The following are the most important algorithms; they use training data that is unlabeled. The first class is clustering: k-means, hierarchical cluster analysis (HCA), expectation maximization. The second class is dimensionality reduction, which is very important: principal component analysis (PCA), kernel PCA, locally linear embedding. Then there are anomaly detection and association rule learning.

A few words about dimensionality reduction. Its goal is to simplify the data without losing too much information. One obvious way to do this is to merge several correlated features into one. Dimensionality reduction can be obtained by means of PCA, or the singular value decomposition in terms of linear algebra; it is an unsupervised algorithm that finds strong correlations in large datasets.

Another important unsupervised task is anomaly detection, for example detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. The algorithm is trained with normal instances, and when it sees a new instance it can tell whether it looks like a normal one or whether it is likely an anomaly. As you can see, these concepts can be applied, in principle, to many situations you have in gravitational-wave data analysis.

Semi-supervised learning is another interesting class. Some algorithms can deal with partially labeled training data: usually a lot of unlabeled data and a little bit of labeled data. This is semi-supervised learning. Some photo-hosting services, such as Apple iPhoto, are good examples of this. We upload our family pictures to the service. In the supervised part, we need to provide the algorithm with a label per person, in at least one, two or three pictures. In the unsupervised part, the algorithm automatically recognizes that the same person A, for instance, appears in pictures 1, 3 and 12, while another person appears in pictures 3, 5 and 7. Usually this uses clustering algorithms.
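Going back to dimensionality reduction for a moment, here is a minimal sketch of PCA obtained from the singular value decomposition with NumPy, in the spirit of the remark above. The synthetic data matrix, with two deliberately correlated features, is my own construction for the illustration.

```python
import numpy as np

# Synthetic data: 200 samples of 5 features, where two features are
# strongly correlated copies of a common underlying signal.
rng = np.random.default_rng(1)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2.0 * t + 0.05 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 3))])

# PCA via the SVD: center the data, then decompose.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of the total variance captured by each principal component.
explained = S**2 / np.sum(S**2)
print("explained variance ratios:", np.round(explained, 3))

# Keep the first two components: the 5-dimensional data reduced to 2 dimensions.
X_reduced = Xc @ Vt[:2].T
print("reduced shape:", X_reduced.shape)   # (200, 2)
```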
Another category is reinforcement learning, which is much more difficult to code. The learning algorithm, called an agent, has the following tasks: it can observe the environment, it can select and perform actions, and it gets rewards in return, or penalties in the form of negative rewards. It must then learn by itself what is the best strategy, the so-called policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation. A couple of examples: many robots implement reinforcement learning algorithms to learn to walk or perform other tasks. A more famous one is DeepMind's AlphaGo program, which plays Go against computers or human players and was trained as a reinforcement learning algorithm, learning its winning policy through the analysis of millions of games and by playing many games against itself. AlphaZero is one of the latest breakthroughs in this area: the program learns only by playing Go or chess against itself, and these results were featured in a Science editorial in December 2018.

Now we look at batch learning and online learning. In batch learning the system is incapable of learning incrementally: it must be trained using all the available data. This generally takes a lot of time and computing resources, so it is typically done offline. Offline learning works as follows: first the algorithm is trained; then it is launched into production and runs without learning anymore, it just applies what it has learned. In practice, if we want a batch learning system to know about new data, such as a new type of spam, we need to train a new version of the system from scratch on the full dataset (not just the new data but also the old data), then stop the old system and replace it with the new one. Online learning consists of the following steps: train the algorithm incrementally by feeding it new data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the algorithm can learn about new data on the fly, as it arrives.

Now we discuss instance-based versus model-based learning. One more way to categorize machine learning systems is by how they generalize. Most machine learning tasks are about making predictions; this means that, given a number of training examples, the algorithm needs to be able to generalize to cases it has never seen before. Having a good performance measure on the training data is convenient but insufficient: the true goal is to perform well on new instances. There are two main approaches to generalization: instance-based learning and model-based learning.

Instance-based learning consists of an algorithm that learns the examples and then generalizes to new cases using a similarity measure. For instance, instead of just flagging emails that are identical to known spam emails, the spam filter could be programmed to also flag emails that are very similar to known spam emails. This requires a measure of similarity between two emails; a very basic one could be to count the number of words they have in common. The system would then flag an email as spam if it has many words in common with a known spam email.
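A minimal sketch of that basic similarity measure, counting shared words between two emails; the example strings and the function name are mine, invented for the illustration.

```python
def shared_word_count(email_a: str, email_b: str) -> int:
    """Very basic similarity measure: number of distinct words two emails share."""
    words_a = set(email_a.lower().split())
    words_b = set(email_b.lower().split())
    return len(words_a & words_b)

known_spam = "amazing offer free credit card just for you"
incoming   = "free credit card offer amazing deal"
print(shared_word_count(known_spam, incoming))  # 5 words in common
```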
Another way to generalize from a set of examples is to build a model of these examples and then use the model to make predictions; this is called model-based learning. Here, for instance, we have an example taken from one of the references I use, the book by Aurélien Géron: data for some 45 countries, with the gross domestic product per capita, related to what they call life satisfaction. The problem is to know whether money makes people happier. In this case linear regression is good enough to make a reasonable prediction, and the goal is to determine the two constants of the regression line.

In gravitational waves, for instance, and we will talk about this in tutorial two, we classify blips generated with a model: we take three types of functions, one a sine-Gaussian and another something similar to a ring-down, and we use each part of this model to detect the corresponding parts in the data. In the case of dictionary learning, which we will also talk about, we ignore everything else: we use only one dictionary, and then we detect, for instance, one inspiral, and whatever does not belong to it is treated as pure noise. We will explain how to do that. This is one example of model-based learning.

Now the steps we have to follow when we have a machine learning problem, either classification or regression, and we have to follow them in this order, starting with data collection and preparation. I know of people, not working in gravitational waves, who do the opposite: they choose the algorithm first and then adapt the data to the algorithm they have. This is not a good strategy. At the beginning we have to collect, clean and measure the size of the chosen dataset: the input data for training and, for supervised learning, the corresponding target data. Then feature selection: define the possible features we want to use. Then the algorithm: given the dataset, choose the appropriate algorithm; only after we have studied the data collection and preparation do we choose the algorithm. Then parameter and model selection: for many algorithms there are parameters that have to be set manually, or that require experimentation to identify appropriate values; for instance, in the case of neural networks, the hyperparameters are usually fixed manually before training starts. Then the training phase: given the dataset and the algorithm, training should simply be a matter of computational resources; this is the step that takes the most computational cost. Finally, evaluation: check the accuracy of the trained algorithm on the validation dataset; we will discuss later what the validation dataset is.

Some terminology. We use vectors and matrices: input vectors and output vectors. In machine learning the dimensionality of an input is the number of components of the vector (sometimes the dimensionality of a vector is understood as the number of components different from zero). The vector of inputs is the data given as one input to the algorithm, where the vector has m dimensions. Then we have the weights, which are the weighted connections, arranged in a matrix. The outputs form the output vector, possibly with a different dimensionality from the input, and we can write y(x, W) to remind ourselves that the outputs depend on the inputs and the weights.
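Before moving on, here is a tiny concrete case of a model-based learner with just two parameters: the regression-line example described above, fitted by least squares with NumPy. The GDP and life-satisfaction numbers below are invented placeholders, not the actual dataset used in the talk.

```python
import numpy as np

# Invented placeholder data: GDP per capita (thousands of USD) vs life satisfaction.
gdp  = np.array([10.0, 20.0, 30.0, 40.0, 55.0, 70.0])
life = np.array([5.0, 5.8, 6.1, 6.5, 7.0, 7.2])

# Model: life_satisfaction = theta0 + theta1 * gdp.  Fit the two constants
# by least squares using the design matrix [1, gdp].
A = np.column_stack([np.ones_like(gdp), gdp])
theta, *_ = np.linalg.lstsq(A, life, rcond=None)
theta0, theta1 = theta
print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.4f}")

# Use the fitted model to predict life satisfaction for a new country.
new_gdp = 50.0
print("prediction:", theta0 + theta1 * new_gdp)
```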
Later on, when we talk about neural networks, we will discuss the role of the weights. Targets: the target vector contains the labels of the elements; one target corresponds to one input. Activation function: used in neural networks, it is a mathematical function that describes the firing of the neuron as a response to the weighted inputs. Error, or cost: a function that computes the inaccuracy of the network, as a function of the outputs and the targets.

In the end, for almost any machine learning algorithm, the following ingredients are essential. Applied linear algebra: the solution of large systems of linear equations will be needed, and, especially when we are checking whether our strategy is good enough, it is convenient to check it first on simple structures, for instance large linear systems. Euclidean norms and other metrics, to measure the closeness between outputs and targets. Non-linear activation functions, whose role is to reduce dimensionality and classify the outputs at the end. And, not least, the computational cost of the training: the highest computational cost corresponds to the training step.

Overtraining and undertraining. The magic word associated with machine learning algorithms is generalization; there are many application papers in which this word does not appear, and for me that omission means something important is missing. Our ideal goal is to make sure that we do enough training so that the algorithm generalizes well. There is as much danger in overtraining as there is in undertraining. If we train for too long, we will overfit the data, which is another magic word: it means that the algorithm has also learned the noise and the inaccuracies in the data. In that case the model we learn will be too complicated and will not be able to generalize.

Here is a simple picture to understand a little of what overfitting means, in terms of numerical approximation by functions. The effect of overfitting is that, rather than finding the generalized function (as shown on the left), the algorithm matches the inputs perfectly, including the noise in them (on the right); the approximation on the right has reduced generalization capabilities. If we learn too much about a cloud of data, the learned function has problems making a good generalization to other cases.

Training, testing and validation sets. We want to stop the learning process before the algorithm overfits. We cannot use the training data for this, since on it we cannot detect overfitting, and we cannot use the testing data, since it is reserved for the final test; we need a third set of data for this purpose, which is called the validation set ("cross-validation" in statistics). The validation set is a set of instances that the algorithm has not seen. The scheme is this: we take the whole dataset and make a partition into three pieces, for training, for testing, and for validation. The point is that we train on the training and testing data and then check whether the accuracy obtained there is of the same order of magnitude as the accuracy the trained algorithm obtains on the validation set. We still need to work out whether or not the result is good enough.
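A minimal sketch of that three-way split, assuming the data is simply a NumPy array of samples; the 60/20/20 proportions and the function name are my own choices for the illustration, not prescriptions from the talk.

```python
import numpy as np

def split_dataset(X, y, train=0.6, test=0.2, seed=0):
    """Shuffle the data and split it into training, testing and validation parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train * len(X))
    n_test = int(test * len(X))
    i_train = idx[:n_train]
    i_test = idx[n_train:n_train + n_test]
    i_val = idx[n_train + n_test:]          # the rest: unseen validation data
    return (X[i_train], y[i_train]), (X[i_test], y[i_test]), (X[i_val], y[i_val])

X = np.random.rand(1000, 8)            # 1000 synthetic samples, 8 features
y = np.random.randint(0, 2, 1000)      # synthetic binary labels
train_set, test_set, val_set = split_dataset(X, y)
print(len(train_set[0]), len(test_set[0]), len(val_set[0]))  # 600 200 200
```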
After training and testing an algorithm, a suitable tool for classification problems is the one known as the confusion matrix, which is used in some papers. The idea is the following. Consider a square matrix that contains all possible classes along both the horizontal and the vertical direction. The element (i, j) of the matrix tells us how many patterns belong to class i in the targets but were put in class j by the algorithm. Anything on the leading (main) diagonal is correct. Suppose we have three classes: we count the number of times the output is class C1 when the original target was C1 (all correctly classified); then, when the target was C2, the counts go in the second row, and so on until the table is filled. According to the example table, most patterns were classified correctly, because they are located on the main diagonal, but, for instance, two examples of class C3 were misclassified as C1. The accuracy of the classification is the ratio between the sum of the entries on the leading diagonal and the sum of all the entries in the matrix.

One important particular case is binary classification, which considers only two classes, C1 and C2. The possible outcomes can be arranged in a simple chart: on the main diagonal we have the true positives and true negatives, and on the secondary diagonal the false positives and false negatives. We have the same interpretation as with any confusion matrix, but in this case we have names for the locations of the entries: those on the leading diagonal are correct and those off it are incorrect. The accuracy is defined as the sum of the numbers of true positives and true negatives divided by the total number of examples. This is one measure of accuracy; there are other measures that can help to understand the performance of a classifier, but these metrics (specificity, precision, recall) are out of the scope of this tutorial; that is much more statistics.

Another interesting element when we are addressing a classification problem is the ROC curve. With it we can evaluate a particular classifier, or compare different classifiers (either the same classifier with different learning parameters or completely different classifiers), when we have a binary classification algorithm. The ROC curve plots the false positive rate against the true positive rate, both as percentages. A single run of a classifier gives one point of the ROC plane: for every confusion matrix we plot one point. A classifier located at the point (0, 100) means perfect classification, with no false positives; total failure would be at (100, 0).
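A minimal numerical illustration of the binary confusion matrix, the accuracy read off its diagonal, and the single ROC point that one classifier run produces; the target and output vectors below are invented for the example.

```python
import numpy as np

# Invented binary classification results: 1 = positive class, 0 = negative class.
targets = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
outputs = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tp = np.sum((outputs == 1) & (targets == 1))   # true positives
tn = np.sum((outputs == 0) & (targets == 0))   # true negatives
fp = np.sum((outputs == 1) & (targets == 0))   # false positives
fn = np.sum((outputs == 0) & (targets == 1))   # false negatives

accuracy = (tp + tn) / len(targets)            # leading diagonal over the total
tpr = tp / (tp + fn)                           # true positive rate (vertical ROC axis)
fpr = fp / (fp + tn)                           # false positive rate (horizontal ROC axis)

print("confusion matrix:\n", np.array([[tp, fn], [fp, tn]]))
print(f"accuracy = {accuracy:.2f}, TPR = {tpr:.2f}, FPR = {fpr:.2f}")
# This single (FPR, TPR) pair is one point of the ROC plane for this classifier run.
```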
Two references I like. The first one is more elementary but is a great introduction: it covers a wide range of topics in depth, with code examples for everything, written in Python using NumPy, which is very close to basic MATLAB, so it is really good for learning Python while using things you have used before. The second, by Aurélien Géron, is much more difficult; it explains many examples in machine learning, updated to TensorFlow 2.0. The codes presented in that book are good enough but more cryptic, so Python code in TensorFlow is much harder to understand.

Question: I had a quick question about the ROC curve. Shouldn't the sum of the true positive rate and the false positive rate be one, or 100? Or is that not the case, because the true negatives and true positives have to be removed?

Answer: What happens in the ROC curve is that we rescale: the values coming from the rates run from zero to one, but to represent them as percentages we rescale to 100. It depends on the application; some people understand percentages better.

Question: Sorry, I was asking whether the sum of true positives and false positives should be 100 at any point.

Answer: No. In an ROC curve, every point represents the execution of one classifier, one or another, so you can compare many classifiers. If the curve you draw in the ROC plane lies far above the diagonal, it means the classifiers are good. You cannot say anything deterministic; you can only say which one is good enough. The best one would approach the corner going from (0, 0) towards (0, 100), but that is not usual: classifiers have errors and they differ, so the scaling itself is not the important thing. You can fill the ROC plane depending on what you want to compare, even a single classifier with different parameters. I explained this just to say it is one tool for understanding better what is going on with a classifier; of course there is much more in the statistics literature. Okay, thank you.

So with this I finish the first part of the talk, on general concepts, and I pass to the other one, a brief introduction to neural networks.

Question: Actually, I have a question for you. When you show the confusion matrix, there is an assumption that every time you have the true label, say C1, you get an outcome which is C1, and for C2 you get C2, all the time. For highly non-linear systems that may not be true, right? You can have chaotic systems, and your labeling is discrete.

Answer: I will think about this. I have never tried a very non-linear problem, so probably it does not work, or maybe it is discussed in some statistics books; I do not know how to answer your question fully. Maybe the confusion matrix is not good enough to evaluate the classifier when the things you are classifying are very non-linearly correlated. Usually, when I have this kind of doubt, I like to use PCA or a singular value decomposition, which is an unsupervised tool in machine learning, to see what is strongly correlated and what is not; it is a good tool for analyzing your data before you go further with neural networks or other methods. That is my point of view, following my experience.
Question: You mentioned anomaly detection. The methods I have seen assume that you know what "normal" is; in a certain sense you have to train a little bit and tell the algorithm "this is normal", so that it can then detect an anomaly. Have you ever seen methods or algorithms that can start doing anomaly detection without knowing what normal is, so that it takes some time to learn it, starting from scratch? For example, in a year's time we will turn on the interferometer and we do not know what "normal" is for the new interferometer. Is there an unsupervised anomaly detection mechanism that learns what normal is before learning what an anomaly is?

Answer: You have to inform the algorithm, in a way, about what kind of patterns you want to look at; it depends on the algorithm. Usually this would be instance-based learning, and instance-based learning in general leaves you with more uncertainty. You have to know what you are looking for: normal patterns according to something, not everything. So I do not know; you have to tell the algorithm what you want to see, and perhaps it detects something, but I do not know beyond this.

Question: You mentioned that you divide your data into three sets: training, testing and validation. Is there a standard split, say 30% for testing, or does it depend on the problem itself?

Answer: When I explain the neural networks you will see what is going on, but at the beginning, when you take all your data, the simplest thing is to do training and validation and skip testing; the testing set belongs to a more elaborate procedure of statistical inference. You train the algorithm on the training dataset, which is larger than the others, and you arrive at some accuracy. Then you check the fitted algorithm on the validation set, and if the accuracy, the number of cases classified correctly, is similar to the one you obtained on the training set, then the algorithm is well trained and generalizes well, because it gets the same accuracy on the training set and on the validation set, and the validation set is a set of instances it has never seen. That is the point. If you satisfy this rule of thumb, you can be fairly sure you do not have overfitting: the rule for avoiding overfitting is to get, on the validation set, the same accuracy you obtained after training. This is the normal strategy used with neural networks, for instance. Many papers do not care about this, especially application papers, because they have more difficult problems, but you have to ensure it, because with gravitational-wave data it is difficult to know whether or not you have overfitting; overfitting is something elusive, in a way.

Okay, now the second part, in which we talk about biological and artificial neurons, simple examples of neural networks, activation functions, elements of neural networks, stochastic gradient descent, and references and links. It is impossible to cover all of this in one hour; well, maybe I can take a few extra minutes. Until now everything has been words, so now I want to show some pictures.
The first picture is probably one of the first selfies in history. This man is the father of neuroscience, Ramón y Cajal, a scientist who won the Nobel Prize at the beginning of the twentieth century. He was a professor at the Faculty of Medicine of the University of Valencia, the first position he got; afterwards he was in Madrid, because in Europe everything is centralized. Here he is in his laboratory, taking his own picture: a selfie before selfies, in a way. Ramón y Cajal's writings and drawings, which made him famous and are now in a museum, analyze how neurons work, and from them comes the anatomical structure of a biological neuron, composed of dendrites, the cell body, the axon, and the action potential. The dendrites are the input of the cell: the signals coming from other neurons in the network. The cell body is charged with a cell voltage, and when the voltage crosses some threshold a signal travels through the axon to the other neurons in the neighborhood.

To summarize: the neuron receives information from many other neurons through the dendrites; it aggregates this information (at the beginning nobody knew exactly how) via changes in the cell voltage in the cell body; and then there is a nonlinear step: it transmits a signal if the cell voltage crosses the threshold level, a signal that can be received by many other neurons of the network. The only thing that was not clear was how the information is aggregated.

The first person to propose a model was Frank Rosenblatt, in 1958: the perceptron model. It follows the same idea (there are inputs, a neuron, and an output) and the same steps, except that now the linear step specifies exactly what the aggregation of information does. The unit receives input from multiple other neurons (from the dendrites, in the case of biological neurons); the linear step aggregates those inputs via a simple arithmetic operation called a weighted sum; and the nonlinear step generates an output if the weighted sum crosses a threshold level (related to the bias b defined below), an output that can be sent to many other neurons.

The basic operation is the weighted sum: we have n inputs x_i and corresponding weights w_i, and the binary output of the perceptron can be described as follows: if the weighted sum, sum_i w_i x_i, is strictly larger than the threshold, the output is 1; if it is less than or equal to the threshold, the output is 0. We can write the weighted sum in a short way as w · x, and if we define the parameter b, called the bias, as minus the threshold, in order to put everything on one side and compare with zero, then we can write the perceptron in a concise notation: output 1 if w · x + b > 0, and output 0 otherwise.

Now we can build elementary neural networks with layers: the first layer is the input layer, the last one is the output layer, and in the middle we can have several layers called hidden layers. When we have three or more hidden layers, people consider the network to be deep; just having three or more.
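A minimal sketch of Rosenblatt's perceptron unit exactly as just described, a weighted sum plus bias followed by a threshold at zero; the example inputs, weights and bias are made up for the illustration.

```python
import numpy as np

def perceptron(x, w, b):
    """Binary perceptron: output 1 if the weighted sum plus bias is positive, else 0."""
    z = np.dot(w, x) + b          # linear step: weighted sum plus bias
    return 1 if z > 0 else 0      # nonlinear step: threshold at zero

# Made-up example: 3 inputs, weights and bias chosen by hand
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, 0.1])
b = -0.3
print(perceptron(x, w, b))   # 0.4 - 0.2 + 0.2 - 0.3 = 0.1 > 0, so the output is 1
```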
Activation functions. We write the weighted sum as before; the activation function is then a function sigma, a non-linear real-valued function whose range is either [0, 1] or [-1, 1], and whose only argument is the value of the weighted sum plus the bias. In a way, when we apply this function we associate to every neuron a value between zero and one, or sometimes between minus one and one, depending on the problem. With everything rescaled to [0, 1] or [-1, 1], everything works numerically much better; even the input data is recommended to be rescaled to lie between zero and one.

The step function is of course an activation function; a neuron with a step activation is a perceptron, with a binary 0/1 response. The sigmoid function, instead of giving a binary 0/1 response, gives a value between zero and one that can be read as a probability. Another one is the hyperbolic tangent, with values between minus one and one. Finally, the rectified linear unit (ReLU) is linear for positive numbers and zero for negative ones. There are many modifications of these; of ReLU in particular you can find a lot of variants, and depending on the problem one or another works much better. This is one of the things you have to think about when you are trying to design a neural network.

Now the list of components of a neural network. First, neurons, which can be of sigmoid, hyperbolic tangent or ReLU type. Cost functions: the most popular are the quadratic cost and the cross-entropy; we will see the role of each one. Then we need a gradient descent method, which has a stochastic component; I will explain why this stochastic component is added: it is there to get good training of the network and to help prevent overfitting. Then back-propagation: because we use gradient descent, we have to compute gradients of the cost. In a neural network the input is fed with all of your dataset, and the weights and biases are the unknowns, so we have to fit a large number of parameters, the weights and biases for all the inputs in the dataset, in such a way that the cost is minimized.

Layers: you can use dense, convolutional, or softmax layers. Two words about dense and convolutional. Dense means that you have to fit a lot of weights; it is the worst case from the point of view of how big the computation is, so when the data is high-dimensional people usually use convolutional layers: the convolution kernels are very small and they use very few weights compared with the dense case. Softmax is easy: it is a kind of final processing of the output layer in order to assign a probability to every item in it.

Initialization is also important. Some years ago I discovered that people use random initialization, but if you use random initialization without being careful you may find that during the training process the neurons become saturated, stuck near zero or one. There is a truncated-Gaussian scheme, due to Glorot and Bengio, that works very well.
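Since the Glorot initialization was just mentioned, here is a simplified sketch of the idea: draw the weights from a Gaussian whose variance is scaled by the layer sizes and resample anything beyond two standard deviations. Real libraries use a slightly different truncation convention, and the layer sizes here are arbitrary; this is only an illustration of the principle.

```python
import numpy as np

def glorot_truncated_normal(fan_in, fan_out, rng):
    """Weight matrix from a truncated Gaussian with variance scaled by the layer sizes."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    w = rng.normal(0.0, std, size=(fan_out, fan_in))
    outside = np.abs(w) > 2.0 * std          # simple truncation at two standard deviations
    while np.any(outside):
        w[outside] = rng.normal(0.0, std, size=outside.sum())
        outside = np.abs(w) > 2.0 * std
    return w

rng = np.random.default_rng(0)
W = glorot_truncated_normal(fan_in=64, fan_out=32, rng=rng)
print(W.shape, round(W.std(), 4))   # std a bit below sqrt(2/(64+32)) ~ 0.144 due to truncation
```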
At the end there is another element, dropout, and also data augmentation, both of which help prevent overfitting; not much more about that now.

Cost functions. Neural networks may address two kinds of problems, classification and regression. We are given target vectors, which define the desired labels for the outputs, and the estimated output vectors produced by the network. For regression problems the appropriate cost function is the quadratic cost, the least-squares error; in this estimation the output approximates what is called the learning function, which we will define below through the forward phase of the already trained network predicting the targets. For classification problems the appropriate cost function is the cross-entropy cost. The cross-entropy is much better here because it is based on Shannon's theory of information: by minimizing this function we are, from the point of view of information theory, reducing the uncertainty contained in the signal. The mathematical explanation is that the cross-entropy is structured so that its derivative with respect to the weights is related to the difference between the target and the estimate; a greater difference gives a larger derivative with respect to the weights, and then the neurons learn faster. So classification problems typically learn much faster than regression problems.

The parameters to be fitted, the unknowns, are the weights and biases. We define the learning function as the function whose argument is the input vector and whose value is the prediction computed from that input vector. To get the value of the learning function you do forward propagation: you take one sample of data, you go through the whole network from left to right according to our picture, and at the end you get one estimated output; this estimated output is the value of what I call the learning function. The learning function is the composition of a chain of functions (as a mathematician I have to say this), each of which corresponds to the evaluation of an activation function applied to the weighted sum of the neuron values of a layer. Indeed, when ReLU is used, F is a continuous piecewise linear function on a cube of R^m, and its graph is a surface made up of many flat pieces; it looks like a kind of origami, with some flat pieces going off to infinity.

A mathematical and historical note related to this. The reason neural networks are so successful, from the mathematical point of view, is that the learning function is a composition of continuous, piecewise linear functions, and the operation of composition in mathematics is very powerful for approximating functions. A famous generalization of the 13th problem of Hilbert, in his list of 23 unsolved problems, asks: is every continuous function of three variables the composition of a finite number of continuous functions of two variables? The answer is yes, and the positive answer was given in 1957 by Vladimir Arnold, at age 19; his teacher was Andrei Kolmogorov. I am probably mispronouncing the names; a Russian would do it much better. But we are interested in the design of networks for gravitational waves, and our main purpose is classification rather than this kind of understanding. (Am I on time? Okay, not bad; I have three or four more slides.)
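A minimal sketch of the learning function as a composition of affine maps and ReLU activations, i.e. one forward propagation through a small network; the layer sizes and the random weights are placeholders chosen for the illustration, not values from the talk.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Learning function F(x): a composition of (weighted sum + bias) and activations."""
    a = x
    for W, b in params[:-1]:
        a = relu(W @ a + b)        # one layer: affine map followed by ReLU
    W_out, b_out = params[-1]
    return W_out @ a + b_out       # linear output layer (regression-style)

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]               # input, two hidden layers, output
params = [(rng.normal(0, 0.5, (m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=4)
print("F(x) =", forward(x, params))   # the estimated output y_hat for this input
```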
So that was my mathematical observation; this topic around the thirteenth problem has many interesting aspects from the mathematical point of view. What we really need to understand is the algorithm that minimizes the cost function, which is fundamental in training the network. It is based on an iterative procedure, gradient descent, in which the weights and biases evolve as w_{k+1} = w_k - s_k * grad_w C and b_{k+1} = b_k - s_k * grad_b C, where s_k is the learning rate and k is the iteration number. We expect convergence when w_k tends to w* and b_k to b*, values at which the cost function is minimized for the training and testing data.

What happens in general? We have large networks with many unknown weights (this is the usual problem) and many samples in the training set, although at this moment the catalogues of waveforms are not too big. Then the plain iteration is often not successful: computing the derivatives is too expensive, even though each one is easy to compute, and the trained network can get poor results on unseen test data. The conclusion is that the learning function can fail to generalize.

Stochastic gradient descent uses not the full batch of training data but mini-batches of samples of training data; the mini-batches form a partition of the training dataset, each with a size much smaller than the full dataset. I will explain in a few words how this works: the whole batch is split into small pieces, and there are two nested iterations, an outer one and an inner one. One pass of the outer iteration is called an epoch, and the inner iterations are called steps, the steps of gradient descent. We start with the first epoch: we take the set of mini-batches of the training set and reorder the list of mini-batches in a random order. Then we perform what is called a round of training, which means applying gradient descent on one mini-batch only: sample a mini-batch, forward propagate it through the network to estimate the targets with the outputs y-hat, calculate the cost by comparing y and y-hat, and use the gradient of C to adjust w and b, enabling the x in this mini-batch to better predict y.

Computing derivatives with back-propagation: the derivatives in the gradient have to be computed with respect to all the weights; some weights sit in the hidden layer closest to the output, and others are at the beginning. When we compute them we can apply a recurrent chain rule: we only have to compute derivatives of the weighted sums and of the activation functions, and then the total derivative with respect to any weight is a product of derivatives, one per layer. It is possible to explain this in detail and pour it into equations, but you understand the idea more or less. At the beginning, some engineers who were not trained in this did not apply the chain rule, and they had a lot of problems computing it when this theory started to be used computationally.

We repeat the round of training for a fixed number of iterations (we fix the number of gradient-descent steps), and when we finish the epoch we start another epoch; each epoch applies the gradient descent over all the mini-batches.
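A minimal sketch of this epoch/mini-batch loop, assuming a simple linear model with a quadratic cost so that the gradient can be written by hand; the synthetic data, learning rate, batch size and epoch count are arbitrary choices of mine, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = 3*x1 - 2*x2 + 1, plus noise.
X = rng.normal(size=(512, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + 0.1 * rng.normal(size=512)

w, b = np.zeros(2), 0.0                  # unknowns: weights and bias
lr, batch_size, n_epochs = 0.1, 32, 20   # learning rate s_k kept constant here

for epoch in range(n_epochs):                    # outer iteration: epochs
    order = rng.permutation(len(X))              # reorder the mini-batches randomly
    for start in range(0, len(X), batch_size):   # inner iteration: steps
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        y_hat = Xb @ w + b                       # forward pass on the mini-batch
        err = y_hat - yb
        grad_w = Xb.T @ err / len(idx)           # gradient of the quadratic cost
        grad_b = err.mean()
        w -= lr * grad_w                         # gradient-descent update
        b -= lr * grad_b
    if epoch % 5 == 0:
        cost = 0.5 * np.mean((X @ w + b - y) ** 2)
        print(f"epoch {epoch}: cost {cost:.4f}")

print("learned w, b:", np.round(w, 2), round(b, 2))
```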
The gradient-descent step is computed on a small mini-batch, a small sample of the data; we sample a new one and repeat this process to complete one epoch without repetition. Then we start a new epoch and reorder the mini-batches again, so we are not working with the same order we used before: we apply a random order in every epoch.

Question: When you are adjusting w and b for the next mini-batch, do you use the values you have already obtained, or do you get new values?

Answer: I adjust w and b on a mini-batch, and for the next one on the random list I use those numbers; I keep updating them. At the beginning I need to initialize every value, and I do this by sampling from a distribution, the Glorot one, which prevents the saturation of neurons. So you keep updating, you go through all the mini-batches, and then you finish the epoch; at the end of the epoch you can estimate the accuracy you got in this epoch by counting the correct cases on the training set, and only on the training set, because we are not training on the validation or the testing set: you just take the trained network, apply it to the data in the validation set, and estimate the accuracy there.

It is much more difficult, in this general case, to analyze why the random component of the algorithm makes things work better. The way to see the reason is, as always, through a simple example. For instance, I am sure you know the Gauss-Seidel method, an iterative method for solving linear systems; there is another one, the Kaczmarz method, which is not very popular, due to a Polish mathematician from the thirties. You can program these kinds of algorithms, and they can be randomized. If you want to work with a big linear system you take mini-batches; in this case the mini-batch size is one, because Gauss-Seidel and Kaczmarz act on only one equation at a time, and each step takes what you obtained in the one before. If you program this you see that you get really good results, for instance for solving the normal equations of a least-squares problem, which are symmetric, things like that. But when the matrix is too big you have ill-conditioning, and you can fight against the ill-conditioning using randomness. You can then imagine translating this to our more complicated case: randomness is a good ingredient to work with. The second thing is that the cost function is usually non-convex, so the minimizer is not unique; the point is that by introducing randomness you may jump from one basin to another and avoid being trapped by a local minimum. This is another rule of thumb; it is not mathematics, it is just experience, although I am sure there will be mathematics behind it.

So the success of deep learning using stochastic gradient descent rests on two facts. First, computing the gradient by back-propagation, the recurrent chain rule, on the mini-batch samples is much faster than on the full batch. Second, this stochastic procedure produces weights that also succeed on unseen data, which means they generalize well.
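Going back to the Kaczmarz remark above, here is a minimal sketch of the randomized Kaczmarz iteration on a small consistent linear system: each step acts on a single randomly chosen equation, the "mini-batch of size one" analogy from the talk. The matrix size and number of sweeps are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small consistent linear system A x = b with a known solution.
A = rng.normal(size=(200, 20))
x_true = rng.normal(size=20)
b = A @ x_true

x = np.zeros(20)
for it in range(4000):
    # Randomized Kaczmarz: pick one equation at random and project x onto it.
    i = rng.integers(len(b))
    a_i = A[i]
    x += (b[i] - a_i @ x) / (a_i @ a_i) * a_i

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```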
But you have to check the generalization, by comparing the accuracy obtained on the training dataset with the accuracy on the validation set, following the general inference rule explained before. One defence against overfitting, to get satisfactory generalization, is early stopping. At the beginning the training works very efficiently, but there comes a moment when it does not: if you run too many epochs, the accuracy on the training set keeps improving while you can no longer match that accuracy on the validation set. So the first rule is to stop when you see balanced accuracy in both sets; this is the inference question, and then you avoid overfitting. Well, I have no proof of this.

This is the last slide: the parameters of a neural network that have to be tuned. The number of hidden layers and their dimensions; usually the hidden layers all have the same dimension, except in specific problems. The activation function chosen for the neurons in the hidden layers. The use of dropout; we will talk about dropout values in the next tutorial. The cost function. The optimizer used: it is possible to use stochastic gradient descent, but there are other algorithms which are more accurate, of second or even higher order; one of the world experts in this is Nesterov, and there is one called Nadam, a Nesterov variant of Adam. The idea of gradient descent is to avoid the computation of second or third derivatives, because that is too costly compared with first derivatives; there are second-order methods, and other methods that use finite differences in a way, because they are much easier and chain rules exist for finite differences as well. Another important parameter is the mini-batch size (if you change one of these parameters, the training can go from very good to a disaster), and also the number of mini-batches and the number of epochs; usually one fixes the number of gradient steps on each mini-batch. Those are the knobs you have to tune.

I can recommend some references. The classical one is the Deep Learning book by Ian Goodfellow and Yoshua Bengio, with an updated version online. There is also a very nice paper by LeCun, Bengio and Hinton, the three winners of the Turing Award, a Nature paper explaining the state of the art. I also follow O'Reilly online courses on deep learning, one by Aurélien Géron, and I take some ideas from those. And of course there are tutorials you can watch at home, such as the ones by 3Blue1Brown; there are many, and one of them explains the idea of how to train a network. There is also the Neural Network Playground, which you can use while modifying the parameters to see how the training works, which is nice because you do not need to code. Okay, that is all I wanted to say. Thank you. [Applause]
Info
Channel: Institute for Pure & Applied Mathematics (IPAM)
Views: 161
Rating: 5 out of 5
Keywords: ipam, math, mathematics, ucla
Id: wcTGThokGAY
Length: 82min 5sec (4925 seconds)
Published: Wed Sep 15 2021