01. It is time for a theory of deep learning. Tomaso Poggio

Video Statistics and Information

Captions
Thanks to the organizers for inviting me, and thanks to the sponsors; this is a great venue and a great city, and it is time for Americans to explore other possibilities. It is also the time for deep nets, of course. Why is that? Well, the reason is that I believe we need to understand intelligence. The problem of intelligence is not only one of the great problems in science, like the origin of life, the origin of the universe, or the structure of space and time; it is probably the greatest one, because if we make progress on it, then we can more easily solve all the other problems. And of course we want not only to develop the engineering, the software and the hacks, but also the science of it, because the engineering of tomorrow will come from the science of today.

In fact, if you look at what happened in the last few years, two of the great successes in AI and deep learning are AlphaGo (this is Demis Hassabis here) and Mobileye, which was recently acquired by Intel and developed autonomous driving systems. This is a deep network labeling, in the image, the areas that are free for the car to drive on, and labeling with different colors, at the borders, the different types of objects, from the curb and the street to other cars. A small parenthesis: this runs in real time with very good accuracy. This, instead, was twenty years ago in my lab, a project with Daimler; these are actual streets in Ulm, the site of Daimler's research lab and also, of course, the place where Einstein was born. You can see one of the very first systems based on machine learning, at the time support vector machines, trained on a few thousand images to detect pedestrians. The point here is that there are a few mistakes; you can see the traffic light labeled as a pedestrian. At the time we were very happy: this was '95, and we had about one error every three frames, which translates into ten errors per second, which means of course that it was not usable at all. Now, for the same problem, Mobileye has about one error every 30 or 40 thousand kilometers of driving, which translates into roughly a million times better accuracy, an accuracy that has been doubling every year for twenty years. So that is a kind of Moore's law for deep learning.

The real point, coming back, is that in the algorithms inside AlphaGo and Mobileye there are two basic schemes: one is deep learning, the other is reinforcement learning, and both are inspired by neuroscience. For instance, the architecture of deep learning came from studies by Hubel and Wiesel in the sixties at Harvard, recording from the visual cortex of monkeys and proposing a hierarchical structure for V1, the primary visual area, in terms of simple and complex cells. There were quantitative models after that, this is one of them from my group and an earlier one by Fukushima, that showed how this hierarchical architecture could work. The architecture, with a different way of training of course, is the same in recent deep learning multi-layer networks; the state of the art, more or less, is ResNet, on the extreme right. So, skipping this, the somewhat ironic situation is that these deep networks are now used to better understand the visual cortex, a full circle back to the original inspiration. This is work by Dan Yamins and Jim DiCarlo, in which they show that units in the model have very similar tuning to actual neurons in the IT and V4 areas of the monkey. And this is really ironic, because we are trying to understand something we don't understand, the brain, using models of something we don't understand how they work, deep networks.
So a theory is really needed, and let me try to tell you about some of the work we have done in this direction. First of all, all of you know that deep networks consist of multiple layers: a simple weighted sum of the inputs to each unit, a non-linearity, which these days can be more or less anything but is typically the linear rectifier, and a training procedure, which most of the time is stochastic gradient descent trying to minimize the error on a large training set.

There are at least three types of questions you can ask. The first is really approximation theory: why, and when, are deep networks better than shallow networks; what is their representational power? The second is about the optimization: when you use stochastic gradient descent, even on networks as relatively simple as the ones used to solve CIFAR, which is a mini version of ImageNet, you have about 60,000 data points, images of 32 by 32 pixels, and up to a million parameters. It is a huge optimization problem, and the fact that you can actually minimize the error on it, that it converges, is quite impressive. The third one is really a machine learning theory problem: the networks that you find by the optimization process are largely over-parameterized, so you would expect them to overfit and not to generalize, but instead they seem to do pretty well on new examples. So those are the three key problems, and let me go through all of them, especially the first one: why and, especially, when are deep networks better than shallow networks? The second, as I said, is about the landscape of the empirical risk, and the third is about the generalization problem.

OK, let me give you some background. Typical shallow networks, one-layer networks, are networks like support vector machines, radial basis functions, regularization networks, in which you minimize a functional comprising an empirical loss part, which could be the square loss, plus a regularization term such as the norm in a reproducing kernel Hilbert space. The solution of these problems, more or less independently of the loss V, is always a linear combination of kernels evaluated at the examples, and you have as many parameters as data points; that is important in this type of network. These models can essentially be written as a network with, as I said, as many units as data points, where the coefficients c are the weights.
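To make the kernel-machine form concrete, here is a minimal sketch of the solution just described, f(x) = sum_i c_i K(x, x_i), with one coefficient per training example; the Gaussian kernel, its width, the regularization constant, and the toy sine data are illustrative choices, not details given in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-dimensional regression data
x_train = rng.uniform(-3, 3, size=50)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(50)

def gaussian_kernel(a, b, sigma=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

# Regularized least squares in the RKHS (kernel ridge regression):
# minimizing sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2 leads to (K + lam I) c = y
K = gaussian_kernel(x_train, x_train)
lam = 1e-3
c = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def f(x):
    # The learned function is a linear combination of kernels centred on the
    # training examples: one unit, and one parameter c_i, per data point.
    return gaussian_kernel(x, x_train) @ c

x_test = np.linspace(-3, 3, 7)
print(np.round(f(x_test), 2))
print(np.round(np.sin(x_test), 2))
```

The point to notice is exactly the one made above: the number of parameters equals the number of data points, whatever the loss.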
Now, from the point of view of approximation theory, it has been known since the 80s at least that if you have a generic continuous function of n variables, say eight variables in this case, you can approximate it arbitrarily well in the sup norm on a compact domain by a one-layer network, where the non-linearity can be essentially anything that is not a polynomial; in particular it can be the linear rectifier I described before, so each unit has that form. It was partly because of this universality property that people did not explore the multi-layer option very much: apparently there is no advantage from the point of view of approximation. But there is a well-known problem: if you want to learn, to represent, a function of d variables, in this case eight, you run into what Bellman called the curse of dimensionality; in general you need an exponential number of parameters, the units in our case. If you want to approximate a function within epsilon, say epsilon is 10% in the sup norm over a certain range, you need a number of units of the order of (1/epsilon)^d. For instance, if epsilon is 10%, then 1/epsilon is 10, and if d is 8 you have 10^8; but if d is a thousand, like in CIFAR, you have 10^1000, which is a very large number. So that is a big obstacle.

How do you avoid it? Classically, what people have done is make assumptions on the target function, the function you have to approximate, in terms of smoothness, because if the function is smooth then the exponent that appears is the dimensionality divided by the smoothness, say the number of derivatives. But another way is to consider functions of the type we call hierarchically local compositional functions. This is a specific set of multi-dimensional functions, here of eight variables but it can be d, in which the function is a function of functions of functions, and all the constituent functions have dimensionality two, or more generally a small dimensionality. For instance, in the case of CIFAR, or in all the cases in which you use ResNet, the kernel is about three by three, so that would be the dimensionality of the constituent functions. Now, this is an assumption not on the network but on the target function that you want to approximate, to learn, with the network. Assume the target function has a structure like this; then we can prove that you no longer have the curse of dimensionality. In other words, if you use a deep network with an architecture similar to the architecture of the function, the number of units you need has a linear dependence on the dimensionality instead of an exponential one.

A key point is that the miracle of avoiding the curse of dimensionality is not due to weight sharing. Convolutional networks are a special case of this structure, but the key is not the weight sharing, it is the locality of the constituent functions. Weight sharing would mean that the constituent functions at each level are the same function, that the weights are the same: all of these would be the same, and then these two would be the same. That would be weight sharing; the magic in avoiding the curse of dimensionality is instead that each node is a function that depends only on two variables, that looks only at a small part of the image, if the input is an image. The proof is actually pretty simple: it essentially says that you are approximating within epsilon at each stage, and the error you make propagates in a good way; it is not amplified as you go up the architecture.

By the way, this relates to a very basic observation. What is the simplest representation of a function? A table. In one dimension, suppose you have one variable with ten bins; you have ten entries in the table. In two dimensions you already have 10^2, a hundred entries, and if the dimension is eight you have 10^8. That is the curse of dimensionality. But if you have a compositional function, instead of one d-dimensional table you have on the order of d two-dimensional tables, which is of course much less. (In reply to an audience question:) Yes, although it is interesting, because you don't have long-range dependencies here at this level, but at the end you will have them, so yes.
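A tiny counting sketch of the lookup-table argument above; the choice of ten bins per variable and the particular dimensions are illustrative, and the binary-tree count of d - 1 two-input tables is a natural reading of the "d two-dimensional tables" statement.

```python
# Entries needed to tabulate a function of d variables with 10 bins per variable:
# one flat d-dimensional table (the curse of dimensionality) versus a binary tree
# of two-dimensional tables, one per internal node (the hierarchically local case).
bins = 10
for d in (2, 8, 32, 1000):
    tree_entries = (d - 1) * bins ** 2   # grows linearly with d
    print(f"d = {d:4d}   flat table: {bins}^{d} entries   tree of 2-D tables: {tree_entries} entries")
```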
And of course you can approximate any function you want arbitrarily well, even if it is not compositional, by superpositions of compositional functions; it is just that at that point you lose the exponential advantage. As a more generic remark, what I described is a special case of compositionality, the hierarchically local one. In general, compositionality just means functions of functions of functions, and that is a very generic notion: the theory of recursive functions in computer science tells you that every computable function can essentially be constructed by composing a few elementary functions, so every function is really compositional. In particular, I can write the computation in this form, mapping the inputs into the same number of dimensions, and then iterate it, F of F of F; in this way I can write a finite state machine, and if I let the number of iterations T be arbitrarily large, this is a Turing machine. So that is a very basic statement I am making about compositional computations.

To summarize: up here is the graph of the function, the target I want to learn, what I want to approximate, and the networks are also functions, the parameterized functions I am using to approximate it. In general, if all I know is that the target is a generic function, I cannot do better than shallow networks, even with deep networks; but if the function has a structure like this, or certain other structures, then deep networks can avoid the curse of dimensionality and be much better. Let me skip the details; there are a number of cases. One is the low dimensionality of the constituent functions, another is high smoothness of some of the components, and then there is also the sharing of tasks in learning: you can think, for instance, of the first layers of a deep network being used for a variety of tasks, for different classes, with only the final layers being task-specific; if part of the computation is common across tasks, you save quite a bit of sample complexity. So there are a number of reasons why compositionality is interesting.

We actually did a number of experiments verifying this result. Here is one on CIFAR in which we have a shallow network; this is the training error, this is the test error, and these are epochs of the SGD algorithm. The best performing one is the convolutional network with shared weights, but giving up the shared weights while keeping the locality also works pretty well. And, as a historical comment, these results about avoiding the curse of dimensionality, which are related to a conjecture about compositionality that Yoshua Bengio made several years ago, are very closely related to the old results on Boolean functions and logical circuits, which say that certain functions, like the parity function, can be represented much more efficiently by a deep circuit than by a shallow one.
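As a toy illustration of that circuit analogy (a sketch with details chosen here, not taken from the talk's slides): the parity of d bits can be computed by a binary tree of two-input XOR gates, using d - 1 local gates and depth of about log2(d), whereas classical circuit-complexity lower bounds show that very shallow circuits for parity need exponentially many gates.

```python
from functools import reduce

def parity_tree(bits):
    """Parity of the input bits via a binary tree of 2-input XOR gates.

    Uses len(bits) - 1 gates and depth ~ log2(len(bits)); each gate is local,
    looking at only two values, like the constituent functions discussed above.
    """
    layer = list(bits)
    while len(layer) > 1:
        nxt = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:            # an odd element passes through to the next level
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

bits = [1, 0, 1, 1, 0, 1, 0, 0]
assert parity_tree(bits) == reduce(lambda a, b: a ^ b, bits, 0)
print(parity_tree(bits))
```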
So an interesting open question is why compositional functions, which are presumably a set of measure zero in the space of all possible functions, are so interesting: why do they seem to come up in vision, in speech, in language, in text? Well, first of all, notice that speech and text are man-made, and they are compositional almost by construction: text is made of letters, of words, of sentences, of pages, chapters and books, and speech is similar. But vision is a bit different, so why should vision be compositional? A colleague of mine, Max Tegmark, says it is because of physics: the Hamiltonians that generate the universe, and the things in the universe, are pretty local, so the generative process is local, and that implies that the inference process is also local. Maybe. I am more of the idea that it may be a consequence of neuroscience. In other words, suppose that brain evolution finds it easier to make local connections; for instance, in our visual system we know that the ganglion cells in the retina look only at a small number of photoreceptors, and so on, and local connectivity is much easier to wire up during development. Suppose this happens; then our brain, our visual system, will be wired up to approximate compositional tasks well, and those are the things we can solve and the questions we can ask, and that is why those are the interesting ones. It is a kind of upside-down explanation. And this is just to show that one of the main topics of the book Perceptrons was really the locality of computations: the cover shows spirals, and the task is to decide which one is connected, something that is not a local computation and that we find very difficult to do.

OK, so the second part: the optimization. Here are some relatively simple observations. Suppose I write down the exact solution, that is, I look for the function, the network, that exactly fits the data. In the case of CIFAR, for instance, you have 60,000 equations with a few hundred thousand parameters. Now, the parameters enter in a form like this: a linear combination of some of the variables, and then the rectified version of it; that is each one of the units in the network. Suppose I make a univariate polynomial approximation of each of these terms, so a polynomial in one variable, which is this linear combination of x1 and x2; it could be a polynomial of order three or four or six, I forget which one we used, and I can choose a degree high enough that the approximation is as good as I want (a small numerical sketch of this univariate fit follows this passage). If I do this, then I have a large system of polynomial equations, and I can apply some results about the solutions of such systems. More directly, if you don't believe in the approximation argument, you can actually replace the rectifier in the network with its polynomial approximation, which we have done, and you find that it behaves essentially in the same way: this is the training and test error for the actual network, and this is the polynomial one. At this point you can use Bezout's theorem, which tells you how many solutions a system of polynomial equations has; actually it gives you an upper bound, but the upper bound is very high, more than there are atoms in the universe. So you have a lot of solutions of these equations, provided the equations are consistent, which means that your network can actually represent the function that generates the data. Because there are more parameters than data, the theorem tells you that, as in the case of linear equations, the solutions are in general degenerate: the solutions that give you zero error are degenerate, because you have 60,000 equations but many more parameters, whereas the local minima, the zeros of the gradient, satisfy as many conditions as there are weights, because you have to set all the gradients to zero.
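Here is the small univariate sketch mentioned above; the degrees and the interval are chosen here for illustration, since the talk only recalls using "order three or four or six".

```python
import numpy as np

# Least-squares polynomial approximation of the ReLU nonlinearity on [-1, 1].
# Replacing each unit's rectifier by such a polynomial turns the network's
# interpolation conditions into a system of polynomial equations.
x = np.linspace(-1.0, 1.0, 2001)
relu = np.maximum(x, 0.0)

for degree in (2, 4, 8, 16):
    coeffs = np.polynomial.polynomial.polyfit(x, relu, degree)
    approx = np.polynomial.polynomial.polyval(x, coeffs)
    print(f"degree {degree:2d}: max |error| on [-1, 1] = {np.max(np.abs(approx - relu)):.3f}")
```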
So, in generic terms, the zero-error solutions are degenerate, there are actually varieties of infinitely many of them, while the critical points, the local minima, are generically non-degenerate. This brings us to the next puzzle, which is that, as I mentioned, despite the fact that the zero-error minimizers are degenerate and you have more parameters than equations, you still seem, in general, to do pretty well in terms of prediction, of expected error. The reason is a few properties of stochastic gradient descent that I don't think have been remarked on before; if somebody knows otherwise I would love to hear about it. It seems that stochastic gradient descent minimizes subject to the constraint of finding flat minima, and that means degenerate minima; so it will preferentially find the zero-error minimizers rather than the other local minima, and furthermore, as we will see, these large minima correspond to solutions that have good expected error ("generalize" is not quite the right word).

Let me start with what in physics is called the Langevin equation, which describes the motion of a particle in a potential field under Brownian noise, the derivative of Brownian motion. (Sorry, I have a notation problem: this V should be U, the potential.) It is a stochastic differential equation driven by the full gradient of U, and its asymptotic probability distribution, in the sense of time going to infinity, is the Boltzmann distribution, proportional to e to the minus U over T up to a normalization constant (there is another typo on the slide: there should be no equality sign, just the normalization constant), where U in our case is the loss function. The loss depends, in our case, on the parameters that matter, the weights, and there will be three hundred thousand or so of them. Now suppose the potential has two minima of the same depth, one a bit wider than the other; here it is actually twice as wide, but a factor of 1.001 would have exactly the same effect. If you look at the Boltzmann distribution in one dimension, this is what you find; this is in two dimensions, in three, in four, in five. In other words, as the dimensionality increases, the narrower minimum disappears from the asymptotic distribution, even though the two have the same depth. So stochastic gradient descent, or in this case the Langevin equation, minimizes, but among the minimizers it concentrates, because of the high dimensionality, on the larger minima, because that is where all the volume, all the probability mass, is. This also means that if you have a degenerate minimum and another one that is not degenerate, it will preferentially find the degenerate one. Again, this is in one, two and five dimensions, and we have 300,000, so the concentration effect due to the dimensionality is very large.

This was for the Langevin equation, but stochastic gradient descent, which takes the gradient of one point at a time, or of a small mini-batch, can be written as a Langevin equation in which the noise is a pseudo-noise, the difference between the full gradient and the mini-batch gradient. It is a particular type of noise, but it is basically a Gaussian perturbation, because of the central limit theorem acting within the mini-batch, for the right size of mini-batch; it is very similar, and empirically we find that it behaves in the same way as the Langevin equation.
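A minimal numerical sketch of the concentration argument, with curvatures, temperature and dimensions made up for illustration: two basins of equal depth, one wide and one narrow, whose Boltzmann masses are compared in log space using the standard Laplace (Gaussian) approximation for well-separated quadratic minima. As the dimension grows, essentially all of the probability mass ends up in the wider basin.

```python
import numpy as np

def log_basin_mass(depth, curvature, d, temperature):
    """Log Boltzmann weight of a quadratic basin U(w) ~ depth + curvature * ||w - w*||^2,
    i.e. log of exp(-depth / T) * (pi * T / curvature) ** (d / 2) (Laplace approximation)."""
    return -depth / temperature + 0.5 * d * np.log(np.pi * temperature / curvature)

T = 0.1
c_narrow, c_wide = 10.0, 1.0       # equal depth, different curvature (width)
for d in (1, 2, 5, 50, 300_000):   # 300,000 is roughly the number of weights mentioned
    log_narrow = log_basin_mass(0.0, c_narrow, d, T)
    log_wide = log_basin_mass(0.0, c_wide, d, T)
    frac_wide = 1.0 / (1.0 + np.exp(log_narrow - log_wide))
    print(f"d = {d:6d}   fraction of Boltzmann mass in the wide minimum: {frac_wide:.6f}")
```

Even a width ratio very close to one gives the same outcome once d is in the hundreds of thousands, which is the point of the 1.001 remark above.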
So this means that stochastic gradient descent will find large, flat minima; large minima usually have a flatter region than the smaller ones, and it is possible, in the classification case, to connect flatness and volume to the classical notion of the functional or geometric margin between the classification boundaries of, say, two classes. So stochastic gradient descent minimizes under the constraint of finding the flattest, largest minima, and a large margin implies good generalization, where of course what "good" means depends on the distribution of the data.

This, I think, explains some recent puzzles. Here is what happens on CIFAR when the number of examples goes above the number of parameters; in this case we have 10,000 parameters. When the number of examples is smaller than the number of parameters, we have zero training error and the test error is going down, and it continues to go down as you increase the number of examples beyond the number of parameters; this is something people usually don't do. And here is what happens with random labels; this was a student of mine, together with Samy Bengio and Ben Recht and other authors I forget, showing that if you randomize the labels on CIFAR, then of course the test error is at chance, 0.9, because there are ten classes, and the model can still fit the training data as long as the number of examples is smaller than the number of parameters. That was the puzzle: you can fit the random data with the same network that does pretty well with the normal labels. But then, as you increase the number of examples, training error starts to appear and increases, and presumably it will asymptotically approach the test error, so that would be generalization again.

This one I find even more interesting: here the number of data is fixed, 50,000, and we increase the number of parameters within the same architecture. What you find is that the training error goes to zero essentially when the number of parameters reaches the number of training data, and then, as you keep increasing the number of parameters, the test error remains the same: you do not overfit. That is maybe the surprising part, but in terms of the theory I described it is just that at this point you have a large-margin solution, and when you increase the number of parameters further, the extra over-parameterization is basically not seen by stochastic gradient descent: the degenerate directions are invisible to it, because the gradient along them is zero. It is exactly the same argument one uses to prove that stochastic gradient descent, in the linear case, finds the minimum-norm solution. So I think we can explain, not quantitatively but qualitatively, all of these puzzles.
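The linear case appealed to here is easy to check numerically. A minimal sketch with arbitrary sizes, not an experiment from the talk: plain SGD on an over-parameterized least-squares problem, started at zero, drives the training error to zero and lands on the minimum-norm interpolating solution given by the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # far fewer equations than parameters
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

w = np.zeros(p)                     # starting at zero keeps the iterates in the row space of X
lr = 0.005
for _ in range(100_000):            # plain SGD on the square loss, one example at a time
    i = rng.integers(n)
    w -= lr * (X[i] @ w - y[i]) * X[i]

w_min_norm = np.linalg.pinv(X) @ y  # the minimum-norm solution of X w = y

print("max training residual  :", np.max(np.abs(X @ w - y)))
print("distance to min-norm w :", np.linalg.norm(w - w_min_norm))
```

Both printed numbers come out near zero: among the infinitely many zero-training-error solutions, SGD never moves along the degenerate directions, so it picks out the minimum-norm one.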
So let me conclude. There is much more work to be done, along the lines I described or in other directions. The hope is that if we understand better how these networks work, then we can understand better how they fail, which matters for practical applications, and, even more important, it can help us improve them. One of the improvements, of course, would be to reduce the number of labeled examples, because that is still one of the problems today. I think one can look at the history of computer science and see at least three ages. One is the age of programmers: in the age of programmers you had to pay very smart people quite a lot of money to make a system behave in an intelligent way. We are now in the age of labels, in which you pay a lot of people not much money to label data, and both of these, especially the second, do not scale: the number of labeled examples keeps growing without bound. But children don't need big data; we would really like to be able to show one car and one plane to a machine, as we do with a child, and have it immediately generalize very well. So where I would like to go next is computers that learn like children, from experience. Thank you. [Applause]

[Q&A; the opening exchange is partly inaudible.]

Question: I find the result you presented about stochastic gradient descent very illuminating, so let me ask a provocative question. Stochastic gradient descent is an algorithm that works with a lot of noise, noise of the procedure in a sense, and it also includes some regularization, because you have actually shown that. So in a way you are using the optimization to also imply some regularization. You could turn things around and say: why not define the proper loss function, with a proper regularizer, use a proper optimization method, and go to the right optimum? I wonder whether you have also looked at something like a Newton method, which would not have this noise included; what would happen there?

Poggio: That is a good question. We have not yet explored how to improve things; that is part of the program. As I said, we are trying first to understand what is going on, and then possibly to improve it. So I really don't know whether SGD is the best we can do; right now I don't know. It seems pretty good. I find it almost magical that it finds the degenerate solutions, and that those are also the ones that do best in terms of expected error for classification. That does not mean it is the best one, of course.

Question: It feels like you are shifting the regularization around, moving the joker from one part of the field to another. Which is the right way to write it down?

Poggio: Well, first of all it is not clear, and I am not sure we have answered this question completely; it is not completely clear what the implicit regularization in SGD is exactly, or how to characterize it, because if you write SGD down there is no regularization term. So it is already interesting to find that the noise essentially has this effect. I have tried to explore, and have some results on, a connection with so-called robust optimization, which is optimization in which you look for the minimum that is the most tolerant to perturbations of the parameters or of the inputs. So there is a connection there, but it is not perfect; I can make it rigorous in the linear case, meaning one-layer networks, but not in the multi-layer ones, so there is more work to be done. In the linear case the connection to regularization is very clear, you get a quadratic or another regularization term coming out; otherwise it remains implicit in this concentration phenomenon.
What I can say is that, among the minima with the same value of the loss, SGD selects with very high probability the one that has the largest volume.

Question: About the idea that flat minima generalize: maybe you know about our recent papers showing that you can easily transform a solution that is a flat minimum into another one that is not flat at all but computes the same function. How do you fit that in?

Poggio: Well, your result does not say that sharp minima don't generalize, right?

Question: No, that is true. But it says that it is not because a minimum is flat that it generalizes.

Poggio: Right, but I don't think the construction works for isotropically flat minima. You show that if you take a flat-minimum solution you can make a very simple transformation that does not change the function; but if the minimum is flat in all the weights, then the renormalization you do does not work. So a minimum that is isotropically flat cannot be transformed into a sharp one in that way, and empirically we find that we have isotropically flat minima.

Question: I found the first part very interesting; my question is as follows. The universal approximation theorem says that you can approximate every continuous function on a compact set arbitrarily well, but it does not say much about the complexity of those networks, in terms of the number of neurons or the sparsity of the connectivity. So, in your setting, for the specific functions you consider, are the networks you use for approximation in some sense optimal, optimally sparsely connected, or minimal in the number of neurons? Are you interested in that question, or do you have any thoughts on it?

Poggio: Let's see. Our setting is an example of a function space in which you have a dimensionality d and, say, s derivatives of smoothness, and for a generic function of this kind the number of parameters needed for approximation grows like epsilon to the minus d over s, whereas for the compositional class a deep network needs a number of parameters that is not exponential in the dimensionality. We can say this, and we also have a number of examples of specific functions, and other people like Telgarsky have them too, which cannot be approximated efficiently by shallow networks. So in terms of complexity we give these rates of approximation; I don't know exactly what more one can do in that direction.
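A back-of-the-envelope comparison of the two unit counts, with illustrative numbers; the shallow rate epsilon^(-d/s) is the one just stated, while writing the deep, hierarchically local rate as (d - 1) * epsilon^(-2/s) is a paraphrase of the "linear instead of exponential in d" statement earlier in the talk, so treat the constants and exponents as placeholders rather than exact theorem statements.

```python
eps = 0.1       # target accuracy
s = 2           # smoothness (number of derivatives), illustrative
for d in (8, 32, 128):
    shallow_units = eps ** (-d / s)           # generic smooth function, shallow network
    deep_units = (d - 1) * eps ** (-2 / s)    # hierarchically local compositional case
    print(f"d = {d:4d}   shallow ~ {shallow_units:.1e} units   deep ~ {deep_units:.1e} units")
```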
Question: You showed that stochastic gradient descent is mostly insensitive to over-parameterization, and the example you presented had fifty thousand labeled examples. I would like to suggest that if you used much less data, maybe not even a hundred examples, maybe ten or five, then it would no longer be insensitive to over-parameterization, and the exact parameterization would become much more important.

Poggio: So you are saying: instead of sixty thousand data points, a much smaller number?

Question: If we were trying to learn from very few examples, then the structure that you enforce on the network would be much more important.

Poggio: Which slide are you referring to? The one with the fixed number of training examples and a varying number of parameters; I think it is the one before the last. Yes, this one.

Question: So, if you ran this experiment again, but instead of 50,000 training examples you used ten, or fifty, or five, or one, or zero, because I am actually thinking about unsupervised learning. My point is that in unsupervised learning the structure of the network, which relates to your first topic, the kind of compositionality prior you apply, is where you actually give the hint to the data. In the last year we have had excellent results in unsupervised learning from all sorts of groups, things we did not think were even possible, for example all the CycleGAN type of work, and I think much of that is because the structure of the network dictates the solution; there you cannot assume that over-parameterization gives the same outcome.

Poggio: Could be; I am not sure I completely understand, but of course the structure of the data is important in the supervised case as well. In this case the compositional structure is what matters, and ideally the network has to be able to implement a function that includes the target function, the function that actually produces the data, the conditional probability of Y given X.

Question: Looking at this curve, I assume that if the training set were smaller, the green curve would be further to the left and the yellowish curves would simply be higher.

Poggio: Say it again? Yes, it would be on the left; extrapolating, if the green line were to the left of where it is, the extreme being right at the axis, then the test error would be very high.

Question: My question is more whether the orange line would remain horizontal, or whether it would actually rise.

Poggio: What do you think? You can check.

Question: I think that with very few training samples you would not see this phenomenon of insensitivity to over-parameterization. If you use too many parameters and you are in the regime of n equal to one, or of unsupervised learning, then it becomes very important to use exactly the number of parameters needed to model the problem.

Poggio: I think I agree, because otherwise you would not be able to implement an approximating function that is powerful enough, if you have too few parameters. Another way to say it is that you need more regularization, and it can come in the form of the structure of the network much more than in the form of explicit regularization.

Question: That is what is fascinating here: you don't need regularization, you add parameters, and you still get this plateau of performance. I am saying that with very small n you need more or less exactly the minimum number of parameters required to express the function, and not much more.

Poggio: Yes, I agree. And I guess this was the point in your paper: the classical theory has been emphasizing generalization, which technically is the convergence, as n goes to infinity, of the training error to the test error. In this regime you don't have generalization in that sense: if you have more parameters than data and you let both go to infinity, I bet you never have generalization, you never converge. This is actually quite well known, because one-nearest-neighbor, which is the simplest possible learning algorithm, behaves exactly this way: it performs pretty well, there is the classical bound that its error cannot be worse than twice the Bayes error, but you never have generalization.
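The one-nearest-neighbor point is easy to reproduce; here is a minimal sketch on toy data chosen for illustration (two overlapping one-dimensional Gaussian classes). The training error is exactly zero, because each training point is its own nearest neighbor, while the test error stays well above zero however large the training set, consistent with the twice-the-Bayes-error bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Two overlapping 1-D Gaussian classes; the Bayes error is about 0.16 here."""
    y = rng.integers(0, 2, n)
    x = rng.normal(loc=2.0 * y, scale=1.0)
    return x, y

def one_nn_predict(x_train, y_train, x):
    dist = np.abs(x_train[None, :] - x[:, None])     # pairwise distances
    return y_train[np.argmin(dist, axis=1)]          # label of the nearest training point

x_tr, y_tr = sample(2000)
x_te, y_te = sample(2000)
print("train error:", np.mean(one_nn_predict(x_tr, y_tr, x_tr) != y_tr))  # exactly 0
print("test  error:", np.mean(one_nn_predict(x_tr, y_tr, x_te) != y_te))  # stays well above 0
```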
Question: I am quite interested in the last part, about unsupervised learning. We already have a lot of results, and I don't know whether the deep architecture is the key, or semi-supervised and unsupervised learning, or transfer learning, or regularization. What do you think the key challenge for unsupervised learning will be? Do we need a completely new concept?

Poggio: Well, we have to try and we will find out. I don't think we need completely new concepts, but there are a lot of tools, there is certainly a lot of work to be done, and very interesting results are still to be found. In particular, I find it interesting that when I was studying physics many years ago, we were told that in order to fit parameters you should have at least twice as many data points as parameters, and ten times as many is better. Then we reached the stage of kernel machines, in which you have exactly as many parameters as data. And now we are in an era in which having more parameters than data seems to work fine. I think we can understand all of this, but it opens up new possibilities, and I am not sure we really understand yet how to use it optimally. Coming back to your question: I hope not, so that we can improve.

Question: Maybe I can ask another question about the first part. It also relates to the work by Shashua and colleagues, who use tensor decompositions to analyze these models. Can you say a bit about how that relates to your work?

Poggio: That is a good question. I think the theorem we have on the composition of functions is actually a foundation for the tensor work. They use the hierarchical Tucker decomposition of tensors, and that is very similar, in terms of its organization, to the hierarchical compositionality we have. In the tensor literature I never found theorems that characterize the class of functions for which the hierarchical Tucker decomposition works; I think our theorem says exactly that.

Question: In the first part you mentioned that having multiple tasks to solve improves the performance of the network. Do you have a quantitative result on this?

Poggio: There are some results, actually not by us but by Andreas Maurer; I can give you the reference. There are bounds on the sample complexity that depend on having, in our case, part of the network used for a number of different tasks and another part that is specific to each task. It is pretty generic.

Question: I was wondering how the size of the mini-batch affects the efficiency of stochastic gradient descent, because the noise is pretty important, right? You said that the noise behaves similarly to Brownian noise, and that this is probably why it generalizes so well, since it finds the flat minima; maybe that also relates to the earlier question, because with a smaller training set you don't have so much noise. So if I increase the mini-batch size, how does that affect things? Does it still behave well, or is it always best to take one sample at a time to have the highest probability of finding the flat minima?

Poggio: Let's see, we have not looked at exactly this question yet; this is very recent work, still in progress. But I can tell you qualitatively what happens. If the mini-batch is very large, then you are essentially doing gradient descent, and then you may run the risk of getting stuck in local minima. If it is medium-sized, not too big and not too small, then within the mini-batch the central limit theorem is at work, and the pseudo-noise we see is essentially Gaussian. I don't know how important that is for the Langevin-type behavior, but you don't need the mini-batch to be very large for the central limit theorem to give you essentially Gaussian noise.
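A small numerical check of that central-limit-theorem point, on a toy least-squares problem chosen for illustration (none of the sizes come from the talk): the pseudo-noise, the mini-batch gradient minus the full gradient, becomes more nearly Gaussian as the batch grows, as measured here by its excess kurtosis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 20
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

w = np.zeros(p)                                 # any fixed point in weight space
full_grad = X.T @ (X @ w - y) / n               # full-batch gradient of the mean square loss

def noise_samples(batch_size, trials=5_000):
    """First coordinate of (mini-batch gradient - full gradient) over many mini-batches."""
    out = np.empty(trials)
    for t in range(trials):
        idx = rng.integers(0, n, batch_size)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        out[t] = (g - full_grad)[0]
    return out

for b in (1, 8, 64):
    z = noise_samples(b)
    z = (z - z.mean()) / z.std()
    print(f"batch {b:3d}: excess kurtosis = {np.mean(z ** 4) - 3:+.2f}")   # tends to 0 for a Gaussian
```

With a batch of one, the clearly non-Gaussian character that comes up in the next exchange shows up directly.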
Question: What if it is just a single point?

Poggio: Then you lose the Gaussian property, and it is no longer similar to the Langevin equation, so you do not get the Boltzmann distribution. It probably does not matter too much.

Question: I have not seen any empirical results showing that smaller mini-batches are bad in any way, except that they are slower.

Poggio: I think you are right; there are some empirical results saying that a mini-batch that is too large is bad, but nothing on the other side. That would be my guess too: the Gaussianity is important if you want a good approximation of the Langevin equation, but my guess is that probably everything still works even if the noise is not Gaussian.

Question: There is also a price when you have more parameters, right? The model becomes less interpretable, it takes more compute, it takes more space to store. Do you think that in the future our models will keep moving in this direction, larger and larger, because of the properties you showed? There is this trade-off, which is also important for practice.

Poggio: Well, I hope that developing a theory will help us understand how these things work. That does not necessarily mean understanding what each unit is doing; that may be too much, and maybe not the important thing. It means understanding the properties, and the kinds of failures you can expect, and so on. Having a lot of parameters may be a problem for hardware implementations; people are now starting to design special-purpose hardware and to compress networks, and reducing the number of parameters is important for that. But otherwise, having more parameters probably makes the minimization at training time easier: finding zero-error solutions is much easier if you are over-parameterized, especially because of these properties of stochastic gradient descent, in the sense that it seems to prefer the degenerate minima. So I don't know; it is a trade-off, and an interesting one. [Applause]
Info
Channel: Компьютерные науки
Views: 12,635
Rating: 4.9245281 out of 5
Keywords: Yandex, Яндекс, computer science
Id: Vx3uN0dt8M8
Length: 65min 3sec (3903 seconds)
Published: Thu Aug 03 2017