Class 01 - The Course at a Glance

Video Statistics and Information

Captions
... and I am Tomaso Poggio, and I am in charge of this class, which is, I think, the oldest one at MIT on machine learning: we have had it since the nineties, although of course it has been evolving a lot. The first step is to introduce you to the people who are much more important than me for the class, and especially for you — Giorgos among them.

The class has the following structure. The first part is really about the core of machine learning theory, the body of results that has been developed over the last thirty or forty years: I am speaking about regularization, support vector machines, classifiers — linear and polynomial — and some of the basic results about why, in machine learning, you can in a sense predict the future; I will come back to this. So the first eight classes are the core; the next ten or so are about topics that have been of interest over the last several years and applications of machine learning; and the last five classes are about deep learning, mainly from a theoretical point of view, and they rely on the first part quite a bit, especially on the first nine classes. In a sense, you can think of the first part as being about shallow networks — networks with one hidden layer — and the last part as being about deep networks with multiple layers; but in reality the first part gives you the foundation for dealing with both shallow and deep networks.

So, what I want to do today — today is a little bit of hand-waving, a sprinkling of the things we will speak about during the various classes, to give you a feeling, a taste, of what is going to come and some of the motivations for it. I will go through four parts today. The first is the main motivation: this is a golden age for artificial intelligence, and the engine of it is machine learning; and also the background about the Center for Brains, Minds and Machines, of which I am the director, and of whose educational mission this class is a core part. Then I will give you a little bit of history about the evolution of statistical learning theory and some of the main questions of the last twenty years, a quick tour of historical applications — mainly to let you know that there were a number of interesting applications well before deep learning came around — and then just a preview of results we will see at the end of the course.

First of all, CBMM. CBMM is a center housed on the fifth floor here, funded by the National Science Foundation — ten years, fifty million dollars — and it is really about understanding intelligence and making progress on it: understanding how the brain makes intelligence and how to build intelligent machines. This is both because we want to do science — we are curious about how the brain works and what intelligence is — but also because, eventually, understanding the scientific part of intelligence will allow better engineering. So if you are interested not in the engineering of today but in the engineering of tomorrow, that is what we want to do and that is what you should be interested in. I will come back to this later, but first let me make the point that there are a number of great problems in science and technology — problems about the origin of life, the origin of the universe, the structure of time and matter — and one of them is: what is intelligence? Personally I think it is not only one of the great problems in science, it is the greatest one, because if you can make any progress on it — if you can make yourself a bit smarter, or have machines that help you think better, which is happening — then you can solve all the other problems more easily. The bet we are making at CBMM is that, in order to make progress, computer science is not enough and machine learning is not enough; you also need to bring to the table what we are learning about the brain — about neurons and circuits in the brain — and what we know about the mind, about human behavior.
Our Center is therefore a mix not only of different institutions — the main ones are MIT and Harvard — but also of faculty from different domains: neuroscience, cognitive science, computer science. For instance, in computer science you have people like Shimon Ullman and Patrick Winston at MIT, and Leslie Valiant — he is not listed anymore — at Harvard; and you have neuroscientists like Bob Desimone and Jim DiCarlo, and so on. We also have a number of academic and corporate partners, and these are the usual suspects involved in the market for intelligence — Google mainly; some of the smaller companies have been acquired by the bigger ones since we started the Center. And we are quite a few people at the moment, including faculty, postdocs, and students: more than a hundred researchers working on this.

Here is a little advertisement for the things we are doing at CBMM. One of the best is the summer course at Woods Hole, which just ended last Sunday. It has been a very good initiative; I think we are creating a community of younger researchers who know about computer science and machine learning and also about neuroscience. If you are interested you can apply for the next edition, which will happen in the summer of 2018, around the second half of August: great faculty, even greater students — we had 45 of them, chosen from about 300 applications — and Woods Hole is very nice, so it is also fun.

Now, the background to all of this is what happened in the last twenty years or so, where we have had systems that can be better than humans at activities, like playing chess, that were supposed to be the domain of humans — or of intelligent humans. This is of course Kasparov and Deep Blue, and this is Jeopardy! and Watson. More recently, in the last three or four years, there have been a number of achievements in machine learning. For instance, the deep Q-network is a system developed by DeepMind — a company in London acquired by Google a few years ago — that learns to play Atari games better than humans. You have AlphaGo, also developed by DeepMind, which beat Lee Sedol, the unofficial world champion of Go. And there are systems like the one developed by Josh Tenenbaum here that can learn, from a small number of labeled examples, to recognize handwritten characters.

Just to give you a feeling for what I want to say, let us take two examples of success in machine learning in the last three years: one is the game of Go, the other is Mobileye and autonomous driving. This is Demis Hassabis, the founder and CEO of DeepMind; he was a postdoc upstairs. And this is Seoul — I was there at the competition — and this is Lee Sedol, the unofficial champion of Go, against whom AlphaGo played. The other one is Mobileye. [A clip plays: "In addition to preventing collisions, the system must sense the road in great detail, including lane markings, curbs, barriers, obstructions, recognize a large vocabulary of—"] Wait, okay. So this is Amnon Shashua — he was also a postdoc with me — and he started Mobileye several years ago, around 2000; it was sold a few weeks ago, actually, to Intel for several billion dollars. Now, an interesting point — this is kind of a parenthesis, but just to tell you how quickly machine learning is progressing.
This was a system that we had developed around 1995 — so about twenty years ago — in my group, in collaboration with Daimler: one of the first vision systems that was trained rather than constructed, hardwired by programmers. It was trained on images of people, and here the system is running on a PC in the trunk of a Mercedes in Ulm, where the Daimler research center is, detecting pedestrians. At the time we had one error every three frames — you can see some of them here, like the false detection of a traffic light — and we were very happy about that error rate at the time; this was 1995. But it translates into ten errors per second at 30 frames per second, which is completely unusable: scientifically fine at the time, but unusable now. The equivalent error rate of Mobileye on a similar problem is something like one error every 30,000 or 50,000 kilometers of driving, and if you make a back-of-the-envelope calculation this is about one million times better accuracy than we had, in twenty years. So it is like doubling the accuracy every year for twenty years — a kind of Moore's law of machine learning.

Another interesting thing about these two examples, AlphaGo and Mobileye, is that the basic algorithms used in these two applications are reinforcement learning and deep learning, and they both really come from neuroscience. Deep learning, for instance — and here I am speaking about the architecture on which the system runs, not about how the networks are trained — the idea of hierarchical networks of neurons with local connections among them is an idea that was essentially proposed by David Hubel and Torsten Wiesel at Harvard in the sixties, because of recordings they did from the visual system of monkeys. This is a diagram of the visual system of a primate — our visual system is very similar — and this hierarchical architecture with local connectivity is the same one used in implementations of deep learning, including the most recent, best-performing systems like residual networks. So this really comes from the brain.

The other point is that, despite all this progress, intelligence is far from being solved. There is an interesting statement from Mark Zuckerberg in his blog — his personal challenge two years ago was to build a personal assistant, Jarvis — and he says that we are still very far from understanding intelligence, and so on. So there is a growing realization that we are still far from understanding how the human mind works and how to replicate it. How can we go further? Well, one approach is more theory, and that is what we will follow in this class. The other one — which is the other class, starting this coming Friday — is to follow the inspiration of neuroscience, in the hope that suggestions like the one behind deep learning will come again for the next big step in machine learning. Interestingly, this is the topic of a paper that appeared a couple of months ago in Neuron, whose authors are from DeepMind — the first one is Demis Hassabis. That is interesting because DeepMind is arguably the most advanced place for AI research in the world, so having them say that in order to build better artificial brains we need to look at neuroscience is quite interesting. We will not really look at neuroscience in this class — as I said, there is the other one if you are interested — but I also think that, as I said before, neuroscience is not the only discipline we need to advance our understanding of intelligence; we also need other disciplines, in particular machine learning, and in particular learning theory.
So let me wrap up this first part. I told you why we are in a golden age of machine learning, I told you about the limitations, and I told you about ideas for how to make progress. And let me stress that, even with these limitations, the current advances in machine learning promise a lot of very interesting applications — things like autonomous driving, like Alexa, and many others.

Let me go to the second part of today, which is a quick story about statistical learning theory. First of all, the setting of the problem: this is really supervised learning. It is not the only form of learning, by far, but it is the core of any theory of learning. The framework, I am sure you know, is that you are given a set S of examples, and each example is an input-output pair. Typically you have vectors: x_1 could be a vector of various financial variables today, and y_1 could be just whether the S&P will go up or down tomorrow. You have a training set because you can look at the past, and what you would like to do from this training set is to synthesize a model — let us call it a function — that maps x vectors to y. In other words, I can put in the financial variables of today and get a prediction for tomorrow, whether the market will be up or down; if it is up I buy a lot of S&P futures, and if it is down I sell them, or something like this. Of course, as you know, history is not a guarantee of future performance, and so the big question is: under which conditions can I know that the function — the rule — I have learned from the data will be predictive? It is not just about explaining or fitting the data, which is what much of statistics does; it has to be predictive: it should work well on data that are not in the training set. A simple example in one dimension shows that if you have a few examples — points, a value of x and the corresponding value of y, the red dots — then there is essentially an infinite number of functions that can exactly explain the data, interpolate the data, and yet be quite different elsewhere. Even if I put restrictions like continuity on the functions, you can still see that there is too much freedom. This, by the way, is what is called regression. A different version of the problem, in which instead of a real number the output y is just a boolean variable like 0 or 1 — going up or down, is a person or is not a person — is a very similar problem: classification, for instance classifying images as pedestrians or not.

Just to give you a feeling for the kind of language one uses to deal formally with these problems: you assume that you have a probability distribution on the space that is the product of X and Y; this distribution, this density, is mu of z. You then assume that the training set consists of samples drawn independently and identically distributed from mu. You also assume that you have a space of possible models — of theories, of functions — that go from X to Y, and then you speak of a learning algorithm as a map that essentially looks at S and chooses a model, a function, from H. I indicate this function with the subscript S because it has looked at the training set — it is based on the training set — and this is the function that I would like to be predictive. Now, what does "predictive" mean? We will see during the class; here I am just giving you a taste of the kind of language we need in order to analyze this problem more formally.
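For reference, the setup just described in words can be written compactly as follows; this is only a summary of the notation used in the lecture, with nothing added beyond it.

```latex
% Training set of n input-output pairs, drawn i.i.d. from a distribution \mu
% on Z = X \times Y:
S = \{(x_1, y_1), \dots, (x_n, y_n)\}, \qquad (x_i, y_i) \sim \mu \ \ \text{i.i.d.}

% Hypothesis space of candidate models, and a learning algorithm viewed as a
% map from training sets to functions:
\mathcal{H} \subseteq \{\, f : X \to Y \,\}, \qquad A : S \longmapsto f_S \in \mathcal{H}.
```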
You can define the expected error — the error that your function is going to make on any z, in expectation; this is really the error on future data — and a proxy for it, which is the error you can measure on the training set: the training, or empirical, error. One can then speak, for instance, of what is called generalization, a condition that tells you that when your training set is very large, the training error is a good proxy for the test error. It is not the only concept; I am giving you this only to give you a feeling for the kind of language we are going to use. For instance, if you are a hedge fund: suppose you have a very large data set, you have trained an algorithm, you have an f_S, and the empirical error of this f_S is very small on the training set; now you want to bet a billion dollars on whether the market goes up or down tomorrow. It is nice to have a guarantee that your algorithm will generalize — in other words, that you can trust your proxy, the empirical error.

For those among you who are interested in philosophy of science, there are quite interesting connections between some of these results about predictivity and generalization and what science actually is. In the cartoon of science in which Newton observes experimental data — the apple is falling — and from this produces a theory that describes what is happening, the theory is predictive: it describes other apples falling, or even other things. This is of course a cartoonish version of science, but it makes the connection with learning theory clear: in learning theory you ask whether the model — the function learned from the data, from the apple falling, so Newton's equations — is predictive or not. That is the key question in learning theory, and it is also the key question in science. So there are some very interesting relations between these ideas in the philosophy of science and some of the theorems in learning theory. They probably do not have much to say about practice, but it is interesting to see the fundamental role of machine learning.

The classical algorithm of the past twenty years or so for ensuring predictivity is something like this. You minimize the empirical error — this V could be the square loss, so you are measuring the squared error of your predictor on the training set — and you try to find parameters of your predictor f, which would be like parameters in a network, weights, that minimize that error. But there is an extra term, which is essentially a constraint on how complex your predictor — your function, your network — can be. Often this is a condition like smoothness: you want f to be smooth in some sense, and this "some sense", as we will see in the class, is defined by your choice of a kernel K, which could be, for instance, a Gaussian or just a dot product. The important point here is just one: these classical regularization techniques — the ones you will see for the first ten classes or more — correspond to networks that have one hidden layer, where the only learnable coefficients are those from the hidden layer to the output. That is really a graphical version of this equation: each unit computes this K, and there is one unit per example. For instance, suppose the rule, the K, is just the dot product of a new input with one of the examples — how similar the new input is to that example. Each unit contains one data point, the input part of one of your examples; when a new input comes in — in this case two variables, two components, a new x — the unit performs the dot product, which is then multiplied by a weight alpha_i, and the alphas are learned from the training set by doing this minimization. This is basically how support vector machines, regularization networks, and splines work. ("Productivity" here is a typo — it should say "predictivity" — but the spell checker thought it knew better.) This is useful not only in itself — there are many problems for which techniques like support vector machines are still the best ones, especially when you do not have huge databases for training — but these concepts are also quite important if you want to understand how deep networks work.
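To put the words above into symbols, the quantities and the classical regularization scheme are roughly the following (V is the loss, K the kernel, lambda the regularization parameter); this is just a compact restatement of what the lecture describes.

```latex
% Expected (future) error and empirical (training) error of a function f:
I[f] = \int_{X \times Y} V\big(f(x),\,y\big)\, d\mu(x,y),
\qquad
I_S[f] = \frac{1}{n}\sum_{i=1}^{n} V\big(f(x_i),\,y_i\big).

% Generalization, informally: for the learned f_S,
\big|\, I_S[f_S] - I[f_S] \,\big| \;\to\; 0 \quad \text{as } n \to \infty.

% Classical (Tikhonov) regularization over a hypothesis space defined by a
% kernel K, and the resulting "one unit per example" form of the solution:
f_S = \arg\min_{f \in \mathcal{H}_K}\;
\frac{1}{n}\sum_{i=1}^{n} V\big(f(x_i),\,y_i\big) + \lambda\,\|f\|_K^2,
\qquad
f_S(x) = \sum_{i=1}^{n} \alpha_i\, K(x, x_i).
```

And here is a minimal NumPy sketch of the square-loss case with a Gaussian kernel — kernel ridge regression — where the alphas have a closed-form solution. The data, the kernel width, and lambda are made up for illustration; this is a sketch of the idea, not of any particular system from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: n examples with d features and scalar labels (synthetic).
n, d = 50, 3
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

lam = 1e-2                       # regularization parameter lambda
K = gaussian_kernel(X, X)        # n x n kernel matrix over the training set
# Square loss + lambda * ||f||_K^2 gives the closed-form coefficients:
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

def f_S(x_new):
    """The learned 'one-hidden-layer network': one kernel unit per example."""
    return gaussian_kernel(np.atleast_2d(x_new), X) @ alpha

print(f_S(X[:3]), y[:3])         # predictions vs. labels on a few points
```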
All right, now you are probably a bit tired, so this is a little bit of fun, just to show that the metaphor of learning — of having systems that can learn from data — is very broad and can be applied to a very large number of problems. This is really a little history of what we did in my group. One of the first problems we looked at, in the nineties, was face detection. Of course, any camera these days has this, but at the time there was no software able to do it by learning. So we developed a system that learned from a few examples of faces — a few hundred or so at the time — and one of the funny things was that the system was able to recognize line drawings of faces, even though it was not trained on line drawings. As I said, it was one of the very first such systems; there was a similar one at CMU. It took a few years for this kind of thing to come to the market — a typical lag, probably.

The next one was people detection. I already showed you the system with Daimler: we trained it with several hundred examples from images taken in Boston, and it generalized to Ulm — that is pretty good. And there were quite a few other ones: video compression, estimating the pose of an object in a scene — I will show you something about graphics — speech recognition, speech synthesis, decoding the neural code (trying to read out what you are thinking about from recordings of activity in your brain — actually not yours, but a monkey's), text classification, pricing stock options. I do not know how many of you know the Black-Scholes equation; you can learn it from data, with some constraints, and find out what the correct price of an option is, given volatility and strike price and other things.

One of these, as I said, was reading the neural code. The idea was that you could record the activity of a population of neurons in higher visual areas — this is the primary visual area, visual information starts here and then goes towards the front of your head, and these are the higher visual areas — recording from a population of a few hundred neurons, and from that activity be able to say what the monkey was looking at. We did that some fifteen years ago: you train a network with a training set of images and the corresponding activity — there are many such examples — and then you check whether you can predict a new object from the activity. There are a lot of restrictions on what a "new object" is in this case; it has to be similar to the ones used for training, but you can do it. Genetics: you can do some pretty good diagnostics of gene expression from gene arrays; this was done in collaboration with the Broad Institute here.
And of course you can do computer vision: from images, learn a function — again from training data in which the inputs are images and the outputs are viewpoints — and try to predict the viewpoint of new images. But, interestingly, you can also invert this. What about doing graphics? I use exactly the same example, but now my x becomes y and vice versa: the input is a viewpoint and the output is an image. An application of this — I will show it in a moment — was a system, also about fifteen years old but still pretty good, in which the voice is real but the video is synthetic. We had Mary speaking; the system learned, from four minutes of video of Mary, the mouth dynamics corresponding to the speech, and now we have a system into which anybody can speak, and the output is Mary apparently saying those words. So the video is synthetic, the voice is real — she did not actually say this; the voice is somebody else's — and it could even be Chinese. [Demo clips play: the synthetic "Mary" appears to speak, including in Chinese.]

With one of the later versions we had more. This was Katie Couric — if you are too young to remember, she used to host the Today Show, which was very popular. [A news clip plays: "Don't believe everything you see or hear. A graduate student at the Massachusetts Institute of Technology has developed a computer program that can manipulate video to make it appear as if you have spoken words that in reality you have not — even in a language you can't speak. Tony Ezzat is the MIT student who developed the program..." Further clips show TV news legend Walter Cronkite appearing to speak lines he never said, the voice of Katie Couric coming out of a synthetic video, and the reporter remarking "my voice can sound and look as if it's coming out of someone else's mouth" — NBC News, Cambridge, Massachusetts and New York.] And here is the actual test: pairs of short clips — "more news in a moment", "it's a matter of money", "that's all for tonight", "it could take months" — one real and one synthetic, and you have to say which is which. People got it right only about 53 percent of the time. Actually, you can learn to distinguish them, but if you have naive people looking at short segments, then it is basically chance which one they say is synthetic. So it is a kind of Turing test for video. Anyway, this was fun, and it is still a pretty good system; I think there is now, as of last year, a deep learning system that is probably better — I do not know exactly, but it would be an interesting project for the class to look into the state of the art of this kind of virtual actor, and its future.

Okay, so mainly for fun I showed you some old applications, to give you a feeling for how broad machine learning is — and for the fact that a lot of the things you hear and read about today did not just start now; there were quite a few years of machine learning successes, attempts, and good results before. So let me come to the deep learning part. I assume you all know about deep networks — who knows about them? Who does not? How many do not know? Yes, as expected, okay. For vision, the system looks something like this: you have an image as input — many thousands of pixels, typically — and then you have layers of units, of neurons, that process information from one layer to the next, until the last layer spits out, for instance, the class: which object it is.
This is, of course, after training — training with possibly millions of examples, which, as usual, are pairs of input images and output labels. The real enthusiasm for deep learning started about five years ago, when a deep network provided significantly better performance — maybe fifteen percent better — on ImageNet, a database of one million images in a thousand object classes: different types of dogs, of trees, of objects. The system outputs five classes, ranked by how likely they are; in these examples I think they are all right. The performance was far from perfect but pretty good — and "far from perfect" means it was probably better than humans at classifying different types of dogs, but it was also making mistakes that humans would never make.

It is interesting, from a neuroscience point of view, that you see a progression of complexity in what the different layers represent, or learn. If you look at which images most activate the units in the first layer after training, you find mainly edges — and this is what we believe happens in visual cortex: the first area, V1, is mainly stimulated by contours, by oriented edges in the image. In the second layer you find things more like textures, and in the fifth layer of this system you find units that are most stimulated by objects or object parts. Again, that is similar to what you find in the progression of visual areas in primate cortex. In fact, recent work by Dan Yamins and Jim DiCarlo here looked at the correspondence between a deep network — I think AlexNet, the original one, is what they used — and actual neurons in the brain of monkeys, and they found very good, surprisingly good, correspondence in properties between the two.

Altogether, the success of deep learning — and also the fact that it seems to give some insight into how visual cortex works — really begs for a theory. Nobody, not even the people working on deep networks, really knows why they work as well as they do. That is what we are working on, and it is what we will discuss in the last classes of this course. Progress is being made, in our group and in others, and so in the last five classes we will review the state of the art in explaining why deep networks work well. People are getting interested — statisticians like David Donoho at Stanford — and I think there are by now quite a few classes here at MIT on deep learning; ours used to be the only one, but that is no longer the case.

Let me tell you the basics of how a deep network works; it is very simple. You have units — neurons — and each unit receives a number of inputs, which may come from the input itself, so these would be the values of pixels in the image, or from the neurons below; so you have this structure of feed-forward processing, layer after layer, with each layer's neurons connected to many of the neurons that come after. These inputs have weights, and the weights are learned during training. Typically you have locality, in the sense that in the successful networks — they are usually called convolutional networks — each neuron receives inputs only from a small neighborhood of neurons below, not from all of them. The weighted inputs are summed, you get one number, and then this number goes through a nonlinearity, which used to be a sigmoid and is now what is called a ReLU, a rectified linear unit, typically called a ramp: it is zero and flat for negative inputs and linear afterwards. That is it; that is what the multiple layers do — that is the architecture. And how do you learn the weights? Typically by minimizing the empirical error on a training set. L is a loss function — it could be the square loss, so we are looking at the sum, over all examples i, of the squared errors between the prediction of your network, f of x_i parametrized by the weights W, and the label — and you look for the value of W that minimizes this. This can be a huge problem: for ImageNet you have more than one million parameters in W. But a variant of gradient descent — essentially following the gradient — surprisingly works in finding good values of W.
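As a concrete, hedged illustration of the forward pass and of training by gradient descent on the empirical square loss, here is a minimal NumPy sketch of a tiny one-hidden-layer ReLU network. The data, sizes, and learning rate are invented for illustration; real systems use convolutional layers, many layers, and stochastic gradient descent on mini-batches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n examples, d input features, scalar targets (all synthetic).
n, d, h = 64, 10, 32
X = rng.normal(size=(n, d))
y = rng.normal(size=(n, 1))

# One hidden layer of ReLU units followed by a linear readout.
W1 = rng.normal(scale=0.1, size=(d, h))
b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, 1))

def forward(X):
    z = X @ W1 + b1            # weighted sums into the hidden units
    a = np.maximum(z, 0.0)     # ReLU "ramp": zero for negative inputs, linear after
    return z, a, a @ W2        # network output f(x; W)

lr = 0.05                      # step size for plain gradient descent
for step in range(500):
    z, a, pred = forward(X)
    err = pred - y
    loss = np.mean(err ** 2)   # empirical (training) error
    # Gradients of the square loss with respect to the weights (backpropagation).
    g_out = 2 * err / n
    gW2 = a.T @ g_out
    g_hid = (g_out @ W2.T) * (z > 0)   # ReLU derivative is 0 or 1
    gW1 = X.T @ g_hid
    gb1 = g_hid.sum(axis=0)
    # Follow the negative gradient.
    W1 -= lr * gW1
    b1 -= lr * gb1
    W2 -= lr * gW2
    if step % 100 == 0:
        print(step, loss)
```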
Okay, I have given you just a sprinkling — we will see this in much more detail later in the class — but I want to tell you what the problems in deep learning are, at least the ones I am interested in. These are at least three of the main questions you can ask — not all of them. The first is: when, and why, are deep networks better than shallow networks? Shallow networks, as I said, are networks with one hidden layer and just one set of weights to be optimized — these would be kernel machines, support vector machines, and so on. Can deep networks express certain functions better, and what does that mean? So the first question is really a question of approximation theory: is the deep architecture more powerful — and when, and how — at approximating functions than one layer? The second question is about optimization: why is this optimization possible, or even easy in some cases? It is a huge optimization problem. And the third has to do with the following: typically, deep networks have many more parameters than the size of the training set. Now, when I was learning basic statistics in physics many years ago, I was told that if I had a model with some parameters and wanted to fit those parameters, I should be sure to get at least twice as many data points as parameters — and ten times as many would be much better. But this seems to say: I over-parameterize, I have many more degrees of freedom than data points, and I still do quite well; I do not overfit, I can predict. This is an interesting question to keep in mind during the whole class, because especially in the first part, which refers to shallow networks, the problem of overfitting is a big one. So this is a puzzle: how does it happen? These are some of the people involved in this in the group — Lorenzo is missing again.

Let me give you a feeling — we will deal with this in more detail at the end of the class — for how an answer to a question like the first one could be formulated: when and why are deep networks better than shallow networks? A shallow network could be a network like this: you have units here, and the only learnable weights are up here; everything else is fixed. A deep network has several layers, with parameters to be optimized at each level. What each unit is doing is the ReLU: the positive part of a dot product plus a bias term — that is a single unit. What we know — and this has been known since the eighties or so — is that in both cases, shallow or deep, these networks are universal approximators, in the sense that they can approximate any continuous function on a compact set arbitrarily well, within epsilon, and I can make epsilon as small as I like. It is like a power expansion, a Taylor series: you can approximate arbitrarily well, given enough terms; same here. The only constraint is that the nonlinearity, in the case of shallow networks, cannot be a polynomial — it has to be a nonlinearity, it can be a sigmoid or a ReLU, but not a polynomial. This constraint disappears in the case of deep networks, if you allow as many layers as you need to reach the epsilon approximation. Based on this, it looked in the eighties as if there were really no need for deep networks, because shallow networks can already approximate every function.
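For reference, the shallow network and the universal approximation statement just sketched can be written as follows — this is the standard, informal statement of the classical results from the late eighties and early nineties, including the condition that the nonlinearity not be a polynomial.

```latex
% A shallow (one-hidden-layer) network with N units:
f_N(x) = \sum_{i=1}^{N} c_i \,\sigma\!\big(\langle w_i, x\rangle + b_i\big),
\qquad x \in \mathbb{R}^d.

% Universal approximation (informal): if \sigma is continuous and not a
% polynomial, then for every continuous f on a compact set K \subset \mathbb{R}^d
% and every \varepsilon > 0 there exist N and \{c_i, w_i, b_i\} such that
\sup_{x \in K} \big| f(x) - f_N(x) \big| < \varepsilon.
```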
There is, however, a caution here, and it is that in general you have the curse of dimensionality. You find this curse of dimensionality in many domains of mathematics — control theory, optimization, approximation. In approximation, the curse of dimensionality refers to the fact that if you have a function of multiple variables — these would be the inputs to your network — then the number of parameters, in our case including neurons, that you need is exponential in the dimensionality. So if you want to reach an approximation error, say in the sup norm, that is less than epsilon, then you need on the order of one over epsilon to the power d parameters or neurons. For instance, say epsilon is 10 percent; then one over epsilon is 10, and with d equal to 10 variables you have 10 to the 10, which is already a large number. If your input, instead of 10 variables, is something like a thousand — which is roughly the number of pixels in CIFAR; the CIFAR database is a kind of mini-ImageNet where the images are just 32 by 32 — then you have 10 to the thousand, which is a very, very large number. So although in principle you can approximate every function arbitrarily well, in practice this makes it impossible in general.

However, there are certain classes of functions for which the curse of dimensionality can be avoided if you use a deep network, and cannot be avoided if you use a shallow network. These are what we call hierarchically local compositional functions — functions of functions. In this example I call each of the component functions a constituent function, so this function f is composed of constituent functions, and each constituent function has just two inputs. It does not need to be two; it can be more — the point is that it is much smaller than the full number of variables. For this type of function — and here, up top, I show graphs of functions and, down here, networks — this is the graph of a generic function of eight variables: you can think of it as one node, with all the inputs going into it and no other constraint. This is also the graph of a function, but of a compositional function — the one I showed before; this is the graph of that function. Each node represents one of the constituent functions, and you can see that each node is a function of only two variables, while the whole graph is a function of eight variables; and you could increase the number of input variables as much as you want by increasing the depth. It turns out that if I use a deep network with the same — or a similar; it does not need to be exactly the same — architecture as a function like this, then the curse of dimensionality goes away: the number of neurons or weights I need does not scale exponentially with d, but linearly. And, by the way, this means that convolutional networks are a special case of this kind of thing, so we expect convolutional networks to work much better than shallow networks for certain classes of problems — those that correspond to compositional functions.
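To make the counting explicit: the orders of magnitude quoted above, and the kind of shallow-versus-deep comparison the last classes will develop in detail, look roughly as follows. The second bound is stated very informally, for binary (two-input) constituent functions, and omits smoothness constants; it is meant only to show the exponential-versus-linear dependence on d.

```latex
% Shallow networks, generic d-variable functions (curse of dimensionality):
N_{\text{shallow}}(\varepsilon) = O\!\big(\varepsilon^{-d}\big):
\qquad \varepsilon = 0.1,\ d = 10 \;\Rightarrow\; \sim 10^{10} \text{ units},
\qquad d \approx 1000 \;\Rightarrow\; \sim 10^{1000} \text{ units}.

% Deep networks matching a hierarchically local compositional function with
% two-input constituent functions (informal):
N_{\text{deep}}(\varepsilon) = O\!\big((d-1)\,\varepsilon^{-2}\big),
% i.e. linear in d rather than exponential.
```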
A side remark — we will go through this later in the class — is that it is not the convolution, not the weight sharing, that makes the magic; it is the locality of the connectivity. Weight sharing helps, in fact, but it does not help in an exponential way; locality does. There is an open problem in this part, which is why functions that are compositional seem to be important for perception, and there is an interesting debate — we may discuss it during the course — about whether the reason is that physics is local in terms of interactions, or whether it is something that our brain happens to do well; you can go into philosophy here. Again, an opportunity for projects.

I will mention the other questions quickly. Why is the optimization relatively easy? We have some insights, and there are some insights in the field. And the last one is about the lack of overfitting. Essentially, the problem I was mentioning is shown here. On this plot you have the number of training examples — this is the CIFAR database of images, so the images have about a thousand pixels — and in this case the number of parameters, the number of units in the multi-layer network, was kept artificially low, around ten thousand. If you use only a few examples to train, say a thousand, then you are over-parameterized, and it is easy to find zero training error because you have a lot of variables to play with; but of course the training set is small, and so you do not do very well in terms of the test error, the error on new data — that is this curve. As you increase the number of examples, your test error goes down, and at some point your training error goes up, and if you increase further you see the training error approach the test error. This would be an example of generalization: for n going to infinity, the empirical error goes to the expected error, the error on new data. But there is an interesting thing on this other plot, where the horizontal axis is, on a logarithmic scale, the number of parameters — this goes to 10 to the 5, so ten, twenty, thirty times the number of training data. You can see that the training error goes down as I increase the number of model parameters, as before, and that the test error also goes down; and when I am in the over-parameterization regime — many more parameters than data — the test error still does not change very much. I do not have overfitting; the test error is still pretty good. This is one of the puzzles: you have a lot of degrees of freedom in your network, but it does not seem to create bad overfitting — where overfitting, by the way, would be zero training error but very bad predictions. (A sketch of how curves like these can be generated appears below.) We will deal with this in the last few classes, but I wanted to give you the flavor of the kind of things we will deal with in the class — and, in particular, to repeat that you really want a theory, because you want to know when these networks can fail, and especially you want to understand how they work, so that you can come up with principled ways to improve them instead of just trying one trick after another.
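The plots described above come from real experiments on CIFAR with a multi-layer network; the following is only a synthetic stand-in, a hedged sketch of how train/test curves of this kind can be produced — here with made-up Gaussian data, random ReLU features standing in for the hidden units, and a minimum-norm least-squares fit of the output weights. It makes no claim about what the resulting numbers will look like.

```python
import numpy as np

rng = np.random.default_rng(0)

def experiment(n_train, n_features, n_test=2000, d=20, noise=0.1):
    """Fit a (possibly over-parameterized) random-features model and
    return (training error, test error). All data here is synthetic."""
    w_true = rng.normal(size=d)
    def make(n):
        X = rng.normal(size=(n, d))
        return X, X @ w_true + noise * rng.normal(size=n)
    Xtr, ytr = make(n_train)
    Xte, yte = make(n_test)
    W = rng.normal(size=(d, n_features))           # fixed random hidden weights
    Ftr, Fte = np.maximum(Xtr @ W, 0), np.maximum(Xte @ W, 0)
    c, *_ = np.linalg.lstsq(Ftr, ytr, rcond=None)  # minimum-norm output weights
    return np.mean((Ftr @ c - ytr) ** 2), np.mean((Fte @ c - yte) ** 2)

# Sweep the training-set size with the number of parameters held fixed,
# mirroring the first plot; sweeping n_features instead mirrors the second.
for n in (50, 200, 1000, 5000):
    print(n, experiment(n_train=n, n_features=500))
```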
There are also bigger questions. One is: what comes after this? For instance, a problem like ImageNet requires one million labeled examples for training, and that is way, way over what you get as a baby or as a child — you do not have your parents telling you one million times "this is a car, this is a truck". So there is this big challenge, which is important from the scientific point of view — to understand how our mind works — but also from the engineering point of view. You can think of it this way: until a couple of decades ago, before machine learning, companies had to pay expensive programmers in order to write systems that did something relatively intelligent — Deep Blue, or whatever it was called, the chess-playing machine, was done this way. The situation now is that you pay much less: inexpensive labor that will label tons of data. But the ideal is really to have machines that can learn from experience by themselves, like children do. Children have a lot of other problems — you have to raise them — but anyway, it would be interesting to get there. So, we will see each other on Monday, and now one comment from Lorenzo; you will see his face a lot in the next few classes.
Info
Channel: MITCBMM
Views: 7,919
Rating: 5 out of 5
Keywords: CBMM, Center for Brains Minds and Machines, Artificial Intelligence
Id: Q5itLKscYTA
Length: 68min 8sec (4088 seconds)
Published: Wed Sep 13 2017