MIT Sloan: Intro to Machine Learning (in 360/VR)

Captions
The video you're watching now is in 360. The resolution is not great, but we wanted to try something different. If you're on a desktop or laptop, you can pan around with your mouse; if you're on a phone or tablet, you should be able to just move your device to look around. Of course, it's best viewed with a VR headset. The video that follows is a guest lecture on machine learning that I gave in an MIT Sloan course on the business of artificial intelligence. The lecture is non-technical and intended to build intuition about these ideas amongst the business students in the audience. The room was a half circle, so we thought: why not film the lecture in 360? We recorded a screencast of the slides and pasted it into the video so that the slides are more crisp. Let me know what you think, and remember, it's an experiment.

So this course is talking about the broad context, the impact of artificial intelligence. There's "global," which is the global impact of artificial intelligence, and there's "business," which is when you have to take these fun research ideas that I'll talk about today, a lot of which are cool on toy examples, and bring them to reality, where you face real challenges. That's what I would like to really highlight today. That's the business part: when you want to make real impact, when you really make these technologies a reality. So I'll talk about how amazing the technology is for a nerd like me, but also about the challenges you face when you take it into the real world.

So, machine learning, the technology at the core of artificial intelligence: we'll talk about the promise, the excitement that I feel about it, and the limitations; we'll bring it down a little bit to the real capabilities of the technology. For the first time, really, as a civilization we're exploring the meaning of intelligence. If you pause for a second and just think: maybe many of you want to make money out of this technology, many of you want to save lives, help people, but
also, at a philosophical level, we get to explore what makes us human. So while I'll talk about the low-level technologies, also think about the incredible opportunity here: we get to almost psychoanalyze ourselves by trying to build versions of ourselves in the machine.

All right, so here's the open question: how powerful is artificial intelligence? How powerful is machine learning, which lies at the core of artificial intelligence? Is it simply a helpful tool, a special-purpose tool to help you solve simple problems? That's what it currently is. Currently, machine learning is a way, if you can formally define the problem, formally define the tools you're working with, and formally define the utility function, what you want to achieve with those tools, then as long as you can define those things, we can come up with algorithms that can solve them, as long as you have the right kind of data. That's a theme I'll keep returning to: data is key. And the question is, into the future, can we break past this very narrow definition of what machine learning can give us, solving specific problems, to something bigger, where we approach the general intelligence that we exhibit as human beings? When we're born, we know nothing, and we learn quickly from very little data. The right answer is: we don't know. We don't know what the limitations of the technology are.

What kinds of machine learning are there? There are several flavors. The first is what has achieved success today: supervised learning. What I'm showing here on the left of the slide is the teacher, the data that is fed to the system, and on the right is the student, the system itself, the machine learning. Supervised learning is, for the most part, what everybody is referring to when they talk about machine learning today. It means that every single piece of data used to train the model is seen by human eyes, and those human eyes, with an
accompanying brain, label that data in a way that makes it useful to the machine. This is critical, because the human, the blue box there, is really costly. When every single piece of data used to train the machine needs to be seen by a human, you need to pay for that human. And second, you're limited by time: the amount of data necessary to label what it means to exist in this world is humongous.

Augmented supervised learning is when you get the machine to help you a little bit. There are a few tricks there, but they're still only tricks; the human is still at the core of it. The promise of the future research that we're pursuing, that I'm pursuing, and perhaps, if we get to discuss the applications, that some of the speakers here are pursuing, is in semi-supervised and reinforcement learning, where the human starts to play a smaller and smaller role in how much of the data they have to annotate. And the dream that the wizards of the dark arts of deep learning are all excited about is unsupervised learning. It has very few actual successes in real-world applications today, but the idea that you can build a machine that doesn't require a human teacher, a human being, to teach it anything fills us artificial intelligence researchers with excitement.

There's a theme here: machine learning is really simple. There's the learning system in the middle. There's a training stage, where you teach it something. All you need is some input data, and you need to teach it the correct output for that input data, so you have to have a lot of pairs of input data and correct output. There will be a theme of cats throughout this presentation. So if you want to teach a system the difference between a cat and a dog, you need a lot of images of cats, and you have to tell it that this is a cat, this bounding box here in the images of a cat. And you have to give it a lot of images of dogs and tell it: OK, in
these pictures, they're dogs. And then (there's a spelling mistake on the slide) the second stage is the testing stage, when you actually give it new input it has never seen before, and you hope that you have given it enough cat-versus-dog data for it to guess: is this new image, which I've never seen before, a cat or a dog?

Now, one of the open questions I want you to keep in mind is: what in this world can we not model in this way? What activity, what task, what goal? May I offer to you that there's nothing you can't model in this way. So let's think about what, in terms of machine learning, can be modeled. It starts small. First, on the bottom left of the slide, is one-to-one mapping, where the input is an image of a cat and the output is a label that says cat or dog. You can also do one-to-many, where the input is an image of a cat and the output is a story about that cat: captioning of the image. You can do it the other way, many-to-one mapping, where you give it a story about a cat and it generates an image. There's many-to-many; this is Google Translate, where we translate a sentence from one language to another. And there are various flavors of that. Again, same theme here: input data provided with correct output, and then you let it go into the wild, where it runs on input data it hasn't seen before to provide guesses.

And it's as simple as this: whatever you have, you can convert it into one of the following four things. A number. A vector of numbers, so a bunch of numbers. A sequence of numbers, where the temporal dynamics matter, like audio or video, where the ordering matters. Or a sequence of vectors of numbers. If you can convert it into numbers, and I propose to you that there's nothing you can't convert into numbers, you can have a system learn to do it. And the same thing with the output: generate numbers, vectors of numbers, sequences of numbers, or sequences of vectors of numbers. First, are there any questions at this point?
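The claim that anything can be converted into numbers can be made concrete with a small sketch. The specific encodings below are illustrative choices, not from the lecture:

```python
# a tiny 3x3 grayscale "image": each pixel is a number from 0 to 255
image = [
    [  0, 128, 255],
    [ 34,  90, 180],
    [255, 255,   0],
]
flat_image = [p for row in image for p in row]  # a vector of 9 numbers

# text: each character becomes its integer code, giving a sequence of numbers
text = "cat"
encoded = [ord(c) for c in text]

# video: a sequence of vectors of numbers (one flattened frame per time step)
video = [flat_image, flat_image]
```

Real systems use richer encodings (word embeddings, spectrograms), but the principle is the same: everything becomes numbers, vectors, or sequences of them.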
Well, we have a lot of fun slides to get through, but I'll pause every once in a while to make sure we're on the same page.

So what kind of input are we talking about? Just to fly through it: images, so faces, or medical applications, looking at scans of different parts of the body to diagnose medical conditions. Text: conversations, your texts, articles, blog posts, for sentiment analysis, or question answering, where you ask it a question and the output you hope for is answers. Sound: voice recognition, anything you could tell from audio. Time-series data: financial data. The stock market: you can use it to predict anything you want about the stock market, including whether to buy or sell. If you're curious, that doesn't work quite so well as a machine learning application. The physical world: cars, or any kind of object, any kind of robot that exists in this world. The location of where I am, the location of where other things are, the actions of others. All of it can be converted into numbers.

And the correct output, same thing. Classification is a bunch of numbers saying: is it a cat or a dog? Regression is saying to what degree I turn the steering wheel. Sequence is generating audio, generating video, generating stories, captioning images. You can generate anything you could think of as numbers.

At the core of it is a bunch of data-agnostic machine learning algorithms. There are traditional ones: nearest neighbors, Naive Bayes, support vector machines. A lot of them are limited, and I'll describe how. And then there are neural networks. There's nothing special and new about neural networks, and I'll describe exactly the very subtle thing that is powerful, that has been there all along, and what has now been able to unlock that power. But it's still just a flavor of machine learning algorithm. The inspiration for neural networks, as Jonathan showed last time, is our human brain.
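The difference between the classification and regression outputs mentioned above can be sketched in a few lines; the scores and labels here are made up for illustration:

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# classification: the output is a label, picked from a probability vector
labels = ["cat", "dog"]
probs = softmax([2.0, 0.5])
prediction = labels[probs.index(max(probs))]

# regression: the output is just a real number, e.g. a steering command
steering_angle = -0.13  # any real value; fraction of a full wheel turn
```

Either way, the raw output of the model is numbers; only the interpretation differs.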
Perhaps the reason the media, the hype, is captivated by the idea of neural networks is that you immediately jump to this feeling, because there's this mysterious structure to them that scientists don't understand. The artificial neural networks, which is what I'm referring to, and the biological ones: we don't understand them, and the similarity captivates our minds, so we think this approach is perhaps as limitless as our own human mind. But the comparison ends there. In fact, artificial neural networks are much simpler computational units.

At the core of everything is this neuron. It's a computational unit that does two very simple operations. On the left side, it takes a set of numbers as inputs, applies weights to those inputs, sums them together, applies a little bias, and provides an output somewhere between 0 and 1. So you can think of it as a computational entity that gets excited when it sees certain inputs and gets totally turned off when it gets other kinds of inputs. Maybe this neuron, with its 0.7, 0.6, 1.4 weights, gets really excited when it sees pictures of cats and totally doesn't care about dogs. Some of us are like that. So that's the job of this neuron: to detect cats.

Now, the way you build an artificial neural network, the way you release the power that I'll talk about in the following slides, about the applications, what could be achieved, is just by stacking a bunch of these together. Think about it: this is an extremely simple computational unit. So whenever we get to the slides that say neural networks are amazing, I want you to think back to this slide: everything is built on top of these really simple addition operations, with a simple nonlinear function applied at the end. Just a tiny math operation.
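The single neuron described above can be written out directly. This is a sketch using a sigmoid as the nonlinearity, which is one common choice:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, plus bias,
    squashed into (0, 1) by a sigmoid nonlinearity."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# the hypothetical "cat neuron" from the slide, with weights 0.7, 0.6, 1.4:
# strong inputs on all three features make it fire close to 1
activation = neuron([1.0, 1.0, 1.0], [0.7, 0.6, 1.4], bias=0.0)
```

With zero inputs the neuron sits at exactly 0.5, neither excited nor turned off; the weights decide which input patterns push it toward 1.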
We stack them together in a feed-forward way, so there's a bunch of layers, and when people talk about deep neural networks, it means there's a bunch of those layers. Then there are recurrent neural networks, a special flavor that is able to have memory: as opposed to just pushing input to output directly, it's also able to do stuff on the inside, in a loop, where it remembers things. This is useful for natural language processing and audio processing, whenever the length of the sequence is not defined.

OK, slide number one of "neural networks are amazing." This is perhaps for the math nerds, but I also want you to use your imagination. There's a universality to neural networks. It means that this simple network (on the left is the input, on the right is the output) with just a single hidden layer, called a hidden layer because it sits there between the input and the output layers, a single hidden layer with some number of nodes, can represent any function. Any function. That means anything you want to build in this world, anyone in this room, can be represented with a neural network with a single hidden layer. And this is just one hidden layer; the power of these things is limitless. The problem, of course, is how you find the network, how you build a network that is as clever as many of the people in this room. But the fact that you can build such a network is incredible. It's amazing. I want you to think about that.

And the way you train a network: it's born as a blank slate, some random weights assigned to the edges. Again, the parameters at the core of this network are the numbers on each of those arrows, each of those edges. You start knowing nothing; this is a baby network. And the way you teach it something, unfortunately, currently, as I said, is a supervised learning mechanism: you have to give it pairs of input and output.
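The universality claim can be made tangible with a hand-built example: pairs of steep sigmoid hidden units form "bumps," and a single hidden layer of such bumps can approximate a function piece by piece. This construction is a standard textbook illustration, not something from the lecture:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bump(x, left, right, height, steep=100.0):
    """Two steep sigmoid hidden units whose difference approximates
    a rectangular bump of the given height on [left, right]."""
    return height * (sigmoid(steep * (x - left)) - sigmoid(steep * (x - right)))

def approx_square(x):
    """Approximate f(x) = x^2 on [0, 1] with five bumps,
    i.e. a network with one hidden layer of ten sigmoid units."""
    return sum(bump(x, i / 5, (i + 1) / 5, ((i + 0.5) / 5) ** 2) for i in range(5))
```

More hidden units give narrower bumps and a better fit; that is the intuition behind the universal approximation result, and also why "can represent" is very different from "can be found by training."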
You have to give it pictures of cats, and labels on those pictures saying that they're cats. The basic, fundamental operation of learning is that you compute a measure of error and you backpropagate it through the network. What do I mean? Everything is easier with cats. I apologize, too many cats. So the input here is a cat, and the neural network we're training is just guessing; it doesn't know, so it guesses: cat. Well, it happens to be right. So this is the measure of error: yes, you got it right, and you have to backpropagate that. You have to reward the network for doing a good job. What I mean by a reward: there are weights on each of those edges, and the individual neurons that were responsible, back to that cat neuron, need to be rewarded for seeing the cat. So you just increase the weights on the neurons associated with producing the correct answer. Now you give it a picture of a dog, and the neural network says: cat. Well, that's an incorrect answer, so there's a high error that needs to be backpropagated through the network. The weights responsible for classifying this picture as a cat need to be punished; they need to be decreased. Simple. And you just repeat this process over and over.

This is what we do as kids when we're first learning. For the most part, we are also supervised learning machines, in the sense that we have our parents, and we have the environment, the world, that teaches us what's correct and what's incorrect, and we backpropagate this error and reward through our brain to learn. The problem is, as human beings, we don't need too many examples, and I'll talk about some of the drawbacks of these approaches. You fall off your bike once or twice and you learn how to ride the bike. Unfortunately, neural networks need to fall off the bike tens of thousands of times in order to learn how not to do it. That's one of the limitations.
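The reward-and-punish loop described above can be sketched with the simplest possible learner: a single linear unit with a perceptron-style update. This is a stand-in for real backpropagation, which does the same kind of nudging across many layers at once, and the toy "features" are invented for illustration:

```python
def train_step(weights, inputs, label, lr=0.1):
    """One learning step: guess, measure the error, nudge the weights."""
    guess = 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0
    error = label - guess  # 0 if correct, +1 or -1 if wrong
    # reward/punish: raise or lower each weight in proportion to its input
    return [w + lr * error * x for w, x in zip(weights, inputs)]

# toy data: label 1 = "cat", label 0 = "dog"
examples = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
weights = [0.0, 0.0]
for _ in range(20):  # repeat the process over and over
    for inputs, label in examples:
        weights = train_step(weights, inputs, label)
```

After a few passes the weights settle and every training example is classified correctly; the "tens of thousands of falls" point is that real tasks need vastly more examples than this toy one.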
And one key thing I didn't mention here: when we refer to input data, we usually refer to sensory data, raw data. We have to represent that data in some clever way, in some deeply clever way, so that we can reason about it, whether in our brains or in the neural network. Here is a very simple example to illustrate why the representation of the data matters. The way you represent the data can make discriminating one class from another, a cat versus a dog, either incredibly difficult or incredibly simple. Here is a visualization of the same data in Cartesian coordinates and in polar coordinates. On the right, you can just draw a simple line to separate the two classes. What you want is a system that's able to learn the polar-coordinate representation, versus the Cartesian representation, automatically.

And this is where deep learning has stepped in and revealed the incredible power of this approach. Deep learning is the smallest circle here: it is a type of representation learning. Machine learning is the second-biggest circle, and this class is about the biggest circle: AI, which includes robotics, includes all the fun things that are built on learning. I'll discuss why I think machine learning will eventually close this entire set of circles into one, but for now, AI is the biggest circle, a subset of that is machine learning, and a smaller subset of that is representation learning.

So deep learning is not only able, given a few examples of cats and dogs, to discriminate between a cat and a dog; it's able to represent what it means to be a cat. It's able to automatically determine the fundamental units, at the low level and the high level, of this almost platonic idea of what it means to represent a cat: from the whiskers, to the high-level shape of the head, to the fuzziness and the deformable aspects of the cat. I'm not a cat expert, but I hear these are the features that are essential to discriminate between a
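The Cartesian-versus-polar intuition above can be sketched directly; the two "rings" of points are invented for illustration:

```python
import math

def to_polar(x, y):
    """Re-represent a Cartesian point as (radius, angle)."""
    return math.hypot(x, y), math.atan2(y, x)

# two classes arranged in rings: class A near radius 1, class B near radius 3.
# no straight line in (x, y) separates them, but after the change of
# representation the decision rule is a single threshold on the radius.
class_a = [(math.cos(t), math.sin(t)) for t in range(6)]
class_b = [(3 * math.cos(t), 3 * math.sin(t)) for t in range(6)]

def classify(x, y):
    r, _ = to_polar(x, y)
    return "A" if r < 2 else "B"
```

Here a human chose the polar representation; the point of representation learning is that the network discovers such a transformation on its own.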
cat and a dog. Learning those features, as opposed to relying on experts: that's the drawback of the systems Jonathan talked about from the 80s and 90s, where for any specific domain you tried to solve, you had to bring in experts and have them encode that information. Deep learning, and this is really the only big difference between deep learning and other methods, learns the representation for you. It learns what it means to be a cat. Nobody has to step in and help it figure out that cats have whiskers and dogs don't.

What does this mean? The fact that it can learn these features, these whisker features: as opposed to having five or ten or a hundred or five hundred features encoded by brilliant engineers with PhDs, it can find hundreds of thousands, millions, hundreds of millions of features automatically. Stuff that can't be put into words or described. In fact, this is one of the limitations of neural networks: they find so many fundamental things about what it means to be a cat that you can't visualize what the network really knows. It just seems to know stuff, and it finds that stuff automatically.

Why is this critical? Because it's able to automatically learn those hundreds of millions of features, it's able to keep utilizing data; the diminishing returns don't kick in until, well, we don't know when they kick in. The point is, with the classical machine learning algorithms, you start hitting a wall when you have tens of thousands of images of cats. With deep learning, you get better and better with more data.

Neural networks are amazing, slide two. Here's a simple arcade game where two paddles are bouncing a ball back and forth. OK, great: you can build an artificial intelligence agent that can play this game. It can, and not even that well at first; it kind of learns to do all right, and eventually win. Here's the fascinating thing with deep learning: as opposed to
encoding the position of the paddles and the position of the ball, having an expert in this game come in and encode its physics, the input to the neural network is the raw pixels of the game. It's learning in the following way. You give it an evolution of the game: a bunch of pixels. Images are built up of pixels, which are just numbers from 0 to 255, so there's an array of numbers that represents each image, and you give it several tens of thousands of images that represent the game. So you have this stack of pixels, this stack of images, that represents a game, and the only thing the system knows, given this giant stack of numbers, is whether at the end you won or lost. That's it. Based on that, it has to figure out how to play the game. It knows nothing about games, nothing about colors or balls or paddles or winning or anything.

Why is it amazing that this even works? And it works: it wins. It's amazing because that's exactly what we do as human beings. This is general intelligence. I need you to pause and think about this. We'll talk about special intelligence too, the usefulness, and OK, there are cool tricks here and there that can get you an edge on your high-frequency trading system, but this is general intelligence. General intelligence is the same intelligence we use as babies. When we're born, what we get is sensory input. Right now, most of us are seeing, hearing, feeling with touch, and that's the only input we get. We know nothing, and with that input we have to learn something. Nobody is pre-teaching us stuff. This is an example of that, a trivial example, but one of the first examples where this is truly working. I'm sorry to linger on this, but it's a fundamental fact: the fact that we now have systems that outperform human beings in these simple arcade games is incredible. This is the research side of things, but let me step back.
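The setup described above, raw pixels in and nothing else, can be sketched as data plumbing. This is a hypothetical simplification of the Atari-style pipeline, not the actual research code:

```python
from collections import deque

class PixelState:
    """Keep the last k raw frames; their concatenation is the agent's
    entire knowledge of the game: just numbers from 0 to 255."""

    def __init__(self, k=2):
        self.frames = deque(maxlen=k)  # oldest frame is dropped automatically

    def observe(self, frame):
        # frame: a flat list of pixel intensities for one screen image
        self.frames.append(frame)

    def state(self):
        # one long vector of numbers: the network's only view of the world
        return [p for frame in self.frames for p in frame]

# usage with tiny 2x2 "frames"
s = PixelState(k=2)
s.observe([0, 255, 128, 64])
s.observe([10, 20, 30, 40])
```

Stacking a few consecutive frames is what lets a network infer motion (which way the ball is moving) from otherwise static images.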
Again, the takeaway from that previous slide is why I think machine learning is limitless in the future. Currently, it's limited. Again, the representation of the data matters, and if you want to have impact, we can currently only tackle the small problems. What are those problems? Image recognition: given an entire image, of a leopard, of a boat, of a mite, we can classify what's in that image with pretty good accuracy. That's image classification. What else? We can find exactly where in the image each individual object is; that's called image segmentation. Again, the process is the same: the learning system in the middle, a neural network. As long as you give it a set of numbers as input and the correct set of labels as output, it learns to do that for data it hasn't seen.

Let me pause for a second. Does anyone have any questions about the techniques of neural networks? Yes. So, that's a great question, and in a couple of slides I'll get to it exactly. The data representation, I'll elaborate in a little bit, but loosely, the data representation for a neural network is in the weights of each of those arrows connecting the neurons. That's where the representation is. I'll show an example to really clarify what that means; the Cartesian-versus-polar-coordinates example is just a very simple visualization of the concept. But you want to be able to represent the data in an arbitrary way, with no limits to the representation; it could be highly nonlinear, highly complex. Any other questions?

So, I have a couple of slides that almost just ask these questions, because there are no good answers. One could argue, and I think somebody in the last class brought this up: is machine learning just pattern recognition? It's possible that reasoning, thinking, is just pattern recognition, and I'll describe an intuition behind that. We tend to respect thinking a lot, because we've only recently, as human beings, learned to do it in our
evolutionary time. We think that it's somehow special, different from, for example, perception. We've had visual perception for several orders of magnitude longer in our evolution as a living species; we started to learn to reason, I think, about a hundred thousand years ago. So we think it's somehow special, distinct from the same kind of mechanism we use for seeing things. Perhaps it's exactly the same thing. Perception is pattern recognition; perhaps reasoning is just a few more layers of that. That's the hope. That's an open question.

Yes, that's a great question. There have been very few breakthroughs in neural networks through the AI winters that we discussed, through a lot of excitement in spurts, and even recently there have been very few algorithmic innovations. The big gains came from compute: improvements in GPUs and better, faster computers. And you can't underestimate the power of community: the ability to share code, to communicate and work on code together through the internet. And the digitization of data: the ability to have large datasets easily accessible and downloadable. All of those little things. But in terms of the future of deep learning and machine learning, I think it all rides on compute, meaning continued bigger and faster computers. That doesn't necessarily mean Moore's law, making smaller and smaller chips; it means getting clever in different directions: massive parallelization, and coming up with super-efficient, power-efficient implementations of neural networks, and so on.

So let me just fly through a few examples of what we can do with machine learning, just to give you a flavor. I think in future lectures, possibly, different speakers will discuss specific applications and really dig into those. As opposed to working with just images, you can work with videos and segment those. I mentioned image segmentation; we do video segmentation, too, segmenting the different
parts of a scene, which is useful for a particular application. Here, in driving, you can segment the road from cars and vegetation and lane markings. You can also detect, and this is a subtle but important point, very small pieces of information that we know are important, like: there is a red light, I have to stop, I have to slow down. So, hard questions. The question was: how do you detect the traffic light? Well, how do we do it as human beings? Let's start there. The way we do it is through the knowledge we bring to the table. We know what it means to be on the road; there's a huge network of knowledge that you come with, and that makes the perception problem much easier. This, by contrast, is pure perception: you take an image and you separate different parts based purely on tiny patterns of pixels. First it finds all the edges, and it learns that traffic lights have certain kinds of edges around them. Then, zooming out a little bit, they have a certain collection of edges that make up this black rectangular shape. So it's all about shapes; it builds up the shape structure of things. But it's a purely perceptual approach, and one of the things I argue is that if it's purely a perception approach, and you bring no knowledge to the table about the physics of the world, the three-dimensional physics and the temporal dynamics, you are not going to be able to achieve near-100% accuracy. And that's exactly the right question: for all of these things, think about how you as a human being would solve the problem, and what is lacking in the machine learning approach, what data is lacking, in order to achieve the same kind of results, the same kind of reasoning you would use as a human.

There is also image detection. A subtle but important point: the stuff I mentioned before, image classification, is, given an image of a cat,
that you don't find the cat; you say whether this image is of a cat or not. Detection, or localization, is when you actually find where in the image the cat is. That problem is much harder, but also doable with machine learning, with deep neural networks.

Now, as I said, inputs and outputs can be anything. The input could be a video, the output could be a video, and you can do anything you want with these videos. You can colorize video: take an old black-and-white film and produce color images. Again, in terms of having an impact in the world with these applications, you have to think: this is a cool demonstration, but how well does it actually work in the real world?

Translation, whether that's text to text or image to image: here, you can translate "dark chocolate" from one language to another. This class is Global Business of Artificial Intelligence; there's a reference below, and you can go and generate your own text. You can generate handwriting: you can type in some text, and given different styles learned from other handwriting samples, it can render any text as handwriting. Again, the input is language; the output is a sequence of pen movements on the screen.

You can complete sentences. This is kind of a fun one: you can generate language by feeding the system some input first. So, in black, it says "life is," and then you have the neural network complete the sentence: "life is about kids," "life is about the weather." There's a lot of knowledge being conveyed here, I think. And you can start the sentence with "the meaning of life": "the meaning of life is literary recognition," true for us academics, or "the meaning of life is the tradition of ancient human production," also true. But these are all generated by a computer.

You can also caption. This has become very popular recently: caption generation. Given
an image as input, the output is a set of text that captures the content of the image. You find the different objects in the image, which is a perception problem, and once you find the different objects, you stitch them together into a sentence that makes sense: you generate a bunch of sentences and classify which sentence is most likely to fit this image.

And certainly, I've tried to avoid mentioning driving too much, because it is my field and what I'm excited about, and the moment I start talking about driving it'll all be about driving. But I should mention, of course, that deep learning is critical to driving applications, both for perception and, what is really exciting to us now, for the end-to-end approach. Whenever you say end-to-end in any application, it means you start from the very raw inputs the system gets and produce the very final output expected of the system. In the self-driving car case, as opposed to breaking the problem down into individual components, perception, localization, mapping, control, planning, you take the whole stack, ignore all the super-complex problems in the middle, take the external scene as input, and as output produce steering, acceleration, and braking commands. So in this way, taking as input the image of the external world, in this case in a Tesla, we can generate steering commands for the car. Again, the input is a bunch of numbers, just images; the output is a single number that gives you the steering of the car.

OK, so let's step back for a second and think about what we can't do with machine learning. We talked about how you can map numbers to numbers; let's think about what we can't do. At the core of artificial intelligence, in terms of making an impact on this world, is robotics. So what can't we solve in robotics and artificial intelligence with a machine learning approach? Let's
Okay, so let's step back for a second and think about what we can't do with machine learning. We've talked about how you can map numbers to numbers; let's think about what we can't do. At the core of artificial intelligence, in terms of making an impact on this world, is robotics. What can't we solve in robotics and artificial intelligence with a machine learning approach? Let's break down what artificial intelligence means. Here's a stack: starting at the very top is the environment, the world you operate in; there are sensors that sense that world; there is feature extraction and learning from that data; there's some reasoning and planning; and effectors are the way you manipulate the world. What can't we learn in this way? We've had a lot of success, as Jonathan talked about in the history of AI, with formal tasks — playing games, solving puzzles. Recently we're having a lot of breakthroughs in medical diagnosis. We're still struggling with, but are very excited about, the robotics space: the more mundane tasks of walking, of basic perception, of natural language, written and spoken. And then there are the human tasks, which are perhaps completely out of reach of this pipeline at the moment: cognition, imagination, subjective experience — high-level reasoning, not just common sense but human-level reasoning. So let's fly through this pipeline. There are sensors: cameras, lidar, audio. There is communication that flies through the air, wireless or wired. There are IMUs, measuring the movement of things. That's the way to think about it — that's the way human beings, and any kind of system you design, work: you measure the world. You don't just get an API to the world; you have to somehow measure aspects of this world. That's how you get the data, how you convert the world into data you can play with. And once you have the data, there's the representation side: you have to convert that raw data — raw pixels, raw audio, raw lidar data — into data that's useful for the intelligence system, for the learning system to use to discriminate between one thing and another. For vision, that's finding edges, corners, object parts, and entire objects. Then there's the machine learning that I've talked about — different kinds of mappings from the representation you've learned to actual outputs.
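The "finding edges" step of that pipeline can be made concrete with a hand-written convolution. This is a classical Sobel-style filter — shown here as a stand-in for the kinds of filters the early layers of a deep network end up learning on their own; the tiny image is invented:

```python
import numpy as np

def convolve2d(img, kernel):
    """Minimal 'valid' 2-D convolution, enough to apply one filter."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernel: responds strongly to vertical edges (left/right contrast).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy image: dark on the left half, bright on the right half.
img = np.zeros((8, 8))
img[:, 4:] = 255.0

edges = convolve2d(img, sobel_x)
# The filter lights up only along the boundary between dark and bright.
print(edges)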
Now, once you have this — and this goes maybe a little bit to Simon's question — there is reasoning. This is something that's out of reach for machine learning at the moment. Going to your question: we can build a world-class machine learning system for taking an image and classifying that it's a duck. I wonder if this will wake you up. We could take — this is a well-studied, exceptionally well-studied problem — we could take an audio sample of a duck and tell that it's a duck, in fact what species of bird it is; it's incredible how much research there is in bird species classification. And we can look at video and do activity recognition: tell that it's swimming. But what we can't do with learning right now is reason that if it looks like a duck, swims like a duck, and quacks like a duck, it is very likely to be a duck. This is the reasoning problem. This is the task that I personally am obsessed with and that I hope machine learning can close. And then there are planning, action, and the effectors. This is another place where machine learning has not made many strides. There are mechanical issues here that are incredibly difficult: the degrees of freedom, all the actuators involved, just the ability to localize every part of yourself in this dynamic space where things are constantly changing, where there are degrees of uncertainty, where there's noise — just that basic problem is exceptionally difficult. Let me pose this question. We talked about what machine learning can do with the cats and the duck: given a representation, it could predict what's in the image. Deep learning has been able to do the feature extraction, the representation learning — this is the big breakthrough that everybody's excited about. But can it also reason? That's one of the open questions. Can it do the planning and action? And, as human beings do, can it close the loop entirely from sensors
to effectors — learning not only the brain but the way you sense the world and the way you affect the world? So, the question was about the Pong game — let me talk about it a little longer. It doesn't get punished when it doesn't detect the ball. This is the beautiful thing: it gets punished only at the very end of the game, for losing, and it gets rewarded for winning. So it knows nothing about that ball, and it learns about that ball. That's something to really sit and think about: how do we, as human beings — imagine you're playing with a physical ball — how do you learn what a ball is? You get hurt by it, you squeeze it and you throw it, you feel the dynamics of it, the physics of it, and nobody tells you what a ball is. You're just using the raw sensor input. We take it for granted — and maybe this is what I can end on, something Jonathan brought up — we take the simplicity of this task for granted because we've had eyes. We, broadly speaking, as living species on planet Earth — eyes have been evolving for 540 million years, so we have 540 million years of data. We've been walking for close to that as bipedal mammals. We have been thinking only very recently — a hundred thousand years versus hundreds of millions of years. And that's why, with some of these problems we're trying to solve, you can't take for granted how difficult they actually are. This, for example, is Moravec's paradox, which Jonathan brought up: the easy problems are hard. The things we think are easy are actually really hard. This is a state-of-the-art robot on the right playing soccer, and that is a state-of-the-art human on the left playing soccer. I'll give it a second.
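The Pong point above — punishment or reward arriving only at the very end of the game — is usually handled by spreading that single terminal reward backwards over every step as a discounted return. Here's a minimal sketch; the episode length and discount factor are made up:

```python
def discounted_returns(rewards, gamma=0.99):
    """Propagate a sparse end-of-episode reward back through earlier steps."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# An episode of Pong: no feedback for 99 frames, then +1 for winning the game.
rewards = [0.0] * 99 + [1.0]
returns = discounted_returns(rewards)

# Every earlier action now gets some credit, fading with distance from the win.
print(returns[0], returns[-1])
```

A policy-gradient learner then reinforces the actions taken at each step in proportion to these returns — which is how an agent that is never told what a ball is still ends up tracking it.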
The question was: isn't there a fundamental difference between the way we train neural networks and the way biological neural networks were trained through evolution, by discarding through natural selection the networks that didn't work so well? So, first of all, the process of evolution is, I think, not well understood — let me be careful here — the role of evolution in the evolution of our cognition, of our intelligence. I don't know; this is an open question. But maybe to clarify this point: artificial neural networks are fixed, for the most part, in size. That's exactly right. It's like a single human being that gets to learn. We don't have mechanisms for modifying or evolving those neural networks yet — although you could think of researchers as doing exactly that: you have grad students working on different neural networks, and the ones that don't do a good job don't get promoted. There is a natural selection there. But other than that, it's an open question. Stay tuned, and keep your head up, because the future, I believe, is really promising. And the slides will be made available, for sure. I think a lot of the exploration of what it means to build an intelligent machine has been in sci-fi movies; we're now beginning to actually make it a reality. This is 2001: A Space Odyssey, to keep with the theme of the previous lecture. Go ahead. As opposed to the dreamlike monolith view, when the astronaut is gazing out into the open sky at the stars, we're going to look at the practice of AI today. If you're familiar with the movie — when this new technology appeared before our eyes, we were full of excitement; now, how do we transfer that into actual practical impact on our lives? To quickly review what we talked about last time: I presented the technology and asked whether this technology merely serves a special purpose — answering specific tasks that can be formalized — or whether, through the process of transferring knowledge learned in one domain, it can generalize, so that an intelligent system trained in a small domain can be used to achieve general intelligent tasks like we do as human beings.
This is kind of a stack of artificial intelligence, going all the way from the top — the environment, the world — to the sensors that sense that world and convert it into data, to the intelligence system, the way it perceives this world. Once you convert the world into some numbers, you're able to extract some representation of that world, and this is where machine learning starts to come into play. And then there's the part — I will raise it again today — can machine learning do the steps that follow, the ones we do very well as human beings? The reasoning step: you can tell the difference between a cat and a dog, but can you now start to reason about what it means to be alive, what it means to be a cat, a living creature, versus this or that kind of physical object — to take what's called common sense, the things we take for granted, and start to construct models of the world through reasoning? Descartes: I think, therefore I am. We want our neural networks to come up with that on their own. And once you do that, there's action: you go right back into the world, you start acting in that world. So the question is: can this be learned from data, or do experts need to encode the knowledge of reasoning, the knowledge of actions, the set of actions? That's the open question I raise, and it continues throughout the talk today. And so, as we start to think about how artificial intelligence — especially machine learning, as it realizes itself through robotics — gets to impact the world, we start thinking about what are the easy problems and what are the hard problems. It seems to us that vision and movement — walking — are easy, because we've been doing them for millions of years, hundreds of millions of years, and that thinking, reasoning, is hard. I propose to you that it's perhaps because we've only been doing it for a short time, and so we think we're quite special because we're able to think. So we have to question what
is easy and what is hard. Because when we start to develop some of these systems, what you start to realize is that all these problems are equally hard: the problem of walking that we take for granted, the actuation, the ability to recognize where you are in physical space, to sense the world around you, to deal with the uncertainty of the perception problem. All of these robots, by the way, are from the most recent DARPA challenge, which MIT was also part of. What are these robots doing? They have only sparse communication with human beings on the periphery, so most of the stuff they have to do autonomously — like get inside a car (this is an MIT robot, unfortunately). They have to get in the car, and the hardest task is to get out of the car. That's walking. So this raises the very real aspect here: you want to build applications that actually work in the real world, and that's the first challenge and opportunity — many of the technologies we talk about currently crumble under the reality of our world when we transfer them from a small dataset in the lab to the real world. Computer vision is perhaps one of the best illustrations of this. Computer vision is the task, as we talked about, of interpreting images, and there have been a lot of great accomplishments in interpreting images: cats versus dogs. Now, when you try to create a system like the Tesla vehicle that we work with and that I often talk about — it's a vision-based robot, right? It has radar for basic obstacle avoidance, but most of the understanding of the world comes from a single monocular camera. Now they've expanded the number of cameras, but for most of this time there have been a hundred thousand vehicles driving on the roads with, essentially, a single webcam. When you start to do that, you have to perform all of this extraction of texture, color, optical flow — the
movement of the images through time, the temporal dynamics. You have to construct these patterns, construct an understanding of objects and entities and how they interact, and from that you have to act in this world — all based on this computer vision system. So it's no longer cats versus dogs: it's detection of pedestrians, where the wrong classification, the wrong detection, is the difference between life and death. But let's look at cats — those are a little more comfortable. Computer vision — I would like to illustrate to you why this is such a hard task, which, as we've talked about, we think is easy because we've been doing it for 500 million years. Computer vision is actually incredible. All you're getting with your human eyes is, essentially, pixels: there's light coming into your eyes, the reflection of light from the different surfaces out here, and the sensors inside your eyes convert that into numbers. It's really very similar to what we use with computers: RGB images, where the individual pixels are numbers from 0 to 255 — 256 possible values — and there's just a bunch of them. That's all we get: a collection of numbers, where the ones that are spatially close together are part of the same object — cat pixels are all connected together. That's the only thing we have to help us; the rest of it is just numbers, intensity values, and we have to use those numbers to classify what's in the image. If you really think about it, this is a really difficult task. All you get is these numbers — how the heck are you supposed to form a model of the world with which you can detect pedestrians with 99.99999% accuracy, whether it's pedestrians or cars or cyclists in the car context, or any other kind of application you're looking at? Even if your job on the factory floor is to detect the defective gummy bears.
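To make the "all you get is numbers" point concrete, here is a tiny grayscale image exactly as a machine receives it — nothing but a grid of intensities from 0 to 255, with spatial proximity as the only structural hint. The values are invented for illustration:

```python
import numpy as np

# A 6x6 grayscale "image": just integers from 0 (black) to 255 (white).
image = np.array([
    [ 12,  15,  10, 200, 210, 205],
    [ 14,  11,  13, 198, 207, 202],
    [ 10,  16,  12, 201, 209, 204],
    [ 11,  13,  15, 199, 206, 203],
    [ 13,  12,  14, 202, 208, 201],
    [ 15,  10,  11, 197, 205, 206],
], dtype=np.uint8)

# This is literally everything a vision system receives: no objects, no depth,
# no lighting model -- only spatially arranged intensities.
print(image.shape, image.dtype, image.min(), image.max())

# The only structure available is that nearby pixels tend to belong together:
left_region = image[:, :3].mean()    # dark region
right_region = image[:, 3:].mean()   # bright region
print(left_region, right_region)
```

A real camera frame is the same thing, just millions of these numbers per image, three times over for the red, green, and blue channels.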
They're flying past at like a hundred miles an hour, and your task — you don't want that bad gummy bear to get by into your product, because the brand will be damaged. However serious or not serious your application is, you have to have a computer vision system that deals with all of these aspects. Viewpoint variation and scale variation: no matter the size of the object, it's still the same object; no matter the viewpoint from which you look at that object, it's still the same object. Lighting: the lighting is consistent here because we're indoors, but when you're outdoors, when you're moving, when the scene is moving, the complexity of the lighting variations is incredible, from the illumination to just the movement of the different objects in the scene. And in this particular one — it's twilight and the light is changing — you know, almost every time I drive there are one or two things that I see where I'm drawing on like 200 million years to be able to figure out: it's a guy who's opened his car door and I can't see him, but I can just see that the light doesn't look quite right on that side of the road, and somehow in my mind I know it's a person. It seems like an almost impossible problem for machines to get right. I will argue that the pure perception task is too hard — that you come to the table as a human being with this huge amount of knowledge. You're not actually interpreting all the complex lighting variations you're seeing; you actually know enough about the world, enough about your commute home, enough about the kinds of things you would see in this world, about Boston, about the way pedestrians move at a certain time of day. You bring all that to the table, and that makes the perception task doable. And that's one of the big missing pieces in the technology, as I'll talk about — that's the open problem of machine learning: how to bring all that knowledge — first
of all, build that knowledge, and then bring that knowledge to the table, as opposed to starting from scratch every time. And so — cats, as promised. Okay. To me, and for most of the computer vision community, occlusion is one of the biggest challenges, and it really highlights how far we are from being able to reason about this world. An occlusion is when the object you're trying to classify or detect is partially blocked by another object in front of it. This is something you think is trivial — perhaps you don't even really think about it, because we reason in a three-dimensional way — but occlusion makes perception incredibly difficult. Think about this: this image is converted into numbers, and for the task of detecting whether there is a cat in this image, yes or no, you have to be able to reason about the image with an object in the scene. Most of us are able to very easily detect that there is a cat in this image. Now think about this one: there's a single eye and there's an ear. What is it about our brain that allows us to suppose, with some high degree of accuracy, that there's a cat here in this picture? I mean, the degree of occlusion here is immense. And this one — some of you will think this is in fact a monkey eating a banana, but I would venture to say that most of us are able to tell it's nevertheless a cat. You could watch this for hours. So let me give you another one. This is a paper that's often cited — a set of papers — to illustrate how difficult computer vision is, how thin the line is that we're walking with all of these impressive results we've been able to show recently in the machine learning community. In this case, the "Deep Neural Networks are Easily Fooled" paper — a seminal paper at this point — shows that when
you apply a network trained on ImageNet — so, basically, on detecting cats versus dogs, or different categories inside images — you can find an arbitrary number of images that look like noise, up in the top row, where the algorithm used to classify those ImageNet images, cat versus dog, is able to confidently say — with 99.6% confidence or above — that it's seeing a robin, or a cheetah, or an armadillo, or a panda in that noise. So it's confidently saying, given this noise, that that's obviously a robin. You have to realize that the kind of processes it's using to understand what's contained in the image are purely a collection of patterns it has been able to extract from other images that have been annotated by humans — and that, perhaps, is very limiting when trying to create a system that's able to operate in the real world. This is a very clean illustration of that concept. And the same with those images below, where there are strong patterns — it's not even noise, strong patterns that have nothing to do with the entities being detected — again, that same algorithm confidently sees a penguin, a starfish, a baseball, a guitar in those patterns. More serious, for people designing robots like myself, on the sensor side, you can flip that and say: I can take an image and distort it with a very small amount of noise, and if that noise is applied to the image, I can completely change the confident prediction about what's in that image. To explain what's being shown: in the column on the left, the same kind of neural network is able to predict, accurately and confidently, that there is a dog in that image. But if we apply just a little bit of noise to that image — imperceptible to our human eyes, the difference between those two — the same algorithm is confidently saying that there is an ostrich in that image.
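The "little bit of noise flips the prediction" effect can be sketched on a toy linear classifier. This is a minimal FGSM-style construction, not the actual ImageNet experiments from the papers cited; the dimensions, weights, and numbers are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def confidence(w, x):
    """Toy linear classifier: probability that image x contains a 'dog'."""
    return 1.0 / (1.0 + np.exp(-(w @ x)))

# A random 1024-pixel 'image' and toy learned weights.
w = rng.normal(scale=0.1, size=1024)
x = rng.normal(size=1024)
x += w * (3.0 - w @ x) / (w @ w)   # shift x so the model is confident: logit = 3

before = confidence(w, x)          # ~0.95: "definitely a dog"

# FGSM-style perturbation: move every pixel one tiny step against the gradient.
eps = 0.1
x_adv = x - eps * np.sign(w)

after = confidence(w, x_adv)       # confidence collapses
print(f"before={before:.3f}  after={after:.3f}  "
      f"max pixel change={np.abs(x_adv - x).max():.3f}")
```

The per-pixel change is tiny, but because every pixel is nudged in exactly the direction the model is sensitive to, the thousands of tiny nudges add up and the prediction flips — the same mechanism, in miniature, as the dog-to-ostrich example.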
That noise can have such a significant impact on the prediction of these algorithms is something to really think about. Quite honestly, out of all the things I'll say today, this is, as far as I'm aware, one of the biggest challenges of machine learning being applied in the real world: robustness. How much noise can you add into the system before everything falls apart? How do you validate sensors? Say a car company has to produce a vehicle, and it has sensors in that vehicle. How do you know those sensors will not start generating slight noise due to interference of various kinds, and that because of that noise, instead of seeing a pedestrian you will see nothing — or the opposite, you'll see pedestrians everywhere? Of course, the most dangerous case, for cars, is when it does not see an object and collides with it. There's also spoofing, which a lot of people — as always with security people — are really concerned about, and perhaps people here are too; I think it's a really important issue. Because you can apply noise and convince the system that it's seeing an ostrich when there is in fact no ostrich, you can do the same thing in an attacking way. You can attack the sensors of a car — lidar spoofing, for example — spoof lidar, radar, or ultrasonic sensors into believing that you're seeing pedestrians when they're not there, and the opposite: hide pedestrians, make pedestrians invisible to the sensor when they're in fact there. So whenever you have intelligent systems operating in this world, they become susceptible to the fact that so much of the work is done in software and based on sensors. At any point in the chain, if there's a failure, you have to be able to detect that failure — and right now we have no mechanisms for automatically detecting it. Now, on the data side: one challenge we're constantly dealing with is that the machine learning
algorithms we're using need labeled data, and we have very little labeled data. Labeled data, again, is when you have pairs of input data and the ground truth — the true label, the annotation, the class that that image belongs to (and it doesn't have to be an image; it could be any source of data). It's a really costly process to produce. Every breakthrough we've had so far relies on labeled data, and because of its cost we don't have much of it. So all the problems that come from data can either be solved by having a lot more of this data — which most people believe is too challenging; it's too expensive to have human beings annotate huge amounts of data — or we have to develop algorithms that are able to do something with the unlabeled data. That's the unsupervised, semi-supervised, sparsely supervised, reinforcement learning we talked about last time, which I mention again here. One way you understand something about data when you don't have labels is to reason about it. All you're given is a few facts — when you're a baby, your parents give you a few facts, and you go into this world with those facts and you grow your knowledge graph, your knowledge base, your understanding of the world, from those few facts. We don't have a good method of doing that in an automated, unrestricted way. There's also the inefficiency of our learners: the machine learning algorithms I've talked about, the neural networks, need a lot of examples of every single concept they're given in order to learn anything about it. Thousands, tens of thousands of cats are needed to understand the spatial patterns, at every level, of the visual representation of a cat. We can't do anything with a single example — there are a few approaches, but nothing quite robust yet. And we haven't come up with a way — if this is even possible — to make this annotation, this labeling process, somehow very cheap. Leveraging this is something called human computation.
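The labeled-data setup — pairs of inputs and ground-truth labels — can be seen in the simplest possible learner, a 1-nearest-neighbor classifier that literally just stores its training pairs and answers by lookup; it is also the bluntest example of the "memorizing patterns" view of machine learning. The data here is invented:

```python
import numpy as np

# Labeled data: each input (2 features) is paired with a ground-truth label.
inputs = np.array([[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],   # 'cat' examples
                   [5.0, 4.8], [5.2, 5.1], [4.9, 5.3]])  # 'dog' examples
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

def nearest_neighbor(query):
    """1-NN: pure memorization -- answer with the label of the closest example."""
    distances = np.linalg.norm(inputs - query, axis=1)
    return labels[int(np.argmin(distances))]

print(nearest_neighbor(np.array([1.0, 1.0])))  # lands near the 'cat' cluster
print(nearest_neighbor(np.array([5.0, 5.0])))  # lands near the 'dog' cluster
```

Every supervised method, deep networks included, starts from this same contract — human-annotated (input, label) pairs — which is exactly why the cost of annotation matters so much.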
That term has fallen out of favor a little bit, but one of my big passions is human computation: using something about our behavior, something about what we do in this world, online or in the real world, to annotate data automatically. For example, as you drive — which is something nearly everybody has to do — we can collect data about your driving in order to train self-driving vehicles to drive, and that's free annotation. Here are the annotated datasets we have, the supervised learning datasets. There are many, but these are some of the more famous ones, from the toy datasets like MNIST to the large, broad, arbitrary-category image datasets, which is what ImageNet is. There are datasets in healthcare, in audio, in video — a huge number of datasets now — but each of them is usually on the scale of hundreds of thousands, millions, tens of millions, not the billions or trillions we need to create systems that operate in the real world. And again, these are the kinds of machine learning algorithms we have — five are listed here, with the input each requires for training shown on the left. Supervised learning, at the very top, is where we have all of our successes, and everything else is where the promise lies: the semi-supervised, the reinforcement, or the fully unsupervised learning, where the input from the human is very minimal. Another way to think about this: whenever somebody talks about machine learning today, what they're talking about is systems that memorize — that memorize patterns. And this is one of the big criticisms of current machine learning approaches: all they're doing is memorizing, so they're only as good as the human-annotated data they're provided. We don't have mechanisms for actually understanding. You can pause and think about this: in order to create an intelligent
system, it shouldn't just memorize; it should understand the representations inside that data in order to operate in the world. That's one of the open questions, and one of the challenges and opportunities for machine learning researchers today: to extend machine learning from memorization to understanding. This is that duck again — the reasoning: if you get information from the perception system that it looks like a duck, from the audio processing that it quacks like a duck, and from the video classification, the activity recognition, that it swims like a duck, the reasoning step is how to connect those facts to then say that it is, in fact, a duck. Okay, so that's the algorithm side and the data side. Now, here is one of the other reasons for the success of machine learning: compute — computational power, computational hardware — is at its core. Our algorithms have been the same since the 60s, since the 80s and 90s, depending on how you're counting; the big breakthroughs came in compute. There's Moore's law. Most of you know how the CPU side of our computers works: a single CPU is, for the most part, executing a single action at a time, in a sequence — sequential, very different from our brain, which is a massively parallelized system. Because it's sequential, the clock speed matters, because that's essentially how fast those instructions are able to be executed. And we're leveling off: physics is stopping us from continuing Moore's law. Intel and AMD are aggressively pushing Moore's law forward, and there's some promise it will actually continue for another ten or fifteen years. Then there's another form of parallelism, massive parallelism: the GPU. This is essential to the recent success of neural networks — the ability to utilize the inherently parallel architectures of graphics processing units, GPUs, the same thing used for video games. This is the reason Nvidia's stock is
doing extremely well: GPUs. It's the parallelism of basic computational processes that makes machine learning work on the GPU. One of the limitations of GPUs, one of the challenges in scaling them and bringing them into real-world applications, is power usage, power consumption. And so there are a lot of specialized chips, specialized just for neural network architectures, coming out — from Google with their tensor processing unit, from IBM, Intel, and so on. It's unclear how far this goes. This is the direction of trying to design an electronic brain that has the efficiency of the human brain — our brain is exceptionally efficient at running the neural networks in our heads, orders of magnitude more efficient than our computers are — and this is trying to design systems able to grow toward that efficiency. Why do you care about efficiency? For several reasons; one, of course, as I'm sure we'll talk about throughout this class, is battery usage in our smartphones. And then this is the big one: community. I think the big breakthroughs in machine learning in the last decade can be attributed to it — compute is important, algorithm development is important, but it's the community of nerds. Global — this is the global business of artificial intelligence, and I will show in several ways why global is essential here. It's tens of hundreds of thousands, millions, of programmers and mechanical engineers building robots, building intelligent systems, building machine learning algorithms. The exciting growth of the community is perhaps the key to unlocking the power of machine learning in the future. This is just one example: GitHub is a repository for code, and this is showing — along the bottom, from 2008, when GitHub first opened, going up to 2012 — near-exponential growth in the number of users participating and the number of repositories, the standalone, unique projects being hosted on GitHub.
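The sequential-CPU versus parallel-GPU contrast from a moment ago can be illustrated with the same neural-network layer written two ways: an explicit one-operation-at-a-time loop versus a single vectorized expression — the bulk form of work that GPUs (with NumPy's optimized backend standing in for them here) are built to execute in parallel:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))   # one layer's weights
x = rng.normal(size=256)          # one input vector

# Sequential, CPU-style: one multiply-accumulate at a time, in order.
def layer_sequential(W, x):
    out = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            out[i] += W[i, j] * x[j]
    return out

# Parallel-friendly: the whole matrix-vector product expressed at once,
# exactly the shape of work a GPU does across thousands of cores.
def layer_vectorized(W, x):
    return W @ x

a = layer_sequential(W, x)
b = layer_vectorized(W, x)
print(np.allclose(a, b))  # same numbers, very different execution model
```

Neural networks are almost entirely made of such matrix products, which is why hardware that can do millions of independent multiply-accumulates at once changed what was practical to train.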
So this is one example. I'll show you this competition that we've recently been running, and then I'll challenge people here to participate in it — if you dare. This is a chance for you to build a neural network in your browser; you can do it on your phone later tonight. You can specify various parameters of the neural network: the number of layers, the depth of the network, the number of neurons in the network, the type of layers. It's pretty self-explanatory, super easy in terms of tweaking little things — and remember, machine learning, to a large part, is an art at this point, perhaps more than a well-understood, theoretically grounded science, which is one of the challenges but also an opportunity. DeepTraffic — so, we've all been stuck in traffic. Americans spend 8 billion hours stuck in traffic every year; that's our pitch for this competition. Deep neural networks can help. You have a neural network that drives that little car with the MIT logo — the red one — on this highway, and it tries to weave in and out of traffic to get to its destination, trying to achieve a speed of 80 miles an hour, which is the physical speed limit of the car. Of course, the actual speed limit of the road is 65 miles an hour, but we don't care about that — we just want to get to work, or home, as quickly as possible. So that's the basic structure of this game. I want to explain it a little and then tell you how incredibly popular it's gotten, and how incredibly powerful the networks are that people from all over the world, the community, have built over a single month. It's incredible, and this happens for thousands of projects out there now. Another challenge and opportunity: okay, so you may have seen this — this is the kind of ethics question — I love philosophy, but this kind of construction of ethics
that's often presented here is not usually of concern to engineering. So what is this question? When you have a car and a bunch of pedestrians, do you hit the larger group of pedestrians or the smaller group? Do you avoid the group of pedestrians but put yourself in danger? These kinds of ethical questions for an intelligent system — it's a very interesting question, one we can debate, and there's really no good answer, quite honestly, but it's a problem that both humans and machines struggle with. It's not as interesting on the engineering side; on the engineering side we're interested in problems we can solve. The kind of problem that I am obsessed with and very interested in is the real-world problem of controlling a vehicle through this space. It happens in a few seconds here. This is a Manhattan, New York intersection: pedestrians walking perfectly legally — I think they have a green light — and of course there's a lot of jaywalking too. Well, this car just slid through — that's not part of the point, but yes, exactly, there's an ambulance. And there's another car that starts making a left turn in a little bit — I may have missed it, hopefully not — and then there's another car after that. It just illustrates, when you design an algorithm that's supposed to move through this space — watch this car, the aggression it shows. Now, this isn't a trivial example. For those that try to build robots, this is the real question: how do you design a system that's able to do this? You have to put in reward functions, objective functions, utility functions under which it performs the planning. A car like that has several thousand candidate trajectories it can take through that intersection. It can take a trajectory where it speeds up to 60 miles an hour, doesn't stop, and just swerves and hits everything — okay, that's a bad trajectory, right? Then there is the trajectory which most companies take,
which is most a Google self-driving car and every company that's is concerned about PR is whenever there's any kind of obstacle any kind of risk that's it all reasonable that you can maybe even touch an obstacle then you're not going to take that trajectory so what that means is you're going to navigate to this intersection at 10 miles an hour and you let people abuse you by walking in front of you because they know you're not going to stop and so in the middle there is hundreds thousands of trajectories that are ethically questionable in the sense that you're putting other human beings at risk in order to safely and successfully navigate to an intersection and the design of those objective functions is is the kind of question you have to ask for intelligent systems fork for cars is there's no grandma and a few children you have to choose who gets to die very very difficult problems of course but the problem of when I'm very interested in in streets of Boston streets of New York is how to gently nudge yourself through a crowd of pedestrians in the way we all actually do when we drive in New York in order to be able to safely navigate these environments and these questions come up in healthcare these questions come up in Factory in robust in in armed and humanoid robots that operate with other human beings and that's one of the big challenges another sort of funny illustration that folks that openly I use often to illustrate well let me just pause for a second the the gamified version of this there's a game called coast runners and you're you're racing against other boats along this track and your job is there's your score here at the bottom-left number of laps your time and you're trying to get to the destination as quickly as possible while also collecting funky little things like there's these green these green little things along the way okay so what they've done is the bill Denton system the one the general-purpose one that we talked about last time that learns 
oops that learns how to navigate successfully through the space so you're trying to maximize the reward and what this boat learns to do is instead of finishing the race it learns to find a loop it can keep going around and around collecting those green dots and it learns the fact that they regenerate with time so learns to maximize this score by going around and round now these are the kinds of things this is the big challenge of our award functions of designing systems of designing what you want your system to achieve is not only is it difficult to the ethical questions are difficult but just avoiding the pitfalls of local optima of vet figuring out something really good that happens in the short-term the greedy what it is that those psychology experiments of the kid eats the marshmallow and can't wait for you know can't delayed gratification this kind of the idea of delayed gratification in the case of designing intelligent system was a huge actual serious problem and this is a good illustration of that so we flew through a few concepts here is there any is there any questions about some of the compute and the algorithm side we talked about today yes so the question was yeah used you highlighted some of the limitations of machine computer vision algorithms machine learning algorithms but you haven't highlighted some of the limitations of human beings and if you put those in a column and you compare those it's our machines doing better overall or is there any kind of way to compare those I mean that there's actually interesting work on image net so image net is this categorization task of where you have to classify images and you can ask the question when I present you images of cats and dogs where our machine is better than humans and when when are they not so you can compare when machines do better what are the fail points and what are the fail points for humans and there's a lot of interesting visual perception questions there I think overall it's certainly 
true that machines fail differently than human beings but in order to make an artificial intelligence system that's usable and could make you a lot of money and people would want to use it has to be better for that particular task in every single way in order in order for you to want to use a system has to be it has to be superior to human performance and usually far superior to human performance so so it's on the philosophical level it's an interesting thing to compare what are we good at what are not but if you're using Amazon echo your voice recognition or any kind of natural language chatbots or a car you're not gonna be well this car is not so good with pedestrians but I appreciate the fact that you can stay in the lane fortunately you have a very high standard for every single thing that you're good at and it has to be superior to that I I think maybe maybe that's unfair to the robots I'm more of the nerd that makes the technology happen but it's certainly on the self-driving car aspect policy is probably the biggest challenge and I don't think there's good answers there some of those ethical questions that come up well it's it's it feels like so we work a lot with Tesla in Drive so I'm driving a Tesla round every day and we're playing around with it and studying human behavior inside Tesla's and it seems like there's so much hunger amongst the media to jump on something and it feels like a very shaky PR terrain a very shaky policy terrain we're all walking because we have no idea how how we coexist with intelligent systems and so and and then of course government is nervous because how to regulate the shaky terrain and everybody's nervous and excited so I'm not sure there's no same kind of question to Jason a moment thanks a lot legs for another great session [Applause]
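The Coast Runners behavior described in the lecture is an instance of what's often called reward hacking: the agent greedily maximizes the score you wrote down rather than the outcome you wanted. A minimal sketch of the idea in Python (the loop/race setup and all reward numbers here are invented for illustration, not taken from the actual game):

```python
# Toy illustration of "reward hacking": a myopic (greedy) agent keeps
# circling a loop of small regenerating rewards instead of taking the
# longer path to the big finish-line reward.

def total_reward(policy, steps=20):
    """Simulate a boat that, at each step, either loops (collects a
    regenerating green dot worth +1) or races (moves one step toward
    the finish line, which pays +50 once reached after 10 steps)."""
    score, progress = 0, 0
    for _ in range(steps):
        action = policy(progress)
        if action == "loop":
            score += 1              # small immediate reward, available forever
        else:
            progress += 1
            if progress == 10:      # finish line reached
                score += 50
    return score

# Greedy policy: always take the largest next-step reward (the loop).
greedy = lambda progress: "loop"
# Patient policy: defer reward until the race is finished.
patient = lambda progress: "race" if progress < 10 else "loop"

print(total_reward(greedy))   # loops the whole time: 20 points
print(total_reward(patient))  # finishes first, then loops: 60 points
```

The patient policy forgoes small guaranteed rewards for a larger delayed one; a purely myopic maximizer never will. This is the delayed-gratification problem from the lecture in miniature, and it is one reason reward design and discounting get so much attention in reinforcement learning.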
Info
Channel: Lex Fridman
Views: 39,553
Rating: 4.7052097 out of 5
Keywords: mit, deep learning, research, machine learning, artificial intelligence
Id: s3MuSOl1Rog
Length: 88min 53sec (5333 seconds)
Published: Sun Dec 24 2017