Andrew Ng: Deep Learning, Self-Taught Learning and Unsupervised Feature Learning

Video Statistics and Information

Reddit Comments

Does anyone have a link to the next talk he referred to at the end of the video, which he gave shortly afterwards?

πŸ‘οΈŽ︎ 6 πŸ‘€οΈŽ︎ u/justonium πŸ“…οΈŽ︎ Aug 14 2013 πŸ—«︎ replies

Mah Ng-a.

πŸ‘οΈŽ︎ 6 πŸ‘€οΈŽ︎ u/alexgmcm πŸ“…οΈŽ︎ Aug 14 2013 πŸ—«︎ replies

At 44:05:

I think someone actually is working on quantum computing, yes.

More info: /r/dwave

πŸ‘οΈŽ︎ 5 πŸ‘€οΈŽ︎ u/Slartibartfastibast πŸ“…οΈŽ︎ Aug 14 2013 πŸ—«︎ replies

I love Ng so much. If anyone has any other neat or related material or resources, please post a comment.

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/robotghostd πŸ“…οΈŽ︎ Aug 14 2013 πŸ—«︎ replies

Here's how to pronounce Ng.

https://www.youtube.com/watch?v=SNYOEgMeSvM

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/zpmorgan πŸ“…οΈŽ︎ Aug 14 2013 πŸ—«︎ replies

Deep stuff and NN are lame. Star trek guys love them and think they attract girls. I do Bayesian. It's classic and deep enough for me, and much less pretentious. Check out your posteriors kids. Oh please...

πŸ‘οΈŽ︎ 1 πŸ‘€οΈŽ︎ u/drki56 πŸ“…οΈŽ︎ Aug 16 2013 πŸ—«︎ replies
Captions
What I want to do is set out, maybe, a high-level vision and share with you some of the ideas I think are the big ideas in feature learning and deep learning. I think the agenda of deep learning is the idea of using brain simulations to make learning algorithms much better and easier to use, and also to make revolutionary advances in machine learning and AI. I'll come back to this later, but once upon a time, I guess when I was in high school, I joined the field of machine learning because I wanted to work on AI. Somehow that got lost, and instead of actually doing AI we wound up spending our lives doing curve fitting, which is not what I signed up to do. Deep learning was, for the first time in many years, what made me think about the bigger dreams again; I'll come back and say a bit more about that. And I should say the vision and ideas I'm going to share are really not mine alone: I think they're shared by a large community including Yann LeCun, Geoff Hinton, Yoshua Bengio and many others that you'll hear from in the next couple of weeks.

What do we want computers to do with our data? We want to look at images and label them, listen to audio and do speech recognition, take text and do things with text. It turns out that machine learning is our best shot at most of these applications today, but it is very difficult to get these applications to work. A while back I asked some of my students at Stanford to use a state-of-the-art computer vision algorithm to write a motorcycle detector, and this was the result we got; this is typical in computer vision. Even though the learning algorithm works, each of those lines is something like six months to two years of work for a team of engineers, and we would like these algorithms to be less work to build and also, maybe, to perform better. So let me start to explain some of these ideas using computer vision, and then I'll talk a bit about audio and about applying these algorithms to other modalities as well.

So why is this problem hard? Obviously that's a motorcycle; how on earth could a computer fail to recognize what it is? Zoom into a small part of the image, into where that little red square is: where you and I see a motorcycle, the computer sees this. The computer vision problem is to look at all those pixel intensity values and tell you that all those numbers represent the exhaust pipe of a motorcycle. It seems like you need a very complicated function to do that. So how do we do this? Machine learning people like me say, oh, just feed the data to the learning algorithm and let it do its job. When I teach my machine learning class I draw pictures like this, and this is just not how it works. Let's pick a couple of pixels and plot some examples. Take that image there: because pixel one is relatively dark and pixel two is relatively bright, that image gets that position in this figure. Now take a different motorcycle image: this one has a bright pixel one and a darker pixel two, so the second image gets plotted at a different location. Then do this for a few negative examples as well, non-motorcycles. What you find is that if you plot a set of positive and negative examples, motorcycle and non-motorcycle images, they are extremely jumbled together, and if you feed this data to, say, a linear classifier, it doesn't work. So what is done instead in machine learning is the following: wouldn't it be nice if you could come up with what's called a feature representation?
If you could write a piece of code that tells you, does this image have handlebars in it, does this image have tires or wheels in it, then your data would look more like what's on the lower right, and it becomes much easier for, say, a linear classifier like a support vector machine or logistic regression to distinguish the motorcycles from the non-motorcycles. But the story goes on. In this illustrative example we're saying, wouldn't it be nice if we could write a piece of code to tell us whether there are handlebars and wheels; the trouble is we don't actually know how to do that. So in computer vision, what is actually done is the following; this is how people really come up with features. It's a notional, illustrative example, but here is the idea: take the image and detect edges at four different orientations, so look for vertical edges, horizontal edges, 45-degree edges and 135-degree edges. Then the number 0.7 in the upper right means that the density of vertical edges in the upper right-hand quadrant of my image is 0.7, and the number down here says that the density of horizontal edges in the lower right-hand quadrant of my image is 0.5. In case you're getting the sense of "what on earth is going on here, this seems horribly complicated, how would anyone come up with this piece of code": that's exactly the point. Sadly, this is the way a lot of computer vision is done today. (If it helps, there's a rough sketch in code of this kind of hand-designed feature extractor just below.)

More broadly, this notion of a feature representation is pervasive throughout machine learning. I live in Silicon Valley, and if you walk around Silicon Valley and look at where people are spending all the engineering time, it is very often in coming up with these feature representations.
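[Editor's note: a minimal sketch, in Python, of the kind of hand-designed edge-density feature extractor described above. The four orientations and the quadrant layout follow the slide; the filters, the threshold, and the function name edge_density_features are illustrative assumptions, not anyone's actual pipeline.]

# Count the density of edges at four orientations in each quadrant of an image.
import numpy as np
from scipy.signal import convolve2d

# Simple oriented difference filters (vertical, horizontal, 45-degree, 135-degree edges).
FILTERS = {
    "vertical":   np.array([[-1.0, 1.0]]),
    "horizontal": np.array([[-1.0], [1.0]]),
    "diag_45":    np.array([[0.0, 1.0], [-1.0, 0.0]]),
    "diag_135":   np.array([[1.0, 0.0], [0.0, -1.0]]),
}

def edge_density_features(image, threshold=0.2):
    """Return a 16-dim vector: edge density for each of 4 orientations
    in each of 4 quadrants of a grayscale image with values in [0, 1]."""
    h, w = image.shape
    features = []
    for filt in FILTERS.values():
        response = np.abs(convolve2d(image, filt, mode="same"))
        edges = response > threshold                      # binary edge map for this orientation
        for rows in (slice(0, h // 2), slice(h // 2, h)):
            for cols in (slice(0, w // 2), slice(w // 2, w)):
                features.append(edges[rows, cols].mean()) # fraction of edge pixels in quadrant
    return np.array(features)

# Usage: features = edge_density_features(np.random.rand(64, 64))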
So let's delve deeper: where do these features come from? Since they are the primary lens through which our algorithms see the world, this gives them a certain importance, and this is true not just for vision but for audio, text and other applications too. Where do the features come from in computer vision? The state-of-the-art answer is that teams of tens, hundreds, maybe thousands of computer vision researchers have spent decades of their lives hand-engineering features. The figure on the upper left is one I took from the SIFT paper; SIFT is the single most highly cited paper in computer vision over roughly the last 15 years, and I've read the paper maybe five times now and I still have no idea what it's doing. It is such a complex piece of code that David Lowe, a good friend, will tell you this himself: it took David literally ten years, I'm not kidding, he'll say ten years himself, of fiddling with pieces of the code to come up with the SIFT feature, which works pretty well. But you have to ask: is there a better way to design features than this?

That's vision; how about audio? Same thing: teams of tens, hundreds or thousands of audio researchers working on features for audio. MFCC is shown on the upper right; it's actually pretty clever and surprisingly hard to beat, but again, honestly, to this day I have a hard time understanding what some bits of the MFCC algorithm are doing. And natural language: in fact I think most of natural language processing today is unapologetically about finding better features. Think about parsers. There's a lot of NLP work on parsers, a piece of software that tells you where the noun phrases are in your sentence. Why on earth do I care where the noun phrases are in my sentences? I really don't need software to tell me that. The only reason we spend so much time working on parsers is that we hope this will give us useful features to feed to some later downstream application, like anti-spam, web search or machine translation, that we actually care about. So coming up with features is difficult, time-consuming, and requires expert knowledge. If you look at applied machine learning, at companies where people actually do applied machine learning, this is really where the vast majority of the time goes: coming up with features. Can we do better?

The next piece motivating a lot of deep learning is this. Like many people, and probably like many of you, I tend to treat biological inspiration with a great deal of caution and even a healthy dose of skepticism, but for me a lot of my thinking about deep learning has taken inspiration from biology, so let me share some ideas from that biological inspiration. There's a fascinating hypothesis that much of human intelligence can be explained by a single learning algorithm; this is called the "one learning algorithm" hypothesis. Let me share some evidence for it. There's an experiment, first done on ferrets at MIT. That red piece of brain tissue shown on the slide is your auditory cortex; the way you are understanding my words right now is that your ears route the sound signal to that piece of red brain tissue, it processes the sound, and that's how you eventually understand what I'm saying. Neuroscientists did the following experiment: cut the wire between the ears and the auditory cortex and do what's called a neural rewiring experiment, so that eventually the signal from the eyes gets routed to the auditory cortex. It turns out that if you do this, that red piece of brain tissue learns to see. And by the word "see": this has been replicated in multiple labs on four species of animals, and these animals can "see" in every single sense of the word that I know how to use it. They can do visual discrimination tasks; they can look at things and make correct decisions based on an image in front of them, using that rewired piece of brain tissue. Another example: this other piece of brain tissue is your somatosensory cortex, responsible for your sense of touch; do a similar neural rewiring experiment and your somatosensory cortex learns to see as well. So, more generally, the idea is that if the same physical piece of brain tissue can process sight or sound or touch or maybe even other things, then maybe there is a single learning algorithm that can process sight or sound or touch, and if we can discover some approximation to that algorithm, or even a totally different algorithm that accomplishes the same thing, then that might be a better way to make progress in AI than hand-engineering separate pieces of code for each individual application silo, which is what we have been doing for decades.
Now, just a few more fun examples. It turns out you can plug other sensors into the brain and the brain kind of figures out how to deal with it. Shown on the upper left is seeing with your tongue. This is actually undergoing FDA trials now to help blind people see; it's a system called BrainPort. The way it works is you strap a camera to your forehead; it takes a low-resolution grayscale image of what's in front of you and runs a wire to a rectangular array of electrodes that you place on top of the tongue, so that each pixel maps to a point on your tongue: maybe a high voltage is a bright pixel and a low voltage is a dark pixel. Even as adults, you and I today would be able to learn to see with our tongues in something like ten or twenty minutes. Human echolocation: you snap your fingers or click your tongue, and there are actually schools today training blind children to interpret the pattern of sounds bouncing off the environment as human sonar. A haptic belt: you wear a ring of buzzers around your waist, program the one facing north to buzz, and you get a direction sense, you just magically know which way is north, similar to how birds sense direction. You can surgically implant a third eye into a frog and the frog learns how to use it. It doesn't work in every single instance, there are cases where this fails, but to a surprisingly large extent it's almost as if you can plug not quite any sensor, but a large range of sensors, onto almost any part of the brain, and the brain kind of learns to deal with it. Wouldn't it be cool if we could get a learning algorithm to do the same?

So let's take a break here. I think you now know enough to look at questions one through three in the handout. Take a few minutes, write down what you think the right answer is, and when you've done so, discuss what you wrote with your neighbors and see if you agree or disagree. For question one I had D; for question two I had "auditory cortex learns to see"; and for question three, different people have different ideas. I tend to use the wording that much of human intelligence can be explained by a single learning algorithm, but there are lots of other ways of describing it.

All right. Given all this, what are the implications for machine learning? We think that our visual system computes an incredibly complicated function of the input: it looks at all those pixel values and tells you that that's a motorcycle exhaust pipe. There are two approaches we could take to build such a system. You could try to directly implement this complicated function, which is what I think of as the hand-engineering approach, or you could try to learn this function instead. As a side comment, maybe only for the machine learning aficionados: if you look at a trained learning algorithm, an algorithm after it has trained, with all the parameter values, it's a very complex thing, but the learning algorithm itself is relatively simple; most learning algorithms are something like half a page of pseudocode. So the complexity of the things we train usually comes from the complexity of the data rather than the complexity of the algorithm, and that's a good thing, because complex data, images and so on, is all around us, whereas coming up with complex algorithms is hard.
So here's a problem I posed a few years ago: can we learn a better feature representation for vision, or audio, or what have you? Concretely, can you come up with an algorithm that just examines a bunch of images like these and automatically comes up with a better way to represent images than the raw pixels? And if you can do that, maybe you can apply the same algorithm to audio, have it train on a bunch of audio clips, and have it find a better way to represent audio than the raw data. Let's write down the mathematical formalism of this problem. Given a 14x14 image patch x, one way to represent it is with a list of 196 real numbers corresponding to the pixel intensity values. The problem we want to pose is: can we come up with a better feature vector to represent those pixels?

If you can do so, then here's what you can do; this is a problem called self-taught learning. In traditional machine learning, if you want to learn to distinguish motorcycles from non-motorcycles, you have a training set with some number of labeled examples, and this is a pain, because it's a lot of work to come up with a lot of pictures of motorcycles, like tens of thousands of them. In the unsupervised feature learning, or self-taught learning, problem, what we do instead is give you a large source of unlabeled images, effectively an infinite source, because of the web, where we all have an essentially unlimited supply of images. The task is: can all those random images up there, pictures of trees and sunsets and horses and so on, somehow help you do a better job of figuring out that this picture down here is a motorcycle? One way to do that is to have an algorithm look at those unlabeled images and learn a much better representation of images than the raw pixels, and if that superior representation then allows us to take a small labeled training set and do a much better job of figuring out what's in the test images, we've won.

In machine learning there are a few common formalisms. There's the supervised learning setting, which is the oldest, most standard one that most of you know best. Say the goal is to distinguish between cars and motorcycles: in the standard, old-school, thirty to fifty year old supervised learning setting, you need a large labeled training set of cars and motorcycles. About 10 to 15 years ago, people like Andrew McCallum and Tom Mitchell, and maybe others before them, started to talk about semi-supervised learning, the idea of using unlabeled data, and that was exciting. But in semi-supervised learning as it is usually conceived, the unlabeled data is still all images of either cars or motorcycles, and it turns out this model is not widely used, because rarely do you have a dataset where every image is either a car or a motorcycle and nothing else, with the only thing missing being the label.
Whereas in what I call self-taught learning, the goal is to take totally random images, maybe cars, maybe motorcycles, maybe totally other random things, and somehow use them to learn to distinguish cars from motorcycles. One way I like to think about it is this: the first time a child sees a new object, or someone invents a new vehicle, say the first time you and I saw a Segway, we learned to recognize it very quickly, just from seeing it once. I think the reason we learn to recognize a Segway so quickly is that your visual system and mine had, prior to that, several decades of experience looking at random natural images, just seeing the world, and it was looking at those random unlabeled images that allowed your visual system and mine to learn enough about the structure of the world to come up with better features, if you will, so that the first time you saw a Segway you very quickly learned to recognize what it is. Just to make sure you've got this concept, could you please look at question four and map this onto a new example? All right, someone call out the answers: first part, second part, third part. Awesome, that was easy.

So how do we actually do this? To come up with an algorithm that learns features, let's turn one last time to biological motivation. It turns out that when your brain gets an image, the first thing it does is look for edges in the image. The first stage of visual processing in the brain is called visual cortical area V1, as I think was mentioned yesterday, and the first thing it does is look for edges, or lines; I'll use the terms lines and edges interchangeably. So in your brain right now there's probably a neuron that is looking for a 45-degree line, a 45-degree edge like the one shown on the left, with a dark region next to a bright region, and there's probably a different neuron in your brain right now looking for a vertical line like this one right here. How can we get our software to mimic the brain and also find edges like this? What we don't want to do is code this up by hand, that is, ask the neuroscientists and then work really hard to hand-engineer software that replicates what they describe. What I think is much more interesting is whether we can have an algorithm learn these things by itself. And there is such an algorithm; it's a fairly old one now, something like sixteen years old, due to Olshausen and Field, called sparse coding. You heard about this a bit yesterday, right? Cool, so I'll go through it very quickly.

Sparse coding was originally conceived as a theoretical neuroscience model; Bruno Olshausen will tell you he never envisioned it being used as a machine learning algorithm. It was a theoretical neuroscience result meant to explain computations in the brain. Here's how the algorithm works; it's an unsupervised learning algorithm. We are given a set of m images x(1), x(2), up to x(m), where each input example is, say, an n-by-n image patch, for example a 14x14 patch. What sparse coding does is learn a dictionary of basis functions phi_1, phi_2, up to phi_k, such that each training image x can be approximately written as a linear combination x ~= sum_j a_j * phi_j, subject to the constraint that the coefficients a_j are mostly zero, that is, sparse. The way this is implemented is with an L1 penalty: you minimize the reconstruction error plus the sum of the absolute values of the coefficients a_j, the sparsity penalty. And that, I think, is the only equation I have for this first hour, so I hope you enjoyed it.
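[Editor's note: a minimal sketch of what this looks like in practice, assuming scikit-learn's MiniBatchDictionaryLearning, which optimizes essentially this reconstruction-plus-L1 objective. The patch data, patch size, and number of basis functions are illustrative choices, not the settings from the talk.]

# Learn a sparse-coding dictionary from unlabeled image patches, then decompose
# a new patch into a sparse combination of the learned bases.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
# Stand-in for 14x14 patches cropped from unlabeled natural images,
# flattened to 196-dimensional vectors and roughly normalized.
patches = rng.rand(5000, 14 * 14)
patches -= patches.mean(axis=1, keepdims=True)

# Learn k = 64 basis functions phi_1..phi_64 with an L1 sparsity penalty (alpha).
learner = MiniBatchDictionaryLearning(n_components=64, alpha=1.0,
                                      transform_algorithm="lasso_lars",
                                      transform_alpha=1.0, random_state=0)
codes = learner.fit(patches).transform(patches)   # the a_j coefficients, mostly zero
dictionary = learner.components_                  # rows are the learned bases phi_j

# A new test patch is now represented by its sparse coefficient vector instead of
# its 196 raw pixels; the reconstruction is codes @ dictionary.
test_patch = rng.rand(1, 14 * 14)
test_codes = learner.transform(test_patch)
reconstruction = test_codes @ dictionary
print("nonzero coefficients:", np.count_nonzero(test_codes))

On real natural-image patches (rather than the random stand-in data above), the rows of the learned dictionary typically come out looking like oriented edge detectors, which is the phenomenon the talk describes next.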
Now the same thing in pictures. If you train sparse coding on natural images, every single time you run it you'll learn a set of basis functions that look a lot like the edge detectors we believe visual cortical area V1 is looking for. Then, given a test image x, what it will do is select, say, three of my 64 basis functions, take that test example, and explain it, decompose it, as a linear combination of, in this case, just three out of the 64 basis functions. Speaking loosely, this algorithm has "invented" edge detection: it was free to choose absolutely any basis functions it wanted, but every time you run it, it chooses to learn basis functions that look like these edges. And what the decomposition says is that this image x is 0.8 times basis number 36, plus 0.3 times basis number 42, plus 0.5 times basis number 63. So it has decomposed the image in terms of which edges appear in it, and this gives a higher-level, more succinct, more compact representation of the image, and probably a more useful one, because it's more useful to know where the edges are than where the pixels are. Moreover, this gives us an alternative way to represent the image: instead of a list of 196 pixel values, we can use the vector of numbers a_1 through a_64, the coefficients multiplying the basis functions.

Just a few more examples. The method invented edge detection and learned to represent an image in terms of the edges that appear in it, and it turns out neuroscientists have done quantitative comparisons between sparse coding and visual cortical area V1 and found that it is by no means a perfect explanation of V1, but it matches surprisingly well on not all, but many, dimensions. So that's vision; how about other input modalities? This is a slide I got from Evan Smith, from his PhD thesis work with Michael Lewicki. What Evan did was apply sparse coding to audio data, and shown here are 20 basis functions learned by sparse coding when trained on natural sounds, a 5-by-4 grid of audio basis functions. He then went to the literature on the cat auditory system, since biologists in Boston had been using electrode recordings to figure out what early auditory processing in the cat does, and for each of the 20 things learned by his algorithm he found the closest match in the biological data; the closest matches are shown in red. So the same algorithm that on the one hand gives a decent explanation of early visual processing also gives a, by no means perfect, but reasonable, explanation of early auditory processing. And it turns out you can do a similar study on somatosensory processing, touch. This is work done by Andrew Saxe at Stanford, where he collected touch data.
How do you collect touch data? We hold things in our hands all the time, I'm holding this object right now, but how do you actually collect data on how I'm holding it? The way Andrew did it was to take a glove and an object and spray talcum powder all over the object; when you put on the glove, hold the object, and then let go, the pattern of talcum powder on your glove tells you where you came into contact with the object, and the density of the powder even corresponds a little bit to the pressure. Then, what kinds of objects should you hold? You want the data to be representative of what animals actually do, and fortunately it turns out there were two biologists who had spent about a year of their lives sitting on some island watching monkeys and carefully documenting every single way that monkeys pick up different things. Thank goodness I'm a computer scientist. So Andrew took that distribution and, wearing his glove, picked up objects using the same distribution of grasps as was documented in those monkeys, and that was his data. I think that story is pretty fun, if perhaps unnecessary. Training on data like this, the basis functions learned by sparse coding are by no means a perfect match to what is believed to happen in somatosensory cortex, but they are a surprisingly good match along many dimensions.

So that's sparse coding. Could you do questions five and six on the handout? What's the answer for five and six? Right: the bases are the same for every image, but the coefficients vary, and those coefficients are the features for the specific image.

All right, so that's sparse coding, and it turns out there are different ways to implement it. What I just described is roughly the original formulation from Olshausen and Field; there are different ways now. I think Yann talked about encoder-decoder architectures, and I'll talk more about that later today, but this intuition of learning sparse features has been one of the key ideas that lets us learn very useful features even from unlabeled data; I'll come back to this later as well. There are other ways to do it: how many of you have heard of ICA, independent components analysis? Cool, all of you, awesome. It turns out there's a deep mathematical relationship between ICA and sparse coding; the two algorithms are doing something very similar, and for me personally, these days I tend to use the ICA-like version of sparse coding rather than the version I just described. Later today I'll also talk about sparse autoencoders, yet another way of learning sparse features; we'll get to that. So what I've just described is one layer of one of these sparse feature learning algorithms, maybe sparse coding, maybe a sparse autoencoder, maybe a sparse DBN or sparse RBM.
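[Editor's note: to connect this back to the self-taught learning setup from earlier, here is a minimal sketch of that pipeline: learn the dictionary from plentiful unlabeled data, then use the sparse coefficients, rather than raw pixels, as features for a small labeled training set. The classifier choice and the stand-in data are illustrative assumptions, not the talk's experiments.]

# Self-taught learning sketch: features learned from unlabeled data,
# classifier trained on a small labeled set represented by those features.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Step 1: learn a feature representation from plentiful UNLABELED patches.
unlabeled = rng.rand(5000, 14 * 14)
learner = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0)
learner.fit(unlabeled)

# Step 2: re-represent a SMALL labeled set (e.g. motorcycle vs. not) with the
# learned sparse coefficients instead of raw pixels, and train a simple classifier.
labeled_pixels = rng.rand(200, 14 * 14)           # stand-in for labeled images
labels = rng.randint(0, 2, size=200)              # stand-in labels
labeled_features = learner.transform(labeled_pixels)
clf = LogisticRegression(max_iter=1000).fit(labeled_features, labels)

# Step 3: classify a new test image via the same learned representation.
test_features = learner.transform(rng.rand(1, 14 * 14))
print("predicted label:", clf.predict(test_features))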
And it turns out, building on Geoff Hinton's work, that you can recursively apply this procedure. Instead of just going from pixels to edges, you apply the same idea again: just as you group together pixels to form edges, you can group together edges to form combinations of edges, and group together combinations of edges to form higher-level features. Let me show an example, run by Honglak Lee, who's now a professor at Michigan. He trained one layer of a sparse DBN, and at the first layer the algorithm learned to group together pixels to form edges; one level up, it learned to group together edges to form models of object parts. I should say this example was trained just on pictures of faces; the entire dataset was faces. Then, recursively applying this at the next level up, it learned more complete models of faces. Let me make sure this visualization makes sense. When I show this little square here, it means I have learned a neuron in the first level that is looking for a vertical edge like that one. Going one level up, and note that I've drawn all these rectangles the same size even though the higher-up features are actually looking at bigger regions of the image, this rectangle means that at the next level one of the neurons has learned to detect eyes that look like that. And at the highest level, if you look at the upper-leftmost square, the visualization is showing a neuron that has learned to detect faces that look a bit like that person. If you train the same algorithm on different object classes, you end up with different decompositions of those classes into different object parts and then more complete models of the objects. And if you train the algorithm on a mix of four different classes, here a dataset that includes cars, faces, bikes and airplanes, then at the mid level you get features that are shared among the different object classes, where maybe cars and motorbikes both have wheel- or tire-like shapes, features shared between multiple classes, and at the highest level you get object-specific features. Yes, there is a point there: because of the nature of the visualization I've shown these as though they were images, but there is some amount of invariance that's hard to visualize; I have a better example later today where we more carefully document the invariances.
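[Editor's note: a crude sketch of the "recursively apply the procedure" idea, learning features of features layer by layer. The talk's example uses a sparse DBN; the stack of dictionary learners below is an illustrative stand-in, not the method from the slides.]

# Greedy layer-wise feature learning: learn features, re-represent the data,
# then learn features of those features.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
patches = rng.rand(5000, 14 * 14)            # stand-in for unlabeled image patches

# Layer 1: pixels -> edge-like features (sparse codes over a learned dictionary).
layer1 = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0)
codes1 = layer1.fit(patches).transform(patches)

# Layer 2: a dictionary over layer-1 codes, i.e. combinations of edges
# ("object parts" in the face example from the talk).
layer2 = MiniBatchDictionaryLearning(n_components=32, alpha=1.0, random_state=0)
codes2 = layer2.fit(codes1).transform(codes1)

# A new patch is represented by pushing it through both learned layers.
new_patch = rng.rand(1, 14 * 14)
deep_features = layer2.transform(layer1.transform(new_patch))
print(deep_features.shape)                   # (1, 32)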
So what is all this good for? When you hear deep learning researchers like me talk, you see people like Yann and me and Geoff Hinton telling these stories about how you can learn features, but what is it actually good for? Well, take the Hollywood2 benchmark, a standard benchmark in computer vision where the task is to watch a short video clip and decide whether any of a small number of activities took place in it: whether two people kiss or hug, whether someone is driving, eating, or running, and a handful of other activities like that. The computer vision community has tried out many different combinations of features, and last year one of my students at Stanford found that by learning, rather than hand-engineering, the features, he was able to significantly outperform the previous state of the art.

How about audio? You can apply similar ideas there too. This is a spectrogram, which is a different representation for audio; you can take slices of spectrograms and apply sparse coding to them. If you do this, this is the dictionary of basis functions learned for speech. I'm not an expert in speech, but a slightly optimistic reading, and I should stress slightly optimistic, is that the basis functions learned by sparse coding correspond roughly to phonemes. Under that interpretation, we can say informally that sparse coding has learned to decompose speech data, very loosely, into something like the phonemes that appear in the speech. And you can again recursively apply this idea, just as we saw earlier, to build higher and higher level features. The TIMIT benchmark is a dataset many speech researchers work on; it's one of those datasets where if you do 0.1 percent better you write a paper, and a few years ago Honglak was able to make what worked out to something like two-thirds of a decade's worth of progress on it. I should mention this chart is outdated; I made it back when Honglak was publishing that paper, and since then Geoff Hinton and others have surpassed it, also using deep learning techniques. As I was preparing this talk, I asked my students to help me put together a chart of recent results where we, or others, hold record benchmark results using deep learning, and there were surprisingly many of them, from Stanford and from other groups. I've worked in machine learning for a long time, and I've never in my life seen one technology knock over benchmarks this quickly; this whole view of deep learning is knocking off benchmarks like nobody's business. There's actually a lot more than fits on one slide; if I put in all the ones I'm aware of, it would be about three slides like this.

So what's left to be done? I know some of you are here because you want to learn how to apply these things, and some of you may even be interested in doing research and writing papers in deep learning and feature learning yourselves. So I want to share, and I'll talk more about this later as well, what I think of as one of many promising directions in which to take deep learning research, and that is scaling up. How do we build effective deep learning algorithms; how do we get them to work well? In fact, how do you build effective machine learning algorithms at all? Let's go back in history. About twenty years ago there were heated debates about the different supervised learning algorithms, no feature learning, just supervised learning, all these debates about whether this algorithm or that one was better. Michele Banko and Eric Brill did one of the studies that most influenced my thinking.
They took maybe four of the state-of-the-art learning algorithms of the day, and I guess back in 2001 SVMs were not yet popular, so they didn't actually study SVMs, took a natural language processing task on which they had an effectively unlimited source of labeled data, and trained the four algorithms. Plotted on the x-axis is the training set size, and on the y-axis is the performance, the accuracy. All the algorithms do about the same as the amount of data grows, and a supposedly superior algorithm will often lose to a supposedly inferior one if only you can give the inferior algorithm more data to train on. I think it's results like these that led to the maxim in machine learning that often it's not who has the best algorithm that wins, it's who has the most data. And I definitely see this over and over in Silicon Valley: if you look at the most commercially successful websites, the ones making large amounts of money that you use every day, many of those algorithms are incredibly simple, things like logistic regression, but the secret is that they are fit to far more data than anyone else has.

So that's supervised learning; how about unsupervised learning? Adam Coates, who helped prepare this handout, did an interesting study about a year and a half ago, where he took a bunch of the unsupervised feature learning algorithms of the day, the ones people like us debate about, is my algorithm better, is yours better, ran all of them, and varied the model size. For unsupervised feature learning all of us have a large amount of data: if you're learning from unlabeled natural images, you have an effectively infinite amount of data. So the parameter to vary is not the amount of data but the size of the model, meaning how many features you learn. In the earlier example we had 64 coefficients a_1 through a_64 for sparse coding; make that bigger, learn a thousand features instead, or ten thousand, a much larger number of features. What Adam found was that the algorithm does matter, maybe more than in supervised learning, because these algorithms are less mature, but there is a clear pattern: the bigger the model, the better it does.

In fact, one interesting historical aside. On CIFAR we went back and traced out sequences of papers, since we all like to publish papers saying my algorithm is better than yours: person A publishes a result, person B publishes a paper saying they did better, person C does even better still, each with a new idea. We traced a couple of these sequences of supposed advances on the benchmark, and we believe that a lot of that supposed progress actually came about because the models got bigger. It's not that my algorithm is really better; it's that, thanks to Moore's law, I had more compute, so I trained a bigger model, and then I wrote a paper saying my algorithm is better. In the work I've done, the one most reliable way to get better results has been to train a bigger model.
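[Editor's note: a toy sketch of the two kinds of scaling experiments described above, sweeping the amount of labeled data and the number of learned features and watching accuracy. The dataset (scikit-learn's digits) and every setting here are illustrative choices, not the Banko-Brill or Coates experiments.]

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (1) More labeled data: same simple classifier, growing training set.
for n in (100, 300, 1000):
    clf = LogisticRegression(max_iter=2000).fit(X_train[:n], y_train[:n])
    print(n, "labeled examples -> accuracy", round(clf.score(X_test, y_test), 3))

# (2) Bigger unsupervised model: more learned features from the same data.
for k in (16, 64, 256):
    feats = MiniBatchDictionaryLearning(n_components=k, alpha=1.0, random_state=0)
    F_train = feats.fit(X_train).transform(X_train)
    F_test = feats.transform(X_test)
    clf = LogisticRegression(max_iter=2000).fit(F_train, y_train)
    print(k, "learned features -> accuracy", round(clf.score(F_test, y_test), 3))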
If I change the algorithm, sometimes it helps, sometimes it doesn't. In fact, looking at the literature, I feel like a lot of work by a lot of different research groups has in some ways been about getting these models to train bigger. In this world of unsupervised feature learning, where all of us have an effectively infinite amount of data, we're not limited by what data we have; we're much more limited by our ability to process the infinite amount of data that all of us have. So there have been many attempts: more efficient algorithms, parallelization, Yann's very cool work on FPGA implementations, and I think I'll take credit for bringing GPUs to the deep learning world, and so on, a lot of work like this. Looking at this chart, my personal interpretation, which others will disagree with, is that those results were achieved in very large part by addressing scalability; but again, that's my personal interpretation, which others may dispute.

Could you do questions 7 and 8 on the handout? For question 7, which ones did you check off? Two and three; cool, I'll take your word for it. And for question 8, my answer was everything except DNA computing. I thought of putting quantum computing in there too, but I think someone actually is working on quantum computing, yes.

All right, cool. There's something else I could talk about, but I'll save that for the end. So, to wrap up this piece: I've talked about the high-level vision of learning our features rather than manually designing them. But for me this isn't just about machine learning anymore. The question is whether we can really learn something about AI, especially perceptual AI. AI and human intelligence are very broad, but I think we're starting to get a handle on the perceptual part of AI, which is maybe 40 to 60 percent of many animal brains, so a big part of the brain. With that, let me say thank you for your attention and for your patience with these exercises; I hope it was somewhat fun. Let's break, and in the next couple of sessions we'll dive slightly deeper into the technical details, go from the basics, talk about neural networks, and build up the algorithms. I also want to point out that if, after today, you want to go deeper, there's a highly technical tutorial with exercises and everything; the URL is up there and is also given at the bottom left of the handout, so you can check that out after today.
Info
Channel: Xin Huang
Views: 437,902
Rating: 4.9447384 out of 5
Keywords: MachineLearning, AndrewNg
Id: n1ViNeWhC24
Channel Id: undefined
Length: 45min 47sec (2747 seconds)
Published: Mon May 13 2013