Ali Ghodsi, Lec [1,1]: Deep Learning, Introduction

Video Statistics and Information

Captions
Okay, welcome to the deep learning course. Today I'll try to give you a fairly big picture of deep learning and of the material we're going to cover this semester. The first half of today's class is mainly motivation and the history of deep learning, and I'm also going to talk a little bit about course administration, marking, and so on. Part of the course administration really depends on how many students register. Honestly, I didn't expect this many people to come to the first lecture, and I don't know how many of you are going to register, how many will audit, or just sit in. So toward the end of the lecture I'm going to ask how many of you plan to register officially, and based on that we may adjust the course administration and marking. Because it's a grad course I don't have a TA, so with this many people I can't set assignments that would require marking so many submissions; grad courses are usually much smaller than this. We'll decide.

So, deep learning. You can think of deep learning as different layers of computational units: algorithms called deep learning usually have several layers of computational units, and the goal is that each layer extracts, or learns, one level of abstraction. If I'm standing here and you want to describe me, you might say there is a person here; "person" is one level of abstraction, "standing" is another, "in front of this screen" is another. If you have a picture of me standing in front of the screen and you're explaining that picture to someone, you use different abstractions: person, standing, screen. These are all abstractions, because in the real world there is no generic "screen"; there is this screen, and another screen, and another. But we, as intelligent agents, make an abstraction over all of those objects and call it a screen, just as "book" is an abstraction over millions of objects in the world. The abstraction itself doesn't exist in the world; it's something we created. The claim is that a deep network is a way to create, or to learn, these abstractions at different levels.

That claim is quite controversial. Many people do not believe that's really what a deep network does. Many believe a deep network is just a very flexible function that we recently learned how to fit to data without overfitting. Some believe a deep network is really a huge but smooth memory, a smooth lookup table: we memorize the whole internet and then claim we have learned everything. So it's pretty controversial; there are people for and against it. At this point we don't want to get into that controversy, but later on, when we ask why these big models work, you may want to know a little about it.

What I'm going to do first, as motivation, is show you some success stories. One of the reasons many of you are here is that you have heard a lot about deep networks; the success stories are everywhere. So I'm going to show you some of these stories and demos.
It was just in the news a couple of days ago that some researchers built a new chess model that, for the first time, taught itself how to play chess. We've been able to make machines play chess very well for a long time, but usually through brute force: you have a game tree, at each stage you have to enumerate the possible actions, and we have to program all of that. This model doesn't need any of those steps; according to the claim, it learns chess by itself in 72 hours. The model plays chess against itself, learns the rules, and can play at a very professional level. It was also in the news a couple of months ago that some researchers, with a marriage of reinforcement learning and deep learning, were able to create models that learned twenty-something Atari games by themselves: you don't tell the model what the rules of the game are; it just plays Atari and learns.

I don't know if you have heard of word2vec or not, but it was an interesting paper in 2013 and got a lot of attention. You can represent words using vectors. When you do machine learning you need vector representations of any object, including text, and we did this long before word2vec, from the beginning of natural language processing. The simplest thing you can do is bag of words: say you have ten different documents; you just count the frequency of each word in each document, and then you have a vector. Word2vec instead uses a neural network, not a very deep one, to find a representation of words: it projects words into a new vector space, and these vectors have pretty interesting properties. One example: if you take the vector representing the word "king", subtract the vector representing "man", and add the vector representing "woman", you get the vector representing "queen". A king who is not a man but a woman is a queen. Very interesting relationships can be learned from these vectors: the distance between "Paris" and "France" is about the same as the distance between "Ottawa" and "Canada", and the distance between the CEO of a company and the name of the company is roughly the same for almost all the important companies. So you can even infer new knowledge and information from these vectors.

It turns out this works not only for words: you can have a joint representation of words and images. Images can be represented as vectors and words can be represented as vectors in the same space, such that, for example, if you take the vector of an image taken during the day, subtract the vector for "day", and add the vector for "night", you get an image of the same scene at night. Or take the vector of an airplane, subtract "flying", add "sailing", and you get one of those ships. More interestingly, take the image of a cat in a box, subtract "box", add "bowl", and you get an image of a cat in a bowl.
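Going back to the bag-of-words representation mentioned above, here is a minimal sketch of the idea in plain Python. The toy "documents" are made up for illustration and are not from the lecture:

```python
from collections import Counter

# Made-up toy corpus of three tiny documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Build a shared vocabulary, then represent each document as a vector of word counts.
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]   # one count per vocabulary word

for d in docs:
    print(bow_vector(d))
```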
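And here is a sketch of the "king - man + woman ≈ queen" analogy arithmetic. It assumes you already have pretrained word vectors loaded into a dict mapping each word to a NumPy array (the loading step is not shown, and the function names are illustrative, not from the lecture):

```python
import numpy as np

def nearest_word(word_vectors, query, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    best_word, best_sim = None, -1.0
    for word, vec in word_vectors.items():
        if word in exclude:
            continue
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

def analogy(word_vectors, a, b, c):
    """Solve 'a is to b as c is to ?' via vector arithmetic: b - a + c."""
    query = word_vectors[b] - word_vectors[a] + word_vectors[c]
    return nearest_word(word_vectors, query, exclude={a, b, c})

# With good pretrained embeddings this typically returns "queen":
# print(analogy(word_vectors, "man", "king", "woman"))
```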
There is a paper which shows how to generate captions for images; the captions are generated automatically. One caption says "a woman is throwing a frisbee in a park": the input to the model is just the picture, and the caption is generated by the model, so the model basically explains what the image is. We have seen in the past that you can represent words in a lower-dimensional space such that neighbouring points are semantically related, as in the plot on the left; on the right you can see that vector representations of sentences have the same sort of property: nearby sentences are semantically related. There is a vector representing "a few days", another representing "in a few months", another representing "a few months ago", and they are all close together. You can imagine all sorts of applications in natural language processing when you have this type of representation.

Many companies now use neural networks; it's not just research. PayPal, which you use every day, uses neural networks to prevent fraud, for example. There are startup companies using deep neural networks on X-ray and MRI images to diagnose types of disease that people were not able to detect before. There is an API anyone can use, and you can see that very similar-looking images are distinguished correctly by the deep network. The precision of object recognition has increased significantly recently using deep networks, and the performance of speech recognition has improved by more than 20 percent using deep networks. You have most likely noticed the difference between speech recognition in Google Now or Siri and the speech recognition of a couple of years ago. We used to call Rogers, which has an automated speech system that tries to direct calls, and every time I call I get crazy: you say something and it understands something completely different. Our voice recognition at the university is the same: you call and say "I want to talk to Ali Ghodsi" and it connects you to someone else entirely. Those are old generations of speech recognition, based only on hidden Markov models; the new generation, which you see in Google Now or Siri, uses deep networks, performs more than twenty percent better than before, and really works.

Here is quite an interesting paper, published in 2014 at CVPR: you can recover voice from video of vibrations. Let me show you this video. [Video narration:] "In our next experiment we have recovered live human speech from high-speed video of a bag of chips lying on the ground, but to make things a little more challenging, this time we put the camera outside, behind a soundproof window. This is what a cell phone was able to record from inside, next to the bag of chips, and this is what we were able to recover from high-speed video filmed from outside, behind soundproof glass. In this next experiment we recover music from high-speed video of some earbuds plugged into a laptop computer; then we take our recovered sound and use audio recognition software to automatically identify the song that was being played. Most frequencies of audible sound are much higher than the frame rates of standard video.
All the results we've looked at so far were recovered from video captured by a high-speed camera, which can record thousands of frames per second. But in this next experiment we show that by taking advantage of artifacts caused by the rolling shutter in most consumer cameras, we can sometimes recover sound at frequencies several times higher than the frame rate of our video, letting us recover audio from video captured on regular consumer cameras. Here we see 60-frames-per-second video of a bag of candy captured with a regular consumer DSLR while 'Mary Had a Little Lamb' played through a nearby loudspeaker. By using a variation of our technique on the rows of the recorded video, we were able to recover this audio, which includes frequencies more than five times higher than the frame rate of our camera."

Okay. After the lecture you can look at this page; there are quite a few interesting demos of deep learning. Let me show you just one or two of them. Here, you type "this is a course in deep learning" and it generates the text in different handwritten styles; you can also train it on your own handwriting. My son, who is thirteen, was trying for a long time to type his homework and make it look handwritten, because his handwriting is rough; it could help in a case like that.

This is one of the first papers in deep learning, from Geoffrey Hinton. Interestingly, it's a generative model: it models a very complicated distribution such that you can sample from it and generate new handwritten digits, or new faces when you train it on faces. Say I want to sample a "4": you can see that these fours are generated by the model. It's quite impressive that you're sampling from a distribution, sampling each pixel, but in a way that looks like a real four written by a person. You can sample faces too: you have a collection of pictures of a person, and you sample from that distribution as if it were a picture of that person, but a picture that has never actually been taken; it's just one sample from the distribution. It's quite impressive that you can model this.

So those were some success stories and demos of deep learning. What are we going to do in this course? These are the topics we'll cover. We'll talk about feed-forward neural networks and deep networks: the feed-forward neural network was maybe the first type of neural network, quite popular in the 1980s, and now we use deep versions of those networks. It's quite tricky to optimize and train these deep models, so we'll talk a little about optimization for them. Convolutional networks: many of the success stories you hear these days, especially in object recognition, are about convolutional networks. Convolutional networks are also old models by themselves, but now we have deep versions of them and the computational power for them that we didn't have in the past. Then there are cases where your data is not i.i.d.: the data is temporal and the order of the data matters, as when you're talking about dynamical models, or modeling language, where the sequence of words is important.
When we want to model data where the sequence is important, we have recurrent neural networks and recursive neural networks, and we'll see those in this course. We're going to see autoencoders as well, which are a way of doing dimensionality reduction, or of finding a new representation of the data. And we'll see the restricted Boltzmann machine and the deep Boltzmann machine, which was basically the first deep model, proposed in 2006.

I'll give you a detailed history of neural networks in a few minutes, but the short story is this: at some point the perceptron was invented, and the perceptron led to neural networks; neural networks were popular for a while in the 1980s, and then people lost interest in them for about twenty years. I remember when NIPS, a prominent machine learning conference, used to unofficially announce the most frequent words in the titles of rejected papers, and for two years in a row the most frequent words in the titles of rejected papers were "neural network". So after support vector machines it was not popular, until 2006, when Geoffrey Hinton published a paper in Science that was the first deep network model. Neural networks, as you'll see in a few minutes, were pretty difficult to train at first, and then people realized they could be trained using back-propagation. But it was generally believed that back-propagation could only be used for shallow networks, with one or two layers, not more; if the network is deep, you can't use back-propagation to train it. In 2006, using restricted Boltzmann machines, Hinton showed how to train deep networks. Later it became clear that you can use back-propagation for deep networks after all; many deep network models now don't use restricted Boltzmann machines at all, they just use back-propagation for training. So the field became popular again because of these successes.

We're going to put some emphasis on deep learning for natural language processing, because my understanding of the area is that what can be done in speech recognition has largely been done: speech recognition has more or less reached its cap using deep networks, and the same is roughly true of object recognition. The area with a lot of room, many open questions, and much open research is natural language processing, so we'll emphasize that more.

In terms of marking, we're going to have a group project, in groups of up to three, which is 50% of the course mark, and we're going to have paper presentations and paper critiques. For the paper presentations, my plan is to teach about half of the course, roughly six weeks, covering the basics of all of these techniques, and the second part of the course will be paper presentations by those registered in the course; that will count for twenty percent of your mark. For the paper critiques, I don't know how many of you have seen the wiki course notes I created a couple of years ago; it's based on MediaWiki, like Wikipedia, so you can collectively create or change a text. We're going to use that for paper critiques in a couple of ways; I'll give you the details later.
One way: when there is a presentation in class, say person A is supposed to present a paper; other people need to write a summary of that paper and contribute to that summary before the presentation. There is also another type of contribution, where you write your own critique or summary of a paper and put it on the wiki course notes, with one main contributor and other people as secondary contributors. Through the history of the wiki course notes I can see who has done what, which contribution comes from which student, and 30% of your mark will be based on that. Again, depending on the number of students, the project component could include some mini projects, Kaggle-style projects; as I said, I'm going to ask soon how many of you are really going to take the course.

Some technical details about administration: all communication in this course will be through Piazza, and those who have already registered for the course have received an email to complete their registration on Piazza. So instead of personal email we're going to use Piazza unless it's really necessary. For the final project, as is common in graduate courses, I have to mention that you cannot use part of your thesis as your final project, and you can't use a project that you are doing for another course as a project here. Any questions about what I've said so far? Yes?

[In response to a question:] The project is fairly open-ended; there are different types of projects people can do. One type is that you have a completely new idea and you want to come up with a new algorithm; you think you can come up with a new algorithm. That could be a very good type of project, and in this case a negative result will not penalize you: I'm going to look at whether your approach and the steps you took make sense, so if it turns out that the model you build doesn't work at all, or is the worst model ever, you're not going to lose marks as long as you did something sensible during the term to build it. Another type of project is that you're not creating a new model; you're applying existing models to your research area or to an application. It could be your research area, a general application, or a result reported in another paper that is important and interesting and that you want to reproduce; that could also be a project in this course. When it's an application, it's important not to choose something trivial. You could choose digit recognition as an application of deep networks, for example, but you can implement that in one afternoon and get 98% accuracy, so it should be something challenging that lets you learn something. It could also be a Kaggle-style project: there are challenges posted on Kaggle, and you can choose one of them. In fact there is a competition on Kaggle right now whose timing actually fits our course: it was posted on the 28th of August and the deadline is the 6th of January. It's a classification problem: there are images and you need to detect right whales in them. It's quite a challenging problem; even as a person it's pretty hard to recognize which whale is which, and I was not able to label them correctly myself. So it's a very challenging problem.
There's another problem on Kaggle about rainfall, and another about crime prediction. Just two days ago, all of Hillary Clinton's emails were released on Kaggle; her personal email was controversial. So there are challenges and datasets there, and you can choose a Kaggle-style project; that would also be a pretty good project for this course. Yes? [Question about the thesis.] No, I don't mean you cannot include this in your thesis; you can include it in your thesis. What I meant was that you can't submit something you have already done in your thesis as your project. It's fine to do something related to your research area and then include it in your thesis. Yes? [Question about a textbook.] It will be mainly papers. There is a draft of one book online, by Yoshua Bengio and colleagues, but it's not complete; if you look at it, many pages still have markers saying a section needs to be written; it's a very early draft. I don't know any good book we can use as a reference, so the course will mainly be based on papers. This area is moving quite fast. There is a very good machine learning book by Murphy, published in 2012 or 2013, quite a new book, with one chapter about deep networks, and if you look at that chapter it's already quite dated after two years, because the area is moving so fast. So we'd better use papers, and I'm going to post a long list of papers you can choose from for presentations. Any other questions?

Okay, maybe I'll just ask: how many of you plan to take the course for credit? I just want to get some sense of the numbers. Can I borrow a few sheets of paper? Now I have a rough idea, but it would help if, on this paper, which I'm going to collect at the end of class, you just write your name and whether you're going to register in this course, audit it, or just sit in. If you're not going to show up next lecture you don't need to write anything, but otherwise that would help. Yes? [Question about presentations.] Presenting papers, not the project; just presenting papers, but part of the presentation, and I'll give you the details, depends on the wiki course notes, so you do have to write something there as well. Yes? [Question about prerequisites.] You are all grad students in related areas, so I expect you have some knowledge of machine learning, calculus, linear algebra, statistics, and probability. I can't point to a single course and say you must have passed it, but I'm definitely not going to go through all the concepts of machine learning in this course; I assume you know the general concepts of machine learning and have knowledge of linear algebra and so on. Any other questions? Yes? [Question about teaching style.] Well, I think you'll get a sense of this in the second half of this lecture, because I'm going to start teaching feed-forward neural networks and talk about back-propagation. You'll get a sense that I usually prefer to go through the details of an algorithm rather than just talking about the big picture; I prefer to cover fewer algorithms in depth rather than many algorithms superficially.

Okay, so a little bit of history of deep networks. The first attempt to make a neuron.
It's still pretty controversial whether a neural network, or a deep network, is a model of the brain or not. Personally, I don't believe it is; I think it's quite different from us. You can teach a kid the difference between a table and a chair with one example: this is a chair, this is a table, one example and they learn it. You can teach them the difference between the digits two and three with one example. But when you want to teach a neural network or deep network that difference, you need a million examples. So I really don't see a strong reason to believe it's a model of the brain, or maybe it's a very simple model; I don't know enough biology to judge. Anyway, it's controversial, but the story starts with the attempts of some researchers to mimic the brain and to mimic the neuron: how can we make a neuron?

The first attempt was in 1943. It's interesting to note the backgrounds of the people who contributed to this: psychology and logic, not just computer scientists and statisticians. What they made as a neuron in 1943 is what we would call today logical gates, or circuits: basically an electronic device that can simulate one of the logical operations AND, OR, NOT. That was 1943. The next attempt to make a neuron was in 1958, by Rosenblatt, who invented the perceptron. The perceptron is not a logical gate. Given a data point, it computes a weighted sum of its features: each feature, or variable, of the data point is multiplied by a weight, and then the sign function is applied to this weighted sum. If the result of the weighted sum is negative, the output after the sign function is minus one; otherwise it is plus one. They wanted to use this as a classifier: assume you have two classes, labelled negative and positive; the perceptron was supposed to do this classification, mapping points to either this class or that class.

When the perceptron was invented, it was not at all clear what the perceptron does; the way we understand it now was not understood at the time. They didn't know it's a linear classifier; the general belief was that they had solved artificial intelligence. At that time we didn't even know how to train a perceptron: many of the first attempts just chose the weights randomly until it worked. It took a while until we understood how to train them. This is from the New York Times in 1958, based on an interview with Rosenblatt: the belief was that they had created the embryo of an electronic computer that they expected would be able to walk, talk, see, write, and even reproduce itself and be conscious of its existence. So the belief was that they had solved artificial intelligence and the perceptron was basically going to do it; there were big dreams about the perceptron in the news, that these kinds of robots and machines were going to appear soon.

If you compare these two attempts at making a neuron, the first, according to some biologists, is closer to what happens in real neurons in the brain. But neural networks are based on the second attempt: they are based on the perceptron, not on logical gates.
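A minimal sketch of the perceptron just described — a weighted sum of the features passed through the sign function — together with the classic error-driven update rule that came a bit later (mentioned below). The dataset, learning rate, and function names here are made up for illustration:

```python
import numpy as np

def sign(z):
    return 1 if z >= 0 else -1

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Learn weights w and bias b so that sign(w.x + b) matches the +/-1 labels y."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if sign(np.dot(w, xi) + b) != yi:   # misclassified: nudge the boundary
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Made-up, linearly separable toy data: two clusters labelled +1 and -1.
X = np.array([[2.0, 3.0], [1.0, 4.0], [2.5, 3.5],
              [-1.0, -2.0], [-2.0, -1.5], [-1.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w, b = perceptron_train(X, y)
print([sign(np.dot(w, xi) + b) for xi in X])  # should reproduce the labels
```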
So the perceptron is the building block of neural networks, and also of deep networks. As I told you, it was invented in 1958, but it was in 1960, two years later, that researchers showed how to train perceptrons in a more principled way than setting the weights randomly. The dream that the perceptron had solved artificial intelligence was a common belief from 1958 until 1969. In 1969 a book was published, called "Perceptrons", and this book shattered the whole dream: it basically showed that the perceptron is just a linear classifier. It was quite surprising when they showed that the perceptron cannot solve the XOR problem: if you have one class arranged like this and the other class arranged like this, the so-called XOR problem, then with one line you cannot separate them, so the perceptron cannot solve it. People were very surprised, because they were not aware that it's just a linear classifier. It was quite an influential book, but it also propagated some misconceptions afterwards: some people, and you can still see this claimed in papers, say the book showed that even multiple layers of perceptrons cannot solve this problem. That is not in the book; the book makes no such claim. But it became a common belief that even with multiple layers of perceptrons you can't do this. So the book changed perceptions of the perceptron quite significantly.
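A small illustration of this point, with hand-set, made-up weights: a single sign/threshold unit cannot represent XOR, since no single line separates the two classes, but two layers of the same threshold units can, for example by combining an OR unit and a NAND unit with an AND unit:

```python
def step(z):
    # Threshold unit: outputs 1 when the weighted sum is non-negative, else 0.
    return 1 if z >= 0 else 0

def xor_two_layer(x1, x2):
    """XOR built from two layers of threshold units (illustrative, hand-set weights)."""
    h1 = step(x1 + x2 - 0.5)     # hidden unit 1: OR(x1, x2)
    h2 = step(-x1 - x2 + 1.5)    # hidden unit 2: NAND(x1, x2)
    return step(h1 + h2 - 1.5)   # output: AND(h1, h2) = XOR(x1, x2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_two_layer(a, b))   # prints 0, 1, 1, 0
```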
Still, many researchers tried to use multiple layers of perceptrons, but it was not clear how to train them. By that time we knew how to train a single perceptron, but not how to adjust the weights of multiple layers in an efficient and systematic way. Starting from 1969, many people invented and reinvented the back-propagation algorithm independently. If you look at the literature, there are many people who claim they came up with back-propagation, and they are right, but they did it independently of each other, in different years. Besides the well-known claims, there are claims that it appeared in the Russian literature well before that, and claims that it was in the French literature before that. In any case, from 1969 onward people invented and reinvented back-propagation and started applying it to multi-layer perceptrons, which is what we also call a neural network, and it was quite successful in many applications. So neural networks became quite important in the 1980s and were quite successful, until, as I told you, after the invention of support vector machines and some other techniques, the neural network was no longer the favourite model.

There were many reasons for that. As we'll soon see, a neural network doesn't have a convex objective function, and we use gradient descent to train it, so it's quite possible to get stuck in a local minimum; that was one issue at the time. The other problem was that we didn't know how to train networks with many layers, and here is why that was a problem. With a multi-layer perceptron, or a neural network, you can increase the flexibility of the model by increasing the number of nodes in one layer: add many nodes to one layer and you increase the flexibility, or capacity, of your model. And in the end there is a theorem, which those familiar with the history of neural networks will know, which proves that a neural network is a universal function approximator, meaning it can fit anything. But if you look at the details of the proof, it basically means a neural network can overfit anything; in that sense a lookup table is also a universal function approximator. You can learn your training data perfectly, but that doesn't mean you can generalize. To generalize, you need many layers, not many nodes in one layer, and it was not clear how to train networks with many layers. That was one reason. Another reason was that at the time we didn't have enough data to avoid overfitting; the amount of data was limited, and we didn't have the right computational power. In one of his talks Geoffrey Hinton says there was a historical mistake: neural networks were invented before support vector machines, at a time when we didn't have the right computational power or enough data; they should have come later. Now we can avoid overfitting, because although we have models with millions of parameters, we also have on the order of hundreds of millions of data points, and we have GPUs, the right computational power. So starting in 2006 it became popular again.

If you look at the 2006 paper by Hinton, he trains the model in an unsupervised manner using restricted Boltzmann machines, and he mentions three disadvantages, three limitations, of back-propagation that led him to invent a new technique for training deep networks. According to him, the three main limitations are these. First, it requires labelled training data, which often doesn't exist: you have as much unlabelled data as you want, but not much of it is labelled; go to the internet and there are many images, but no labels, and if you want to distinguish apples from oranges, only a few of them have the right labels. Second, it is very slow with multiple layers. Third, it can converge to poor local minima. Of these three, maybe only the first limitation is still valid: back-propagation needs labelled training data, and it's true that most of the success stories we've heard about deep networks are about supervised learning; in unsupervised settings we haven't had as much success with deep networks. Being slow is not a big issue anymore, because we can use GPUs, which are, say, a hundred times faster than before. And as for converging to poor local minima, it has become clear that in very high-dimensional spaces this is not a big issue: you can argue, even theoretically, that many good local minima exist, and converging to one of them is sufficient.

Okay, so in the next half of the lecture I'm going to start really teaching the course: I'm going to start with the perceptron, and then back-propagation and feed-forward neural networks. Between the two halves we're going to take a 10-15 minute break and then come back and continue the lecture. Let me warn you that if you have seen the perceptron before, it's going to be just the perceptron and back-propagation next, so it may bore some of you who have seen the details of this algorithm. I'm going to go through the details of back-propagation, how it works, and derive all of the formulas.
So if you have seen this before, I may bore you, but I want to make sure, in the first lecture, that we are all on the same page. Let's be back here by 3:15.
Info
Channel: Data Science Courses
Views: 33,931
Rating: 4.9583335 out of 5
Keywords: Machine Learning (Software Genre), Deep learning, Neural Network (Field Of Study)
Id: fyAZszlPphs
Length: 56min 48sec (3408 seconds)
Published: Thu Sep 24 2015