Ali Ghodsi, Deep Learning, Oct 12, 2017, [Lect 6.1]

Captions
Okay, in the last lecture we talked about RNNs, and if you recall we talked about gated RNNs, which is what we actually use in practice. I also told you that it's not intuitive at all: it's not intuitively clear how and why people came up with a model with so many gates. It is based on many experiments with this type of model, and on the different kinds of scenarios where you need memory to stay around or you need to forget part of the past, and so on; that is how they came up with these gated models. There are many of them. The most common one, and also the oldest one, is the LSTM, and the LSTM is not one single model: there are many different variations of it. I told you at least twenty were counted in a survey by the first author of the LSTM, up to 2015 I believe. The GRU is a more restricted type of gated model that came after the LSTM, and it also has many different variations.

In practice, when you are building a model, you treat the LSTM or GRU as a black box that has two inputs: one input is your observation x, and the other input is the hidden unit of the previous step; then you get h_t as the output. If you need many layers of them, you just stack them together. Inside each of them is what we explained, and in modern packages you don't need to worry about the derivatives and backpropagation, because they take care of that automatically.

Okay, we talked about vanishing gradients a lot, but we didn't talk about exploding gradients; this is another problem. The problem of vanishing gradients is more severe, so there are many more solutions to handle that. The problem with exploding gradients is that, as I mentioned the other day, if you have a very nonlinear function and you are searching the space, it's quite possible that at some points you have a huge gradient. Even if your learning step is very small, you can end up in a scenario like this: there is something like a cliff, your learning step is small, and you are pretty close to the result, but all of a sudden you face this cliff, which has a large gradient. If you just follow the tangent there, it is going to shoot you to the other side of the search space, so you end up pretty far from the optimum when you were one step away, and you have to search the space again and again.

That is the typical problem of exploding gradients. The most common solution to this problem is quite hacky in terms of mathematics; you can't give a good justification for it. The most common solution is clipping the gradient, and clipping the gradient means: if your gradient is 1000, assume it's 2. That's basically what it is: no matter what, your gradient gets clipped, so you put an upper bound on your gradient, and if it's larger than some threshold you just set it to that threshold. And not only in RNNs: in many other cases people practice this trick as well. The trick itself has many different variations too: clip the values, or normalize and clip the norm, before or after each iteration. Anyway, I can't justify this mathematically; it's just a hack that people use to get rid of this problem.
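To make the black-box picture and the clipping trick concrete, here is a minimal PyTorch sketch; it is not from the lecture, and the layer sizes and the max_norm value are illustrative assumptions. A stacked LSTM is used purely through its inputs and outputs, autograd handles backpropagation through the gates, and the gradient norm is capped before the parameter update so a cliff cannot throw the parameters across the search space.

```python
# Minimal sketch (illustrative assumptions, not the lecture's code):
# an LSTM treated as a black box, plus gradient clipping before the update.
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, n_in, n_hidden, n_out = 4, 10, 8, 16, 3

# Two stacked LSTM layers; backprop through the gates is handled by autograd.
lstm = nn.LSTM(input_size=n_in, hidden_size=n_hidden, num_layers=2, batch_first=True)
readout = nn.Linear(n_hidden, n_out)
params = list(lstm.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(batch, seq_len, n_in)      # toy input sequences
y = torch.randint(0, n_out, (batch,))      # toy target per sequence

outputs, (h_n, c_n) = lstm(x)              # h_n: final hidden state of each layer
loss = loss_fn(readout(h_n[-1]), y)

optimizer.zero_grad()
loss.backward()
# Gradient clipping: put an upper bound on the gradient norm.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
```

`torch.nn.utils.clip_grad_value_` is the element-wise variant of the same trick; which variant to use, and what threshold, is exactly the kind of heuristic choice mentioned above.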
RNNs, as we discussed before, are useful when you want to model temporal data, when your data is a sequence. This is quite common in natural language processing, and many problems in natural language processing these days are tackled through what are called sequence-to-sequence models. A sequence-to-sequence model, in its most basic form, is two LSTMs: one LSTM is your encoder and the other one is your decoder. Imagine a natural language processing task, say translation from English to French. Then you are going to have an encoder and a decoder, and each of these units is an LSTM. The input of the encoder LSTM is an English sentence; it's an RNN, so it can take care of this temporal sequence. Say the sentence is "this is a book": the first word goes into the LSTM, then the second word, and so on. The goal is that the LSTM eventually produces a fixed-length vector for any sentence, no matter what the length of the sentence is. Whether it's a sentence with three words or a sentence with a hundred words, you can map it to a fixed-length vector at the end of this LSTM as the output.

Okay, so you make this the input to the decoder, but the decoder is supposed to produce a sequence. So again you have an end-to-end model, which is common in neural networks: you feed the input and you take the output. Your input is an English sentence and your output is a French sentence. What's happening in the middle is that the English sentence is translated to a fixed-length vector, and this fixed-length vector is translated to a new sequence of words, which happens to be French because you trained it on French.

There are many other models like this. For example, correction: you have a grammatical error in a sentence and you want to correct it. There are datasets where you have sentences with grammatical errors and the corresponding correct sentences, and in this correction task there are a couple of operators, basically deletion and insertion: you have to delete this word and insert that word. So a finite number of operators are applied to each word of the sentence and give you a new sentence. Again, end to end, it's a sequence-to-sequence model. Translation, grammatical correction, text generation: you have a sequence of words as the input of your LSTM and you want to predict the next word, then you add this next word to the sequence, that becomes your new sequence, and then you predict the word after that. So by nature most of these natural language processing models have the same or similar structure, and similar concepts behind them.

[In response to a question] No, it's not compressing; that's the way you train it. First of all, this is the simplest model introduced in this paper, what I explained; there are things that have been added to it, like the attention model, which we will talk about later in the course if we have time. But it's basically how you train your model: you decide that each sentence is going to be mapped into this vector space, each sentence is going to be mapped to this d-dimensional vector. What's important is that the concept, the meaning basically, should be carried over to the new sentence.
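As a rough illustration of the encoder/decoder idea described above, here is a minimal PyTorch sketch. It is not the exact model from the paper; the vocabulary sizes, embedding size, and single-layer LSTMs are assumptions made only to keep the example small.

```python
# Minimal encoder-decoder sketch (illustrative): one LSTM encodes the source
# sentence into a fixed-length vector (its final hidden state); a second LSTM
# decodes that vector into the target sequence.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 32, 64   # toy sizes (assumptions)

src_embed = nn.Embedding(SRC_VOCAB, EMB)
tgt_embed = nn.Embedding(TGT_VOCAB, EMB)
encoder = nn.LSTM(EMB, HID, batch_first=True)
decoder = nn.LSTM(EMB, HID, batch_first=True)
project = nn.Linear(HID, TGT_VOCAB)                   # scores over target words

src = torch.randint(0, SRC_VOCAB, (1, 7))             # e.g. "this is a book ..." as ids
tgt_in = torch.randint(0, TGT_VOCAB, (1, 9))          # target sentence, shifted by one

# Encoder: the whole source sentence becomes one fixed-length summary (h, c),
# whether it has 3 words or 100 words.
_, (h, c) = encoder(src_embed(src))

# Decoder: starts from that summary and produces a distribution over the next
# target word at every step (teacher forcing during training).
dec_out, _ = decoder(tgt_embed(tgt_in), (h, c))
logits = project(dec_out)                             # shape (1, 9, TGT_VOCAB)
```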
You can think of it like this: you think about a concept and then you express the concept. If you know two different languages, your thought is the same, and then you decide to express it in English or to express it in French, but what you have in mind is a hidden space which can be mapped to both of these two languages, and that is the same. It's as if we are capturing, or computing, that hidden space which you have in your mind. Anything in that hidden space will be mapped to French through the decoder, and anything that is in English will be mapped to what you have in mind through the encoder.

[In response to a question] It doesn't matter, no. In different languages you may be able to express a ten-word sentence in three words; there is no restriction that the lengths have to be the same or similar. But as usual there are many, many tricks involved here to get these models to work; what I'm explaining to you is the basic concept. Before attention models, people used to feed the sentence to the encoder in reversed word order, because by the time you get to the last word you are going to forget what the first word was, and changing the order helps. Then there is the attention model, which we will explain if we have time, and which fixes this. There are many heuristics, hacks, and tricks to make these models work, but the basic idea is: map everything to the space of concepts, and from the concepts materialize them as a sentence in a different language.

About the length: in practice there will be an upper bound, because you are training with a training set that has a maximum sentence length. And if you train it with very, very long sentences, I'm sure that without the attention model, or without this type of trick, it is not going to work, because when you get to the end you have forgotten what the first part of the sentence was. Okay, so this is basically the sequence-to-sequence model that you are going to use in natural language processing, in chatbots for example. A chatbot is almost the same: you have the input and you have the answer to it. Question answering: you have a question and you have to answer it. The basis of all of them is pretty similar.

[In response to a question] Think of this as a conditional probability. Say your English sequence is x and your French sequence is y, and think of them as y1, y2, y3 and x1, x2, x3. Now think that you want to maximize the probability of y given x. At each step, in practice, you shouldn't wait until the sentence is complete and then start to translate it. Suppose that I have one additional word: what should I add to y to make the posterior probability of y given x maximum among all possible words in the French vocabulary? And it's not just words; it could be a period, or it could be a question mark. These are also in the set of possible output tokens, so naturally a sentence will end somewhere and another word comes after. So don't think that in practice a complete sentence has to be fed and then the whole translation comes out; in practice you can look at it as a sequence.
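The word-by-word view of maximizing the probability of y given x can be sketched as a greedy decoding loop that continues the toy encoder/decoder above; the BOS/EOS token ids and the maximum length are assumptions for illustration, and at each step the single most probable next token is appended, exactly as described.

```python
# Greedy decoding sketch (illustrative), reusing src_embed, tgt_embed, encoder,
# decoder, project, and src from the previous snippet: at each step pick the
# token that maximizes p(y_t | y_<t, x), append it, and stop when the
# end-of-sentence token (e.g. a period or question mark) is emitted.
import torch

BOS, EOS, MAX_LEN = 1, 2, 20                          # assumed special token ids

with torch.no_grad():
    _, state = encoder(src_embed(src))                # fixed-length summary of x
    token = torch.tensor([[BOS]])
    translation = []
    for _ in range(MAX_LEN):
        dec_out, state = decoder(tgt_embed(token), state)
        token = project(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
        if token.item() == EOS:                       # the sentence ends naturally
            break
        translation.append(token.item())
```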
Info
Channel: Data Science Courses
Views: 3,176
Rating: 4.7777777 out of 5
Id: KCm9OdxqGk4
Length: 15min 56sec (956 seconds)
Published: Sat Oct 14 2017