DeepMind x UCL | Deep Learning Lectures | 6/12 | Sequences and Recurrent Networks

Captions
Today I'm going to be talking to you about sequential data and how models deal with this type of data structure. Before I begin: I'm someone who likes people to understand what I'm talking about, so if at some point during the lecture something isn't clear, or you have a question, please raise your hand. This can also be a conversation. I should be getting the message across to you; if that's not happening, it's my mistake and I need to know about it. There are no stupid questions, so literally just raise your hand and ask.

With that out of the way, here is a brief overview of what I'll cover today. I'm going to divide the presentation into three broad sections. We'll start by motivating why we care about sequential data in the first place: why is it an important thing to consider in machine learning? Then, once we're convinced this is something we're interested in, we'll think about how we train models that deal with sequential data: what the challenges are, what the losses are, what we're optimizing, and so on, so that you have the fundamentals. And once we have a trained model from section two, we'll move to section three, where we take that trained model and actually use it to do interesting things like generating sequences. Throughout that section I hope to cover not only different applications but also some of the more recent concepts and methods that have been applied to sequence processing in machine learning.

So let's start straight away by motivating sequential data. A quick recap of what you've looked at so far in this course: first, feed-forward neural networks, which are networks that take in vectors of a fixed size; and you've also talked about convolutional neural networks, which usually deal with images and have convolutions whose weights are shared across the image. In this lecture we want to move away from fixed-size vectors and images towards sequences.

So what even is a sequence? To give you a slightly more formal definition, a sequence is a collection of elements with certain properties. Take the example of an English sentence, which you can think of as a sequence of words. First, a sequence has elements that can repeat: the same word can appear several times in a sentence. Second, in a sequence the order of the elements matters; in a sentence it's very important in what order the words come, and it can mean something completely different if the words are in a different order. And finally, sequences can be of variable length: there can be any number of elements, and this also applies to our example, since English sentences can have almost any number of words in them.

With these properties in mind, if we look at the methods you've seen so far, they're not very good at dealing with data like this. Feed-forward neural networks usually only handle fixed-size inputs, which clashes with our need for variable length, and they don't really take structure into account. Convolutional neural networks can deal with inputs that aren't a fixed size, but the structure they exploit isn't the right one for sequences. So how do we deal with this?
The question is: how can we develop models that actually deal with sequential data? And given that this is hard, you might wonder why we even care about sequences in the first place. My answer is that the sentence "why do we care about sequences" is itself a sequence; as a matter of fact, each of its words is a sequence of letters, and going further, this whole presentation is a sequence of slides. So I hope that convinces you that sequences are literally everywhere. I gave the example of sentences in the English language as sequences of words, but you can think of many other things as sequences: a speech waveform is a sequence; videos are sequences of images; images themselves can be treated as sequences of pixels. There are more complex sequences too: a program is a series of instructions that you execute sequentially, and for those of you interested in reinforcement learning and decision making, that too is a sequence of decisions that your agent has to learn. So it's fair to say that sequences are very universal and span a large variety of tasks, and this is why in machine learning we're interested in being able to deal with them well.

To summarize the motivation, I hope you take away what a sequence is, namely a collection of items where order matters and which has variable length; that sequences are widespread, not only across machine learning but across applications and everyday life; and that the methods we've discussed in this lecture series so far are not enough to deal with them, so we need to come up with methods that take all of this into account. Are we clear up to here? Is the motivation clear? Cool.

Now that everyone's convinced about sequences, let's focus on how we actually do this. To set up a machine learning model, as I'm sure you've seen in previous lectures, you have to decide what your data is, what your loss is, what you optimize, and so on. Let's go through it one piece at a time, starting with a quick revision of something you should already have learned: supervised learning, which is how you would train a normal feed-forward network. The data we normally have in supervised learning are pairs of inputs and outputs, for example images and labels in the classic classification task. You then assume there is some mapping from your inputs to your outputs, meaning a mapping between images and labels that you want to learn, and because this course is about deep learning, we approximate this mapping with a nonlinear function parametrized by theta. In other words, we have a neural network, with parameters we can tune, that learns to approximate the mapping. How do we tune those parameters? We define a loss function: the output of the neural network is compared to the ground-truth label of the image, using whatever distance measure we choose, and that becomes our loss, i.e. how much we penalize the network's prediction. Once we have this loss, we minimize it using standard backpropagation and update the weights of the network as a function of that loss. Is this clear? Good, this should be familiar.
When we think about sequences, this looks a little different. Let me take the example of an English sentence; this applies to any type of sequence, but I find thinking in terms of sentences quite intuitive because we deal with them all day long. We don't necessarily have pairs of inputs and targets; instead we just have a sequence, so x is, for example, our sentence. And what we're trying to model is not a mapping between inputs and outputs but how probable the sequence is, because we want a machine learning model that generates likely sequences, if we're interested in generation, or that can estimate how likely an English sentence is; we don't want it to just generate garbage sentences. So what we're actually using our neural network for is to estimate the probability of that particular sequence. The rest looks fairly similar: you optimize the log probability of the sentence under your model. In this case we take the maximum, because we're dealing with probabilities, but you could also take the negative and minimize it; that would be the only difference. Is this clear?

[Audience question about tagging.] Yes, if you had tags then this would be more of a supervised learning problem, absolutely, and you can always add information and change your loss. We'll talk in a minute about how we actually calculate this probability, but you can always add additional losses. Tagging a particular word would be almost a separate task; it's not so much estimating the probability of a sentence. In that case I would go back to the former loss: the moment you have inputs and outputs, you do something like the supervised setup.

So let's think about what we're trying to do here: we're trying to learn this function that measures the probability of our sequence. How do we go about that? What is a good model for the probability of a sequence? Let's take sentences. Modeling word probabilities is really difficult; if there's one thing I want you to take away from this lecture it's that sentence, so I'll try to sneak it in repeatedly: modeling word probabilities is really difficult. One very naive approach you could take is to apply the simplest possible model and assume that the words are all independent. You look at some corpus that tells you how probable each single word is in the English language, and you say that the probability of the entire sentence is just the probabilities of the individual words multiplied with each other. That's a perfectly valid model if you want: you take the probability of the word at each time step, multiply them together, and it gives you some probability for your sentence.
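To make that independence assumption concrete, here is a minimal sketch in Python (not from the lecture; the tiny corpus and its counts are made up for illustration) that scores a sentence as a product of unigram probabilities:

    from collections import Counter

    # Hypothetical toy corpus; in practice the unigram counts would come from a
    # large collection of English text.
    corpus = "modeling word probabilities is really difficult because the words are not independent".split()
    counts = Counter(corpus)
    total = sum(counts.values())

    def unigram_prob(word):
        # Relative frequency of the word in the corpus (no smoothing, so an
        # unseen word gets probability zero in this toy version).
        return counts[word] / total

    def sentence_prob(sentence):
        # Independence assumption: P(w_1, ..., w_T) = P(w_1) * ... * P(w_T).
        p = 1.0
        for word in sentence.split():
            p *= unigram_prob(word)
        return p

    print(sentence_prob("modeling word probabilities is really difficult"))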
Now, this model, while fairly simple and easy, is not a very good model of language, and the reason is that language has structure. Here is one example that shows why it's a really bad model: if I ask this model what the most likely sentence is, it will tell me "the the the the the", because "the" is the most likely word and that's all it's optimizing for. So the English language is not just independent words thrown around; there is clearly structure in the language, and this model is not capturing it. The question is how we can incorporate that structure into our model.

Let's go to a slightly more complicated model. We take all of the previous words in the sentence, call that our context, and condition the probability of the next word on everything we've seen before. This makes sense: in a sentence, the next word depends on what you've said so far, so it's a fair assumption and definitely much richer than assuming independence. In this particular example, I can give you the first part of the sentence, "modeling word probabilities is really", as the context, which is x_1 to x_{t-1}, and conditioned on this we calculate the probability of x_t, the word at the current time step t, which is our target. Different words get different probabilities: "difficult" would be very likely, because, as you know, modeling word probabilities is really difficult; "hard" is also a good candidate, maybe "fun", and definitely not "easy". So you have a conditional distribution that describes what the next word is. That only models a single word, of course, but if you want the probability of the whole sentence, you first calculate the probability of the first word, multiply it by the probability of the second word given the first, then by the probability of the third word given the first and second, and so on. You keep building up the joint probability from all of the conditionals, and this works well enough and does capture the structure.

But there's a pretty big problem with it. Let me give you some intuition. Imagine we only care about modeling P(second word | first word), the simplest and shortest conditional we can write. Already for a vocabulary of four words, which is not much, that gives us a four-by-four table with the probabilities of all possible combinations of first and second word. The English language has a few more than four words; increase the vocabulary to a hundred or so and the table starts to look pretty big, and English really has more like 10,000 words, and even that is a limited vocabulary. So this table gets huge if we store the probability of every word given every other word, and keep in mind we're only looking at one word given one other word; the contexts in real English sentences are much, much longer. This scales really badly: the number of entries is the vocabulary size, say 10,000, raised to the power of however long your context is, which is more than the number of atoms in the universe for a reasonably long sentence.
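Written out in standard notation (mine, not copied from the slides), the two pieces of this argument are the chain-rule factorization and the size of the table you would need to store it explicitly:

    p(x_1, \dots, x_T) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)\cdots = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})

A lookup table over a vocabulary V and a context of T words needs on the order of |V|^T entries; with |V| = 10{,}000 and T = 20 that is already 10^{80}, which is roughly the estimated number of atoms in the observable universe.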
So it's fair to say this is not a table we can store, work with, or even really approximate with the English-language data that exists out there. While this type of model, where you condition on the context, is useful and captures some of the structure, it's definitely not scalable beyond very short time horizons.

One way people got around this, especially in earlier research, was to say: rather than taking into account a very long context, since we know the cost scales as the vocabulary size to the power of the context length, why don't we fix the window size of that context? These are called n-grams, with n standing for however long your horizon is. Imagine you only care about the two previous words: those would be 2-grams, and what this essentially means is that once you have more than two words in the past, you only keep the last two, so you only model the probability of whatever word you're at given the previous two time steps. The benefit of this type of model is that, because the cost scales with the vocabulary raised to the power of the number of context words, fixing the window keeps it bounded. But, as you can imagine, you lose a lot of information: if I just tell you "is really ___", that tells you almost nothing about what I'm actually asking, and "difficult" would probably not be your first choice of word after just "is really".

So n-grams alleviate the scalability problem a little, but they have two big downsides. The first is the one I just mentioned: they don't take into account words that are more than n words away, so you very quickly lose the context, or, if you take a large context, you're back to the scalability problem. The second is that the data table, even with a reduced context size, is still huge. To give you an impression of how big, there's a blog post from Google from some years ago where they released a dataset of n-grams of size five. That's not very large: five words of context in a normal sentence doesn't give you much of an idea of what's going on. They also only kept 5-grams that appeared at least 40 times on the web, which throws away a lot of n-grams, and despite all of that filtering, the dataset they collected from the internet, built from around a trillion words of text, still contains more than a billion distinct five-word sequences. That's how many entries we're dealing with when we actually build n-gram tables, and this is only n equal to five. So this doesn't scale well either.
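As a concrete illustration (again a toy Python sketch rather than anything from the lecture), a count-based bigram model just stores how often each word follows each other word and normalizes:

    from collections import Counter, defaultdict

    # Toy corpus; a real bigram table would be estimated from billions of words.
    corpus = "modeling word probabilities is really difficult and modeling language is really fun".split()

    # Count how often each word follows each other word.
    bigram_counts = defaultdict(Counter)
    for prev, curr in zip(corpus[:-1], corpus[1:]):
        bigram_counts[prev][curr] += 1

    def bigram_prob(curr, prev):
        # P(curr | prev) as a relative frequency; zero for unseen pairs (no smoothing).
        total = sum(bigram_counts[prev].values())
        return bigram_counts[prev][curr] / total if total else 0.0

    print(bigram_prob("really", "is"))        # 1.0 in this tiny corpus
    print(bigram_prob("difficult", "really")) # 0.5: "really" is followed by "difficult" and "fun"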
To summarize this part: modeling these word probabilities is difficult, and it scales really badly with the number of elements you consider as context. The question we then ask as machine learning researchers is: can we learn this probability estimation, rather than having to read it out of some enormous table? Before I move on, are there any questions so far on n-grams or on why sequences are hard to model? Cool.

So, can we learn how to estimate this probability? The answer is of course yes, otherwise the lecture would be over, and we're going to talk a little about how we think about doing this. One way of tackling the problem is to say you need two elements: a first element that takes in your context and somehow summarizes it, and a second element that, from this summary, predicts a probability distribution over what the next word should be.

I'm going to treat those two a bit separately. The first component is some function f that, let's say, vectorizes the context: it takes in our words, processes them in some way, and outputs an h, which is just a tensor or a vector, and we want this h to capture whatever context information is in the words we've observed. Basically, we're trying to replace conditioning on the entire context with conditioning on this one h. Now, I don't normally ask questions into the audience, but let's try: does anyone have ideas about what good properties we would want this f to have? We want f to somehow summarize the context, and I've told you why sequences are important, so any suggestions? [Audience suggestions.] Very good: we need to handle variable-length input, so it can't be a neural network that expects a fixed-size vector, because we don't know how long our sentence or context is going to be. And yes, very good, we also need to keep the notion of order, because if I gave the same words in a different order the context would be very different. Those two are very important. I'll give you the other ones, because they're maybe less obvious. Since we're in a deep learning course, we care about the function being differentiable, so we can learn it. One thing that's perhaps less obvious until you work with sequences for a while: you want individual changes to be able to have a very large effect. If I change just one word in a very long context, it can mean the opposite in English, so you want a model that can capture this; the deep learning way of getting it is to have very large networks with lots of nonlinearities that can carve out these strange decision boundaries in a high-dimensional space. And finally, something we'll touch on more later, we want it to preserve long-term dependencies, because language depends not only on the previous three words or even the previous sentence; everything I've been saying so far has depended on almost the first sentence of the lecture, and the model needs to be able to remember that as well.

So that's the first part. Looking back, I've mentioned one method so far, n-grams; how do they do on this list of desiderata? As a reminder, in case anyone was asleep two minutes ago, an n-gram essentially corresponds to a function that just concatenates the n previous words; that's all it does, nothing very clever. Does order matter? In a way, yes, because you always keep the last few words in order. It's not variable length, by definition. It's not differentiable; you're not learning anything. What I'm calling pairwise encoding, the idea that single words should be able to have a big effect, it doesn't really have either. And it most definitely does not preserve long-term dependencies, because by construction we cut the context down to a very small number of words. So that's not a great model.
Another aggregation that people often think about is to just add everything together: you think of f as a big sum, take all the words, lump them together, and say that's your context. This has properties that are almost complementary to the n-gram. Order does not matter any more; you lose it the moment you clump everything together. However, it deals with variable length, since you can add up as many items as you want and nothing stops you; you can differentiate through it; it does not have the pairwise encoding; and it can preserve long-term information, because you're literally keeping everything by adding it in. But you can argue it's not a very smart model, because it essentially recreates the first example I showed you, where all the words were independent and there was no real structure, and we said that's not a good one.

[Audience question.] I think this goes back, if I understand correctly, more to the first part, where we said we had some sort of corpus, a probability table, that tells us not only how frequent words are but how frequent word combinations are; that was the P(x_2 | x_1) table. It's a very intuitive way to think about it, but combinatorially it scales quite badly, which is what we were getting at. What we're hoping is that these learned methods capture that implicitly: they should be learning how often "cat" and "mat" appear together versus not. Does that answer your question? Good.

Cool, so that was part one: we wanted to take our context and encode it in some representation h. Now that we have our context summarized and vectorized, what do we do with it? We want a second function that takes this representation and produces a probability distribution over the next word. The desiderata for this part are much simpler, and I won't dwell on them: again we want individual changes to be able to have a big effect, which carries over from before, and we want it to return a valid probability distribution; that's really our only concern. And it's fairly simple to achieve: you can essentially put a softmax on the end and it does the job.
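For completeness, here is a minimal sketch (plain NumPy, with made-up sizes) of that second component: a linear map from the context vector h to scores over the vocabulary, squashed into a distribution with a softmax:

    import numpy as np

    def softmax(z):
        # Subtract the max for numerical stability; the result is non-negative and sums to 1.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    # Hypothetical sizes: context vector of size H, vocabulary of size V.
    H, V = 16, 1000
    rng = np.random.default_rng(0)
    W_y = rng.normal(scale=0.1, size=(V, H))   # learned output weights
    h = rng.normal(size=H)                     # the context summary produced by f

    p_next_word = softmax(W_y @ h)             # probability distribution over the vocabulary
    assert np.isclose(p_next_word.sum(), 1.0)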
So, in summary, n-grams and the other simple methods we've been talking about don't really meet these requirements for modeling sequences, as we saw in that table of crosses and ticks. The question is how we can build deep networks that actually do meet our requirements. Is it clear so far what the requirements are, and why the methods we've seen aren't great at satisfying them? Cool.

I'm going to move on to probably one of the most important models in sequence modeling, which I'm sure many of you have heard of: recurrent neural networks. Recurrent neural networks are a type of neural network architecture with a specific structure: they have a hidden state h that stores information about what we've seen so far, and it's a state we keep modifying. The way this works is that we initialize h, with zeros or whatever you want it to be, and then we give it some input, a word of our sentence or whatever the first element of the sequence is, and we update h to the next state with a particular function. The update is actually quite simple: there is one weight matrix, part of the parameters we're learning, that we multiply with the previous state, and a second weight matrix, again learned parameters, that we multiply with the input. We multiply one with the previous state, one with the current input, add them, and pass the result through a tanh so things don't go too crazy, and that gives us the next state. So far all we've done is start somewhere and, given that we've seen a single word, update our internal state.

Why is this useful? Because this is exactly step one from before: we've summarized our context. Now we do step two. We take this summary of the context, which at the moment is h_1, multiply it with another learned weight matrix, and that gives us our output probability distribution. The output is a vector forming a probability distribution, so all the values are between 0 and 1 and they add up to 1, and it ranges over the words of the English vocabulary, telling us which is most likely, or whatever the distribution is. We use the softmax here, as I mentioned, because it guarantees we get a valid probability distribution, if you're familiar with it.

Once we have our prediction, during training we just take the next word in the sentence, feed that in, update the current state to get the next state, produce the next probability distribution, and so on and so on. We're feeding one word at a time into our state variable h, updating h, and doing this for however long we want. There is literally no restriction on the sequence length, because we can feed in arbitrarily many words over time, and there is also a sense of order, because the network processes the words sequentially and can therefore keep some notion of their order. Does this make sense? Is everyone on board with RNNs? Cool.

RNNs are often drawn in a compact form where you show the input, the output and the hidden state, and loop over it for the different words; that's the usual diagram. People also talk about unrolling RNNs: whenever you code one up and want to backpropagate through it, you have to unroll it, and we'll get to that in a minute, but it essentially just means taking the loop and spreading it out over the different time steps.
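Here is a minimal NumPy sketch of that forward pass (the weight shapes and names are my own, chosen just to make the recurrence concrete):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    # Hypothetical sizes: input (e.g. word embedding) of size D, hidden state of size H, vocabulary of size V.
    D, H, V = 50, 32, 1000
    rng = np.random.default_rng(0)
    W_i = rng.normal(scale=0.1, size=(H, D))   # input-to-hidden weights
    W_h = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden weights
    W_y = rng.normal(scale=0.1, size=(V, H))   # hidden-to-output weights

    def rnn_step(x, h_prev):
        # h_t = tanh(W_h h_{t-1} + W_i x_t);  y_t = softmax(W_y h_t)
        h = np.tanh(W_h @ h_prev + W_i @ x)
        y = softmax(W_y @ h)
        return h, y

    # Unrolled over a stand-in sequence of 10 random inputs.
    h = np.zeros(H)
    for t in range(10):
        x = rng.normal(size=D)    # in practice: the embedding of word t
        h, y = rnn_step(x, h)     # y is the distribution over the next word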
Cool, so that's our model. Let's quickly discuss the loss and how we optimize it. I don't know how many of you are interested in the mathematical details, and I know there are also people from the public here, so I'll point out the high-level messages; if you want to follow more closely, the equations are on the slides, and please don't be intimidated, they just look a bit scary sometimes. So how do we train this? We haven't really talked about the loss yet; we've only talked about what we want from a model and how we implement it. How do we actually update the weight matrices I showed you? If you think about it, what we're doing is almost a classification task, where the input is the context and the target is whatever the next word is. And for classification, as I'm sure you've already learned, the standard loss function is the cross-entropy: essentially, we take the probability the model assigned to the word that actually comes next in the sentence, take its log, and add this up over all of the words in the sentence; that's our loss. So the loss is fairly straightforward, and, as I said, it depends on some parameters theta, which are the three weight matrices I mentioned: the one that creates the output, the one that takes in the input, and the one that updates the hidden state.

Now, differentiating; bear with me. I'm going to very quickly touch on why differentiating is a bit different for RNNs than in the classes you've taken so far. At a high level, it's different because we have this recursive loop in the middle; that's all you need to know if you don't want more than that. If you want a few more details, let's quickly recap the equations we're dealing with. First, the state update: given some h, how do we get the next one? As I mentioned, we multiply the previous state with one weight matrix and the input with another, and take a tanh. Easy. How do we predict y, which is the same as the probability of the next x? We take the softmax of a weight matrix times the current state. Also easy. And the loss is, as I said, the cross-entropy between our prediction and the real next word. Now we have three parameter matrices to optimize, and I'll start with the easiest one. W_y is fairly easy to differentiate, because when we expand the chain rule we're only differentiating y with respect to W_y, and that's the end of it; if you do the maths, that's what comes out, and you use it to update the weights. It gets trickier when we look at the other two. Let me focus on W_h for a second. The problem appears in the final term of the chain rule: we're differentiating h with respect to W_h, but that h came from another application of the same equation, which contains the same W_h, and so on; h depends on the same weight matrix again and again over time. So you cannot just differentiate with respect to that one appearance of the variable and be done; it's a bit more complicated, and this is called backpropagation through time, which sounds fancy, but all you're doing is unpacking the recursive iteration. The last, intimidating-looking equation essentially shows, and you can almost see it visually, that the derivative breaks down into an explicit term plus something that depends on the previous step, which itself splits again into an explicit term and something depending on the step before, and you keep breaking it up like this until you reach the first time step.
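Putting those pieces in one place (my notation, chosen to match the description above rather than the slides exactly): the state update, the prediction, the cross-entropy loss, and the unrolled gradient are

    h_t = \tanh(W_h h_{t-1} + W_i x_t), \qquad \hat{y}_t = \mathrm{softmax}(W_y h_t), \qquad L = -\sum_t \log \hat{y}_t[x_{t+1}]

where \hat{y}_t[x_{t+1}] is the probability the model assigns to the word that actually comes next, and

    \frac{\partial L_t}{\partial W_h} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial \hat{y}_t}\,\frac{\partial \hat{y}_t}{\partial h_t}\,\left(\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_k}{\partial W_h}.

The sum over k is the "unrolling": every earlier time step contributes a term because the same W_h was used there too, and the product of Jacobians in the middle is exactly the factor that will vanish or explode in the next part.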
All I want you to take away is that you need to unroll over all the time steps; you cannot just differentiate once. There is a summarized version of that expanded equation, and that's what you actually plug into the objective. The point of all this is simply that, when you're dealing with recurrent neural networks, differentiating is not as simple as before. Obviously, with TensorFlow and other libraries it's much more straightforward, because they unroll in time and do it for you, but it's good to be aware that this is a bit different from the things you've dealt with so far. Was that more or less clear, intuition-wise? Are we all on board? Good.

One of the other issues, beyond having to differentiate in this somewhat unusual way, is something called vanishing gradients, which is a big problem in recurrent neural networks, and I'd like to give you an intuition for why. To do that, I'm going to make a very, very simple RNN; this is a gross simplification. Before, we had the hidden state plus inputs and outputs; for a minute we won't care about inputs and outputs, only about the hidden state, and the weight matrix is going to be a single scalar. So we're just taking a state and updating it by multiplying with one scalar, many, many times. If we look at h_t, it's this scalar times the previous time step, which itself is the scalar times the time step before, and so on, so you can rewrite it as h_t equals the initial state times the scalar raised to the power of however many time steps you took. You're multiplying that one scalar with itself over and over and over again, times the initial state. It's clear from this that if the scalar w is bigger than one, your state will blow up towards infinity, because it keeps multiplying itself and grows the more time steps you take; if it's smaller than one, it will go to zero. What does this mean? If the weights of your neural network deviate too much from one, the gradients and the values will explode and go everywhere, and you cannot learn from that. Does this make sense intuitively? Good. If I lose you, honestly, just raise your hand and I'm happy to go over it again.

I've actually made a small example to clarify this a bit more, because some of you might say: OK, Marta, that's fine, but this update is normally bounded by a tanh, so h should stay between minus one and one and not go that crazy. And that's true, we usually do bound it, but what happens is that the gradients are still affected: even if the values of the state stay bounded, the gradients go to zero. So I've written a little piece of code where I run exactly this: I update the state, h equals tanh of some scalar weight times the previous state, very simple, and then I calculate the gradient, which I can do in closed form because the recursion is so simple, and I plot both as I vary the weight.
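The lecture doesn't show the script itself; a rough reconstruction of that experiment (NumPy only, with a printout standing in for the plot) might look like this:

    import numpy as np

    # Toy recurrence from the lecture: h_t = tanh(w * h_{t-1}), iterated T times.
    # For each weight w we track the final state and the gradient of the final
    # state with respect to the initial state, d h_T / d h_0 = prod_t w * (1 - h_t^2).
    T = 50
    h0 = 0.9
    weights = np.linspace(-3.0, 3.0, 121)

    final_states, gradients = [], []
    for w in weights:
        h, grad = h0, 1.0
        for _ in range(T):
            h = np.tanh(w * h)
            grad *= w * (1.0 - h ** 2)   # chain rule through one tanh step
        final_states.append(h)
        gradients.append(grad)

    # The state stays bounded in [-1, 1] thanks to the tanh, but after many steps
    # the gradient is essentially zero except for weights close to +/-1.
    for w, g in zip(weights[::20], gradients[::20]):
        print(f"w = {w:+.1f}   dh_T/dh_0 = {g:+.3e}")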
If you look at the resulting plot, the light-purple line shows the values of the state as I change the value of the weight. Whenever the weight is around one, the values are reasonable, but as soon as I move away from plus or minus one, the state goes to zero or saturates, which is fair enough; at least it's not going to infinity like before. But if I then look at the gradients, that's a different story: the gradients are zero everywhere except quite near the region of minus one and one. And that is very bad, because in machine learning, and deep learning in particular, we depend on the gradients to update our models. If the gradients are zero, there is no signal coming through and the parameters are not going to learn anything. So basically, if we take too many steps and our weights are not near one, we have no gradients, and this is one of the biggest problems RNNs face: it essentially means you cannot take many steps; you can only unroll a handful of times before your gradients vanish to zero.

So if we now score RNNs on our list of desiderata: order matters, which is really good and very important, and variable length is handled; those two we get by design. It is differentiable, again by design. It doesn't do the pairwise encoding yet. And, crucially, it does not preserve long-term dependencies, because, as I said, the vanishing gradients only let us look a certain number of steps back before the gradients go to zero. Does it make sense why RNNs get this kind of score? Nice, good.

So, to summarize: recurrent neural networks can model variable-length sequences, and they can be trained by backpropagation, but they have this vanishing gradient problem, which is a problem for modeling long-term dependencies. Maybe I haven't convinced you yet why long-term matters so much, so let me give you a little example. Imagine I give your language model this passage: "Finally, Tim was planning to visit France on the final week of his journey. He was quite excited to try the local delicacies and had lots of recommendations for good restaurants and exhibitions. His first stop was of course the capital, where he would meet his long-time friend Jean-Pierre. In order to arrive for breakfast, he took the early 5 a.m. train from London to ..."
If you're a good language model, you should look back, spot "France" and "capital", and be able to tell me that "Paris" is probably the most likely answer. And this is a huge context: the relevant words are several sentences back, and these are very long sentences too. So it is important to keep even single words that are quite far away, to keep these long-term dependencies, in order to make accurate predictions. And this is just a small example; you can imagine that if you're writing a book, things sometimes depend on something you said in chapter one. Long-term dependencies are crucial for language modeling, and plain RNNs are definitely not able to capture them. So how can we do that? Are we good so far? RNNs are good, we like RNNs, the losses are OK, we're not intimidated by the optimization? Good.

We're going to move to LSTMs, which are like RNNs plus plus. I'm sure many of you have heard of them: long short-term memory networks, LSTM for short. In order to explain them, I'll very quickly show the RNN cell again, because you should be familiar with it by now: we take the previous state h, we take some input, combine them, pass them through a nonlinearity, and get the next state. Good, because the next model has the same overall structure. An LSTM cell can look quite daunting the first time you see it, but we'll go through it step by step and it's not as bad as it looks. The first thing to point out is that it doesn't only have the state h we had earlier; it also has a long-term state c, usually called the cell state, which it keeps in parallel to h, so we maintain two internal states. And in addition to this cell state, it has a set of gates, which are learned, that modulate how the input is, or is not, applied to those internal states.

Let's go through the gates independently, because they're less daunting that way. The first gate we consider is the forget gate. The job of the forget gate is, given your current input and your h, to decide what you need to forget, or erase, from your long-term memory. All you do is combine the input and the previous h, pass them through a small network and then a sigmoid, so the values are between 0 and 1, and multiply the result with the cell state: wherever there's a 0, whatever was stored there gets erased, because it's multiplied by 0. So this gate regulates how much of the previous information is allowed to pass through and how much we forget. Then we have the input gate: now that we've erased things, we should add some new information. Again based on the current input and the previous state, we create a candidate cell state and gate it with the same 0-to-1 mechanism, to choose which information we actually want to write, and we add that to the cell state; that updates the long-term memory. So in the first step we erase, and in the second we add new information. Finally, we obviously need to update h itself, and this is done, again, using our input, but also using the long-term cell state, so the long-term memory has a say in what gets to stay in our shorter-term memory.
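To make the three gates concrete, here is a minimal NumPy sketch of a single LSTM step (the weight shapes and names are illustrative, not a particular library's API):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # One LSTM step, assuming hidden size H and input size D.
    def lstm_step(x, h_prev, c_prev, params):
        z = np.concatenate([h_prev, x])                   # previous state and current input together
        f = sigmoid(params["W_f"] @ z + params["b_f"])    # forget gate: what to erase from c
        i = sigmoid(params["W_i"] @ z + params["b_i"])    # input gate: what to write to c
        g = np.tanh(params["W_g"] @ z + params["b_g"])    # candidate cell content
        o = sigmoid(params["W_o"] @ z + params["b_o"])    # output gate: what to expose in h
        c = f * c_prev + i * g                            # additive update of the long-term cell state
        h = o * np.tanh(c)                                # new short-term / hidden state
        return h, c

    # Tiny usage example with random parameters.
    D, H = 4, 8
    rng = np.random.default_rng(0)
    params = {name: rng.normal(scale=0.1, size=(H, H + D)) for name in ["W_f", "W_i", "W_g", "W_o"]}
    params.update({b: np.zeros(H) for b in ["b_f", "b_i", "b_g", "b_o"]})
    h, c = np.zeros(H), np.zeros(H)
    x = rng.normal(size=D)
    h, c = lstm_step(x, h, c, params)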
Put together, the cell looks a bit less intimidating if you think of it in terms of the forget gate, the input gate and the output gate, and by using these learned gates the model can learn how to store information that is relevant later on without having the vanishing gradient problem. This is also why LSTMs are essentially at the core of sequence learning: to this day, whenever you're dealing with sequential data in any capacity, LSTMs are usually there; they pop up everywhere, and it's because they're very reliable at doing exactly what we just set out to do. Is it more or less clear why LSTMs are good, how the gating mechanisms work, and so on?

[Question: why do the gates avoid vanishing gradients?] In this case it's because you have a lot of additions and you're not only multiplying; as far as I understand, it's because of how these gates let the information flow: you're not just multiplying the same value in, over and over again. My intuition is that, because you have this long-term module, when you differentiate, you're no longer in the situation where repeated multiplication drives everything to zero; because the cell state is influenced via this addition, the gates basically stop the vanishing gradients from destroying everything, if that makes sense.

[Question about the diagram.] So this is one way of drawing it. What the diagram is trying to tell you is that the candidate has the same shape as the cell state, because we just add it on. The way it's normally done is that you first compute some sort of new candidate cell state, but instead of adding it in directly you have a little gate of its own, and you multiply the two, so you're regulating how much of this new cell state you actually want to pass. You could put those two in one box, but this is how the two intermediate steps are actually carried out.

There are also GRUs, which I should mention, which were developed much more recently and which you can think of as simplified versions of an LSTM: the gating mechanism is a bit simpler. I'm not going to go into the details, but if you're interested they're on the slides and you can go back to them. They're a bit simpler, and I think it depends on the particular application: they train faster, but sometimes LSTMs are a bit stronger, so it really depends which one you should use.

So if we consider LSTMs, and I guess GRUs, which behave very similarly: of course order matters, variable length is fine, they're differentiable, great; still none of this pairwise encoding; but they do preserve long-term dependencies, and that is what we've gained with the LSTM. So we're slowly moving up the tick list, which is really good. In summary for this little section: LSTMs and GRUs overcome the vanishing gradient problem with the gating mechanisms I mentioned, and as a result they're everywhere in machine learning research; I'll come back to that in a minute. Cool, are there any questions about any of this?
[Question about GRUs versus LSTMs.] Yes, both of them have gates; it's just a different gating mechanism, which is arguably a bit simpler. GRUs only have two gates, one called the reset gate and one called the update gate, and it's just a different way of handling the gating. They also don't have this separate long-term state; there's just a single state, but they still avoid the vanishing gradients, because the way the information interacts is again not just multiply, multiply, multiply, so you don't get the same problem, and there's no need for the long-term c cell state. They have roughly half the parameters, they can be quicker to train, and they're simpler, so they may also be simpler to implement: if you write it down, it's a simpler update than the LSTM's gating, which is quite a lot of machinery. LSTMs are more involved, but in my experience tend to be a bit more robust. Cool, any other questions? Sweet, we're doing well for time.

So now we're going to focus on generating sequences. So far I've talked about what a sequence model is and how we train it; now let's assume training is done and ask what we can do with it. We trained our model using this cross-entropy so that it can properly estimate the probability of a sequence, but that on its own is not a super exciting application: if I tell you I have a model that can tell you how likely a sentence is in English, great, but that's usually not what people train these models for. A more exciting application is to use it to generate likely sequences: now that it knows how probable sequences are, it can also generate them, and that's what makes these models interesting. So how would you go about generating? Look back at the RNN; I'm doing this with a plain RNN, but the cell could just as well be an LSTM or any other recurrent mechanism. If I take the RNN from before and I want to generate a sentence, I can feed in my first word, say "modeling", and it outputs a probability distribution over the next word, as I described earlier. But now, instead of feeding in the next word of a ground-truth sentence, because at generation time there is no ground truth, I take the most likely word, or sample a word from the distribution, and feed that in as the next input. So it's autoregressively creating a sentence: that gives us the next probability distribution, we feed whatever comes out back in, and so on, and that generates our sentence, or whatever sequence we're modeling.
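As a sketch (hypothetical function names, assuming some trained recurrent step is available), the generation loop is just a few lines:

    import numpy as np

    # `rnn_step(x_id, h)` is a stand-in for whatever trained recurrent cell you have
    # (RNN, LSTM, GRU); it is assumed to return a probability vector over the
    # vocabulary and the updated hidden state.
    def generate(rnn_step, start_id, h0, max_len, rng=np.random.default_rng()):
        tokens, h, x_id = [start_id], h0, start_id
        for _ in range(max_len):
            probs, h = rnn_step(x_id, h)             # distribution over the next word
            x_id = rng.choice(len(probs), p=probs)   # sample (or use argmax for greedy decoding)
            tokens.append(x_id)
        return tokens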
I'm going to go through several applications and examples, and in each one I'll try to introduce something new. I've been talking a lot about language as sequences, so I'd like to talk about images as sequences of pixels. This is a model that appeared a while ago now, in 2016: a generative model for images called PixelRNN, and essentially it treats an image as just a sequence of pixels, which is a totally valid way of describing an image if you think about it. Very much like before, where we modeled the probability of a word given all the previous words, we now model the probability of a pixel given all the pixels we've seen so far: what should the colour of this pixel be, given that I've seen all the previous ones? In action it works more or less like this: we start with some distribution over what the first pixel of the image is, sample from it, and maybe get a dark pixel; then, to sample the second one, we sample from the probability of the second pixel given the first, and then the next, and so on, just like the chain rule we used at the beginning, advancing slowly along the image and always sampling the next pixel given the previous ones. What's interesting is that the model learns, from the context of all the previous pixels, what the probability distribution over colours for the next one should be. In one particular case it's very likely to be a green pixel, so the distribution is mostly centered and quite spiky on green; in another example, where it isn't clear, assuming this is a bird, whether this is already where the bird starts or whether it's still the grass, the distribution is much more uniform over the possible colours. So depending on the context, the distribution changes quite a lot. And this generates images which, if you stand far enough back, look quite realistic. Nowadays state-of-the-art models obviously produce much better images, this was at the beginning, and even though these aren't perfect, and some are blurry and a bit bizarre, you can see that the model has learned some sort of distribution over what natural images look like, especially considering it was only conditioning on the upper half of the image; at the time this was quite impressive. And of course you don't always have to condition in the order we just described, pixel by pixel; that's one of the nice things, you can condition in any order. This is not language, where we need the specific order of first word, second word: you could define your ordering by clusters, or go up and down, or whatever; as long as you define an order, the model should learn to use it, which makes these models quite flexible. Does PixelRNN make sense?

[Audience question about whether later pixels are predicted better.] I don't know that it's necessarily about "better", because the model could learn to predict the second pixel from the first really well; imagine it learns a really good approximation for that. What I would say is that, if you were to sample different pixels, the variability is much higher at the beginning, because given only the first few pixels there are still a lot of images that could come out, whereas once you've committed to that bird image, you can say "well, that position is probably going to be green grass". And that's what was reflected in how narrow or broad those distributions were, so rather than "more accurate" I would say the model becomes more certain of what the next pixel should be. Cool.
OK, so that's that. We've already talked a lot about natural language, and it is the most obvious application that comes to mind when you think about sequences; we've covered RNNs and LSTMs, but I'd like to talk about sequence-to-sequence models, which some of you might also have heard of, because they have a lot of applications in industry as well. Think about the RNN model I've been describing: we have these hidden states that get updated with the input and produce some output, and initially I told you that the original state h_0 can be initialized to whatever you like; normally people just use a vector of zeros, which is fine because it's your first state. But this doesn't need to be the case: you can pass in any information you want your model to start with. So imagine I give it some additional context that gets processed and passed through the states, and only once it has seen all of that context do I want it to start generating. You might ask when you would ever want to do this. Well, imagine you want to translate a sentence: I give the model the sentence in the original language as the context, and all it does is summarize that sentence into the initial state; it packs the whole source sequence into the initial state. Then, starting from a special start token, I tell it that the target is, say, the Japanese translation of that sentence, and it produces the output sequence autoregressively, the way you would generate a normal sentence, except it's not generating just any random sentence: it's generating the sentence conditioned on the context you gave it.

Sequence-to-sequence models have been very powerful, and this way of incorporating knowledge into a model has been very useful. What also makes it powerful, and opens up the possibilities with RNNs, is that you can really condition on whatever you want and add information however you want. The simplest RNN has an input, a hidden state and an output at each step, fair enough. But you could also imagine passing in a single input and asking the model to produce lots of different outputs sequentially, or the other way around: giving it lots of inputs and only wanting one output at the end of the whole thing. And the sequence-to-sequence pattern is the one where the model reads several inputs without producing anything, and only then starts producing outputs. So this is a very flexible way of thinking about how to use RNNs and how to give them extra information. Sequence-to-sequence has been used in a ton of applications because of this flexibility: MT, machine translation, is at the top of the list and is very widely used in Google's translation systems; image captioning, which I'll show some cool examples of in a minute; but also speech, parsing, generating dialogue, generating videos, and more; there are a couple of references on the slide if you're interested. Clearly this is applicable to a very wide range of tasks.
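As a sketch of the idea (hypothetical encoder and decoder step functions, assumed to be already trained): the encoder consumes the source sentence, its final hidden state becomes the decoder's initial state, and the decoder then generates the target autoregressively.

    import numpy as np

    def translate(encoder_step, decoder_step, source_ids, start_id, end_id, hidden_size, max_len=50):
        h = np.zeros(hidden_size)
        for x_id in source_ids:             # encode: summarize the whole source sentence into h
            h = encoder_step(x_id, h)
        out, x_id = [], start_id
        for _ in range(max_len):            # decode: every step is conditioned on that summary
            probs, h = decoder_step(x_id, h)
            x_id = int(np.argmax(probs))    # greedy decoding for simplicity
            if x_id == end_id:
                break
            out.append(x_id)
        return out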
Let me give you the example of Google neural machine translation, which is one of the more relevant applications of sequence-to-sequence, also from a while ago now. They do essentially what I just described: you have some context sentence in the source language, you encode it into the state, and then you feed that, in a specific order, to the decoder RNN, which produces the output sentence. And this GNMT, the Google neural machine translation model, improved translation performance by a lot: if the blue line is the old phrase-based systems and the yellow is human performance, the new model in green improves on the baseline by a lot, and across languages as well, which was very impressive. So this was a very good way of closing the gap between machine and human translation.

Also very cool is what else you can do, because, as I mentioned, this setup is quite flexible. So far the context has been a sentence in some language, which we ask the model to predict in a different language, but why does the context have to be language in the first place? It could be an image. You can take an image, pass it through some neural network, such as a convolutional neural network, to create your context state, and from there have the model generate a sentence that describes that image: this is image captioning. The examples of this kind of image captioning are actually super impressive, and one of the points I want to get across when I show them is that there is still a lot of variety between the best model, after hyperparameter tuning, and the initial model you get when you first train it without that tuning; it's clearly very sensitive to it. The initial model says "a large brown dog laying on top of a couch", which is fair enough to a certain extent, while the best model says "a small dog sitting on a chair". We have "a man cutting a cake with a knife", which is not really right, versus "a man holding a sandwich in his hand", which is good. "A pizza sitting on top of a white plate", which is completely wrong, versus "a person is cooking some food on a grill". And "a close-up of a person eating a hot dog" versus "a woman holding a banana to her face", which is really good; if you look at that middle example, it's pretty impressive. All of this goes to say that these systems are still quite fiddly, they require a lot of hyperparameter tuning and don't just work out of the box, but when they do work, they're pretty good. So that was sequence to sequence.

[Question: do you train the convolutional network together with the RNN, or use a pretrained one?] Well, it could go either way. You could train it from scratch, train them together, and it should learn a CNN that extracts meaningful representations. Often what people do is use pretrained models, simply because if you can go online and get one of these pretrained networks, a GoogLeNet or whatever, it saves you a lot of compute, especially if the CNN is huge; you plug that in, stop the gradient, and only update the sequence part. But it could work either way; the pretrained network just saves you some time.

[Question about other conditioning orders for PixelRNN.] I think that would be one of the use cases; I don't know that it has necessarily been applied like that, but as I was mentioning, you could condition on clusters, so you could have very rough clusters first and then effectively increase the resolution; that could work. I don't think anyone is going to use PixelRNN to generate images today, because there are far better generative models by now, you'd use GANs or something else; this was more to show that we have models that can sample from learned conditional probability distributions.
Yes — I think that would be one of the use cases; I don't know that it's necessarily been applied, but when I mentioned earlier that you could condition on clusters, that could be almost what you're describing: you have very rough clusters and you want to discretize them more finely and increase the resolution. That makes sense — you could do that. I don't think anyone is going to use PixelRNN to generate images any more; there are far better generative models by now — you'd use GANs or something else. This was more to show that we have models that can sample from, and model, conditional probability distributions. And as a matter of fact, a version of PixelRNN then led on to a precursor of something quite different, which I'll talk about in a minute, that creates audio signals. But PixelRNN as such isn't really used; it's just an example of how you can model conditional probabilities, in a rough way — as you saw, the images weren't phenomenal.

Yes — the same way you would train the RNN. For the first step you feed it nothing — you can feed a vector of zeros — and then you expect the model to give a pretty good first word, which realistically would be something like "the" or "a", and you take the loss between whatever it outputs and the actual ground-truth sentence, because you assume you have a labeled dataset. Yes, exactly — it's still the cross-entropy: you use whatever the caption of the image is and try to recreate that sentence for that particular image. So it's true that the training is tied to the one sentence that goes with the image, which isn't super flexible — obviously a human could describe that one image in a million different ways, but you're training it on that one particular caption. The hope is that if the model sees enough images of cute little dogs, eventually it learns different ways of describing dogs. Does that make sense? Oh, there was a question over there as well.

Oh yeah, sorry — I mentioned it earlier, and it's maybe not clear when I draw it like this, because this is only showing the recurrence, but this is definitely an LSTM. It's a bit simpler than if I drew in the different gating mechanisms, the cell state and whatnot, but all of these are LSTMs. You could do it with plain RNNs again, but it would very quickly stop doing well — so yes, LSTMs. Well spotted. Cool, sweet.

Audio waves. This goes back to WaveNet, and to what it has been used for more and more efficiently — actually, I take that back, because the approach was later changed: it was originally used that way, but it was then changed to convolutions, which is exactly what we're going to talk about now. You can of course also think of an audio wave as a sequence, and in this case we're not going to use an RNN, and we're not going to use an LSTM either; we're going to use convolutions, which you've already covered in a previous lecture. The way this works is that you have your input at a very high resolution, you pass it through several layers, hoping that each layer abstracts things away, and the top of the stack produces your next point. You do this with convolutions whose windows get increasingly large, with increasingly few of them per layer — these are called dilated convolutions. It's a genuinely different way of thinking about how to deal with sequences: it has the fixed structure of a convolution, so it will only ever look at a fixed horizon, but that horizon is still quite large, and you get different levels of hierarchical abstraction across the layers. This particular model, WaveNet, is the one that was developed for the Google Assistant voice — whenever you chat to it, that's this model generating audio from text — and it has managed to scale up really well; these convolutions run really, really fast.
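A minimal sketch of the dilated-convolution idea, again in PyTorch with made-up sizes. This is not the real WaveNet architecture — it has no gated activations, residual or skip connections, and no output distribution over samples — it only shows how stacking causal convolutions with exponentially growing dilation lets each output see a long stretch of the past.

```python
# Stack of causal, dilated 1-D convolutions (illustrative only).
import torch
import torch.nn as nn

class DilatedStack(nn.Module):
    def __init__(self, channels=32, num_layers=6):
        super().__init__()
        self.dilations = [2 ** i for i in range(num_layers)]   # 1, 2, 4, 8, ...
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in self.dilations
        )

    def forward(self, x):                       # x: (batch, channels, time)
        for conv, d in zip(self.convs, self.dilations):
            # left-pad by the dilation so the convolution is causal:
            # the output at step t only ever sees inputs at steps <= t
            x = torch.relu(conv(nn.functional.pad(x, (d, 0))))
        return x                                # same length, long receptive field

net = DilatedStack()
audio = torch.randn(1, 32, 1000)                # a fake "audio" feature sequence
out = net(audio)                                # causal, covers ~2**num_layers steps
```

With kernel size 2 and dilations 1, 2, 4, ..., the receptive field roughly doubles with every layer, which is what lets a fixed, fast convolutional stack still cover a long horizon.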
If I compare these convolutions to the other models we've talked about so far, they're actually pretty similar to LSTMs: again, order matters; they handle variable lengths, because you're sliding the convolutions along; and they're differentiable, so we can learn them. Originally I had marked long-term dependency as only a half tick, but they do preserve long-term structure, because the heavily dilated convolutions look far back and summarize everything into one point. That said, it is a fixed-size window into the past, so if something lies beyond that horizon you're not going to see it — give and take. So that's audio.

Another area where sequences are quite important is policies. I don't know if anyone here is interested in RL, but wherever there's a policy, or any kind of sequential decision-making, sequences of course play an important role. This starts even before RL proper, in tasks such as generating images: for example, we can have models that sequentially decide where to draw on a canvas, or where to focus their attention on a canvas, in order to generate images, and those are all sequential decisions that have to be made. A really cool application that came out recently is SPIRAL, shown here, which learns to generate images; but whereas most models generate images pixel by pixel, deciding what color each pixel should be, this one actually drives a computer drawing program and sequentially decides which brushstrokes to make. So it's drawing the way a human would draw, and it creates pretty good faces given that it is only allowed a few strokes — it's trained on CelebA, so you can see it has learned how to draw human faces, and every brushstroke is an output of our sequence of actions.

Then, of course, there are the big RL applications. There's OpenAI Five, which was state of the art in Dota and, again, depends very heavily on LSTMs and on being able to handle sequential input data. And there was AlphaStar from DeepMind, which was state of the art on StarCraft. To go into a little more detail, this is what the AlphaStar architecture looked like, and again it boils down to the same thing: you get in some observations, they're processed in some particular way, but at the core there is an LSTM taking those observations and making decisions. That was actually quite a crucial part of the agent used for AlphaStar, and it would then output the actions.
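The pattern described there — encode the observations, pass them through an LSTM that carries state across timesteps, read out action scores — can be sketched in a few lines. This is only a schematic illustration in PyTorch with invented names and sizes; the real AlphaStar and OpenAI Five agents are vastly more elaborate.

```python
# Schematic "recurrent policy core": observations -> LSTM -> action logits.
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim=100, hidden_dim=128, num_actions=10):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden_dim)       # observation encoder
        self.core = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs, state=None):
        # obs: (batch, time, obs_dim); `state` carries memory between calls
        z = torch.relu(self.encode(obs))
        core_out, state = self.core(z, state)
        return self.policy_head(core_out), state            # logits per timestep

policy = RecurrentPolicy()
obs = torch.randn(1, 5, 100)            # five observations in a row
logits, state = policy(obs)             # `state` can be fed back in next time
```

The important part is the recurrent state: it is returned and fed back in at the next step, which is what lets the policy condition on the whole history of observations rather than just the current one.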
So that was RL, or sequential decision-making. I'm going to move on to transformers for the end — are there any questions on the previous topic? Cool. So, finally, the last example I want to give is transformers. I believe you're going to have a separate lecture on transformers, so this is just meant to give you a bit of quick intuition, but transformers are obviously closely related to sequences, so I want to mention them briefly.

This is how I like to think of transformers, or how I would explain them. You've already seen convolutions: a convolution is made out of, in this case, three weights, and those weights are slid along the input — it keeps sliding along the image, but it's always the same weights that move along. A transformer is a bit different. Instead of focusing only on the little subset of the input that the convolution covers, it takes the whole input, all of it, but what it learns is which elements within that input to attend to, and that is reflected in these weights: depending on how much you want to attend to each element of the input, its weight will be stronger or weaker. Crucially, as you change the point you're currently generating, these weights change too, so depending on where you currently are, the model learns to attend to different elements of the sequence. That's the difference: in a convolution the weights don't change; here they change as a function of the input.

And this can be used for generation with a transformer. The way it usually works in language is that you have your input words and you first create a contextual representation, in which all of these words interact with each other according to weights that come from the learned attention — depending on how relevant they are to one another, you get different pairwise interactions. Then, just as with the LSTM, you use that as the context and start generating the next word from those representations, the embeddings, the past states and so on. You can see it a bit in the width of these lines — they're meant to represent how much weight the individual elements carry. And that's how you can use this for generating language as well.
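To make the "weights that change as a function of the input" point concrete, here is a bare-bones single-head self-attention layer — an illustrative PyTorch sketch with made-up dimensions; real transformers add multiple heads, positional information, feed-forward blocks and causal masking for generation.

```python
# Single-head scaled dot-product self-attention (illustrative sketch).
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** 0.5

    def forward(self, x):                        # x: (batch, time, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # every position compares itself against every other position,
        # so the attention weights depend on the input itself
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v                          # weighted mix of all positions

layer = SelfAttention()
words = torch.randn(1, 7, 64)                    # seven embedded "words"
mixed = layer(words)                             # each word now attends to all
```

The `attn` matrix is recomputed for every input, which is exactly the contrast with a convolution, whose weights are fixed once training is done.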
And what's really cool is that these models are really good — transformers meant a big improvement for language generation. I'm sure you may have heard of GPT-2, the model that came out last year from OpenAI. It was essentially a huge transformer-based language model with 1.5 billion parameters to optimize, trained on a huge dataset of around 40 gigabytes of text from a lot of websites. What was really cool is that this was a language model that was not only generating good language, good sentences, but also adapting to the style and the content of whatever context you gave it, which I find really impressive.

Let's look at some examples — this is one of my favorites. They gave the model this passage as the context to start from, and then it predicted the following sentences. The context was: "In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English." So this is a prompt written by a person who was probably trying to come up with the most random sentences possible, and the model's continuations included, for example, "The scientists named the population after their distinctive horn. These four-horned, silver-white unicorns were previously unknown to science", or "Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved" — which is super impressive if you consider how bizarre the prompt is: it keeps up with the style, sounding almost like a journalistic text. The sentences it creates are not only long, they also make sense contextually, and as you read on it keeps referring back to its own text. This is very much next-level text prediction.

If we look back at our table, much like the LSTM the transformer fulfills the same requirements, but it also has the added pairwise encoding I mentioned at the beginning, where individual words affect each other: you compare every word with every other word and, depending on that, weigh how much they should influence each other. That is what gives transformers a very big advantage.

Just to look back — I like looking at these things and seeing how far we've come — I want to compare two sentences. The first was state of the art in 2011, which is not that long ago if you think about it, when this whole business of sentence generation with RNNs came out; they weren't really using any context, and the sentence produced was "while he was giving attention to the second advantage of school building a 2-4-2 stool killed by the culture saddled with a half suit defending the baha'i idea for annals office", which sounds like it could be something, but it really isn't. And yet this was in the paper — it was a moment of pride at the time; when RNNs came out this was really impressive. Now look at what GPT-2 is doing. In this example it was given the synthetic context "Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today", and the model produced "The singer was wearing a black hoodie with the label 'Blurred Lines' on the front and 'Fashion Police' on the back", which is insane: this is not only a much better sentence, it has pop-culture references, it matches the style of the tabloid, and so on. To me, the eight years between those two is super impressive — the fact that this is where we've gotten to. So this is quite exciting. There is obviously still a long way to go — we're still only dealing with sentences, we're not writing books and whatnot — but slowly, I think, we're getting closer.

Just a summary, because this has been quick: I want to emphasize what we went through and what should hopefully stay in your mind. I started by motivating why sequences are important, why they're everywhere, why they really matter to machine learning research and why we care about them. You should also be convinced that modeling the probabilities of these sequences is quite hard. We covered different approaches, from the very basic and original ones like n-grams to more complicated deep learning ones like RNNs and LSTMs, and moved on to some examples with dilated convolutions for audio as well as transformers at the end. Hopefully I have convinced you that these models are not only good models but also fairly flexible: because they deal with sequences, they can be applied to a whole range of tasks, not just natural language generation or audio, but a wide range of problems in machine learning. And with that I'd like to thank you — and are there any questions? [Applause]

Yes — so the question, to repeat it because it was quite quiet, was whether the kinds of models we've talked about can only capture local consistency, whereas if you write a novel you want global consistency, an actual storyline and so on. I think you would probably need a combination of a few things. This has clearly shown that it gives you local consistency, but you could imagine mixing it — off the top of my head — with some kind of external memory module.
That memory module would store something more persistent about the text, so that you could at least be consistent across more than just a few paragraphs: you could learn to save some crucial elements and then read from them, condition on them, every time you produce something — maybe it would learn that there are certain characters, and so on. You would probably have to make it more hierarchical. If you look at the progression, the earliest models were basically consistent over a sentence or two, and now the model really is writing coherent paragraphs, and I think that's because you're abstracting away more and more; you just need to learn how to keep doing that. So there's nothing saying you cannot scale this up to higher levels of abstraction — it's not clear exactly how, but it could even be the same principle, just beefed up to a certain degree.

Yes — so this is a problem that I think has more to do with the fact that we're using deep learning, and deep learning is very hard to reconcile with symbolic learning — which is actually something I've been really interested in. With deep learning, all you can hope is that the model learns a pattern from the data; if it doesn't, it's not going to abstract it, it's not going to reason at the higher level of abstraction you'd like, the way we humans do. We can't really make any claims about what level it is reasoning at, and clearly, as you said, from this example it doesn't seem to learn concepts the way we do and then apply them. In that case you would have to go to other methods if you want it to actually understand concepts — learn the concepts first and then maybe use those concepts to generate sequentially. But for sure, at the moment it's just pattern learning. And also, the representations we obtain in deep learning are very under-constrained: you're letting the model choose whatever it wants to do, so it's very unlikely they will, by chance, end up being nicely symbolic and the best way of summarizing whatever you're getting. You could argue that symbols are a very efficient way of compressing information, but neural networks usually get away with compressing in some other way, which happens to be fairly arbitrary and then generalizes badly.

That's not a stupid question — that's the golden question in machine learning. It was one of the hyperparameters: when I was showing you the different image captions and saying some are better than others, that's what happens when we choose different hyperparameters, and that's not only the learning rate; it's also the size of your representations, how you wire your network up, how many iterations of this and that, whether you add extra loss terms, and so on. The size of your hidden state is one of those variables, and it's not trivial, because it's also been shown — for certain problems, not this particular one — that you only need a certain number of bits to encode something in a neural network. So you would hope that if you make the bottleneck small enough, the network should still learn it, but in practice it often doesn't; whereas if you give it a larger state and then reduce it, that works.
So even though the problem at the end may not need that much information capacity, during the training process it often seems to, and it's very hard to tell in advance: even if you know how many bits you would ultimately need to encode the thing, that doesn't necessarily tell you much. Usually you say bigger is better and give the model as much capacity as you can, but then, obviously, it also trains much more slowly.

Long story short, this goes back a bit to the earlier question about symbols: because there aren't many constraints on these hidden states, they are often very entangled representations that are not very human-interpretable. In general there are research groups focusing on what information we can get out of them — looking at unsupervised clustering methods, looking at correlations and so on — to understand, as you say, how they are working. But it often turns out that the findings are quite specific to the one model you trained; if you train a different model it may work completely differently, even if it's the same architecture with just a different initialization. So as for that kind of intuitive understanding, we're still quite early on — it would be great, but I don't think we have it yet for this type of network.

Yes — you do, because you're comparing all-to-all, basically, and that's another advantage and also why this works so well: you are comparing the words, but only at the input level, so it's not as though you're comparing against the whole corpus of the English language; you're just comparing your sentence, which is a fairly small number of words, against whatever you're producing. Does that make sense? So yes, it scales worse, like you say — you are definitely comparing all-to-all.

As for time series, transformers by now have been applied to pretty much everything. I don't know of particular examples off-hand, but I'm sure that if you look you'll find them, and it makes sense, because if there are any patterns or regularities in the time series, transformers very quickly pick up what to attend to — they would learn these weights over the comparisons with the previous sequence. So yes, there have to be; we've tried a couple of things, but nothing that's especially relevant here.
Info
Channel: DeepMind
Views: 36,873
Rating: 4.9205775 out of 5
Keywords: Artificial Intelligence, AI, Deep Learning, Lecture, DeepMind, UCL, Machine Learning
Id: 87kLfzmYBy8
Length: 80min 27sec (4827 seconds)
Published: Mon Jun 01 2020