039 - Lena Voita - NLP

Captions
[Music] We'll quote you on that, you know. We'll just put this clip live: "in mathematics you don't need to know anything, you don't need to do anything", by Lena. And the funny part is that it fails sometimes, so it just teases you randomly. Well, I guess we'll have to code it up at the weekend.

Hello folks and welcome back to the channel. Today we're going to be talking with Lena Voita, but before then I wanted to spend about ten minutes talking about some of the confusion I've noticed arising from the Kenneth Stanley episode. If this is interesting to you, please watch it; otherwise you can skip forward on the table of contents to Lena Voita.

I wanted to clear up a few things. The first is that if you take one thing from that episode with Kenneth Stanley and his ideas, it should be the concept of deception. Deception is when your objectives, or stepping stones, lead you in the wrong direction, assuming that your goal, which is also a stepping stone, is even the right direction to be going in the first place. It's a really important concept that any ambitious search problem has deception, which means we cannot trust the gradients of the stepping stones. Even reinforcement learning algorithms acknowledge this with reward shaping: reward shaping is about creating intermediate stepping stones where you can trust the gradient. Any goal that requires knowledge we don't yet have, and knowledge is the same thing as a stepping stone, for example "how do we build artificial general intelligence?" or "how do I become a millionaire or a billionaire?", is a deceptive goal, which means we cannot trust the gradients of the stepping stones; there's a false compass, as discussed on the show.

Some goals are surprisingly undeceptive, like landing on the moon, building rockets, building a bridge, or some software engineering projects, and that's because we already understand what all of the intermediate stepping stones are. In the case of landing on the moon, some folks in China about three thousand years ago discovered the key stepping stones that make the problem tractable, so now we can monotonically improve on the knowledge we already have and we know we can get to the moon. It's a different story for artificial general intelligence: if we monotonically improve on GPT-3, it's a false compass; it will not lead to artificial general intelligence.

Almost everything that happens in the work environment at the moment operates on the principle that we should exploit what is already known and not waste precious time with deviations and people self-ideating and innovating. I see this a lot, because I've managed teams of data scientists and machine learning folks, and all they want to do is build their resumes, experiment with things and be researchers, and I can completely understand why a lot of senior leadership in large corporations do not want that to happen. It's horrifying, because you think, my god, people are wasting time and money on things which are taking away my predictability. So there's a huge dichotomy between science and engineering: science is about exploration, engineering is about exploitation, and you can argue that in most businesses the focus is on exploitation.
If you do any MBA course in business, the first thing they tell you is that it's all about predictability: reducing surprise, convergence, and increasing velocity. That's what you learn on an MBA course. I was in a presentation about this the other day, and the key to being a good stakeholder, I was told, is to be predictable and obedient: you treat other stakeholders like black boxes, you give them the information you want to give them, you learn their behaviour a bit like you would a machine learning algorithm, and then you can predict their output. It's a similar thing when you have direct reports: you want them to be as predictable as possible. It's no coincidence that in the corporate world a certain behaviour of obedience and consistency emerges, and it's all about scale, because to do things fast in a large-scale organisation, with role fragmentation and multi-disciplinary teams, you need predictability, and that's why you find a certain type of person in these institutions. If you want to scale yourself in a large corporate, what do you do? You make yourself transparent and as predictable as possible. There's a test: what would Tim do? If people can answer what Tim would do, that's a good sign you can scale yourself. If I know what my reports will do, that's a good sign I can delegate to them and I don't need to be in the meeting anymore, because I can trust that they're going to do the right thing.

None of this should come as a surprise, but for good or bad, it's the opposite of stepping-stone collection. This is all about velocity and exploitation; it's the opposite of innovation. Innovation is about surprise. A lot of MBA folks will tell you they're creating an innovative culture; they're not, they're doing the complete opposite: they're creating an agile, high-velocity culture. There's a difference.

When you read Kenneth's book, Keith said he was making quite a hyperbolic, black-and-white argument, but I don't agree. The most important thing Kenneth was pointing out in "Why Greatness Cannot Be Planned" is that there is a tyranny of objectives, and that's not hyperbolic if you understand what he meant. In science, in institutions, in the business world right now, even in education, there is no divergence, and the reason is that the objectives are fixed. We get teachers to monotonically improve on grades, and that is convergent behaviour. Divergence is all about discovering new problems; it's about problems, not solutions. So what should be clear is that if you fix the objectives you can't be discovering new problems; in fact it blinds you to the discovery of new problems. That's the really key concept. Yes, there is certainly some exploration out there, because some people in their garages are following their own gradient of interest without receiving any money from the government, but I think what Kenneth is saying is that as a society we are not discovering new problems, because we are so focused on solutions. He's saying that governments and large institutions should pour huge amounts of money into discovering new problems.
The ironic thing, the paradox, is that it's only by looking for new problems that we will discover better solutions to the problems we already have. That's it.

The other misconception is that people think Kenneth is arguing we should abandon objectives and not use them at all. That's not true. In some of Kenneth's early work on quality diversity he was making the argument that we should still use objectives, just a different type of objective: one with a kind of behavioural characterisation, or entropy, which we can increase monotonically, which means we can trust it better as an objective because it's less deceptive. Basically, all of the early work in quality diversity was about creating objectives with a better gradient, with less deception. What he's saying now is that we should flip the script: focus on problems, not solutions, and have as many objectives as possible, because a system that has a panoply of objectives has no objective at all. That's the key thing. Just like evolution, we have a panoply, a cacophony of niches, people following their own niches and their own gradient of interest. That's what he thinks we should do now, and in the context of machine learning he's advocating for meta-learning: he's saying we should meta-learn new objectives. People have asked, does Picbreeder have an objective? Again, it's the same thing: every single person in Picbreeder is allowed to follow his or her gradient of interest to its logical conclusion. It's certainly true that the gradients will be co-linear in some sense, because they're all going to be earthly gradients, but there is a divergence to it because people won't recognise the images, and only when the images become recognisably earthly does it become convergent behaviour. We'll talk about Picbreeder in a minute.

Deception is the single most important concept. If you take nothing else away from the Kenneth Stanley video, please let it be deception. Once you clearly understand this problem, you will question gradient-based methods for the rest of your life; you just can't unknow this information. It's a little bit like when Walid Saba came on the show and said that GPT-3 doesn't understand that when you say "the corner table wants a beer", the corner table is a person, because of the missing information problem: in data-driven statistical machine learning methods the information is not in the data. Once you know that key concept you recognise the problem everywhere; it sticks out like a sore thumb. It's the same with deception: once you recognise it you'll see it everywhere. It's a huge problem, but at least you'll know where the landmines are.

So, coming back to Picbreeder, why don't we look at a phylogeny? This is one of the pictures that Kenneth himself bred, and it looks like a pig, or some kind of animal, and we can trace all of its ancestry back to the beginning. The key concept here is deception in a search space: what Kenneth is pointing out is that the intermediate stepping stones look nothing like the end result. This was truly a divergent process; at the beginning it was just individual people following their own gradient of interestingness, and it's only at a later generation that it suddenly became something recognisably earthly.
That point of the phylogeny was divergence, and this later point is convergence, and it's really important that we make this distinction. Let's use the example of rocketry. The Chinese, a few thousand years ago, were experimenting with all sorts of random things with fire and water and goodness knows what else; I think there's something here called a surface-running torpedo, and it's only at some point in the lineage that it started to resemble, conceptually, what a rocket is. Once you get to the key stepping stone where you can converge, where you can trust the gradient of monotonic optimisation, it becomes an engineering problem. That's what they did in the 1960s: they had all the knowledge, they knew what the stepping stones were, they could use gradient-based optimisation, and they put a man on the moon. This is basically what we do in the work environment at the moment: we know what the stepping stones are and we are ruthlessly improving the velocity of the process, because it's an engineering problem. I personally believe GPT-3 is at a stage where we have a stepping stone, we have an objective, but it's a deceptive objective: if we optimise on this gradient, we're going to go in the wrong direction, because artificial general intelligence is somewhere else entirely.

Another key question is whether Picbreeder is a good analogy for real life, because a lot of the argumentation hinges on that. I strongly believe it is, because we know for sure that many ambitious search problems, even in computer games for example, have deception. So if deception exists, and gradients have a false compass, why are we using gradient-based methods to find good solutions to these problems? It's completely barmy.

I wanted to give another example: the famous skull in Picbreeder. As Kenneth was saying, the early stepping stones do not resemble the skull at all; if we trace it back to the beginning it's just a weird gradient pattern. The way NEAT works is that it hinges on a principle Kenneth has observed in evolution, that there is a monotonically increasing arrow of complexity, and that's why the topology of the underlying CPPN, the compositional pattern-producing network, increases in complexity. When we look at the phylogeny, we have some stepping stones that look nothing like a skull, and then all of a sudden we have something that looks very anthropomorphic at the fourth stepping stone, and at that point it becomes a convergent process: you can see in Picbreeder that loads of people branch from that anthropomorphic image and create lots and lots of interesting children. So at that point it's convergent, but before that it was divergent, and the irony is that if you tried to find an anthropomorphic image directly you probably wouldn't find it; you're more likely to find it by accident, by following your own niche, your own gradient of interestingness. That's the key thing with Picbreeder.

Now, it's interesting to have a quick look at this as well. As I was saying in the previous episode, a pattern-producing network is just a neural network that takes a spatial input and gives you an RGB output.
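To make that concrete, here is a minimal sketch of the idea in Python, assuming nothing about Picbreeder's actual implementation: a tiny network that maps pixel coordinates to a colour. In Picbreeder both the weights and the topology of this network are evolved with NEAT and steered by people; here they are just random placeholders.

# Minimal CPPN-style image generator: a toy illustration, not Picbreeder's code.
# A compositional pattern-producing network maps spatial coordinates to colour.
import numpy as np

rng = np.random.default_rng(0)

# Random weights for a tiny two-layer network; Picbreeder/NEAT would evolve
# both the weights and the topology instead of fixing them like this.
W1 = rng.normal(scale=2.0, size=(8, 3))   # inputs: x, y, r = sqrt(x^2 + y^2)
W2 = rng.normal(scale=1.0, size=(3, 8))   # outputs: r, g, b

def cppn(x, y):
    r = np.sqrt(x**2 + y**2)
    h = np.sin(W1 @ np.array([x, y, r]))   # periodic activation -> repeating structure
    rgb = 1 / (1 + np.exp(-(W2 @ h)))      # sigmoid squashes the output to [0, 1]
    return rgb

# Render a 64x64 image by querying the network at every pixel coordinate.
coords = np.linspace(-1, 1, 64)
image = np.array([[cppn(x, y) for x in coords] for y in coords])
print(image.shape)  # (64, 64, 3)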
You might be interested to know what the topology of that network was for the skull pattern. This is the skull-generating pattern-producing network, and it's absolutely fascinating: I think the colour tells you the weight of each connection, and these nodes are different activation functions and neurons, and as you can see this produces the skull pattern. It was learned through a combination of the NEAT algorithm and human augmentation, with folks following their own gradient of interest. Absolutely fascinating.

The other thing I wanted to do is take you folks on a bit of a tour of some of the approaches for doing novelty search, and the lineage of those approaches. On the show we were quoting the paper "Abandoning Objectives: Evolution through the Search for Novelty Alone", one of the seminal early pieces of work in quality diversity and novelty search, and I think when Kenneth wrote "Why Greatness Cannot Be Planned" he was probably thinking along these lines. In the paper they have an example of a hard maze which has deception, and they show what a genetic algorithm using NEAT, with the distance to the goal as the reward, would do. But I'm going to fill in the gaps: I've imagined what it would look like if you tackled the hard maze with a reinforcement learning agent using epsilon-greedy exploration. The whole point is that search problems have deception, so if you have the wrong objective, one with deception, meaning you can't trust the gradient of optimisation, you're going to get stuck in some kind of local minimum. So I imagine that if you trained a reinforcement learning agent on lots of different episodes, the coverage of the map would not look very good. The objective here is fixed, and this is why we have reward shaping in reinforcement learning: because we can't trust the gradient of the objective, we have to create stepping stones manually where we can optimise on the sub-gradients of those stepping stones.

The next thing shown in the paper is a genetic algorithm using NEAT with the distance to the goal as the reward. A genetic algorithm is all about diversity; it's a population-based method, and being able to cross over diverse instances actually gives you better coverage of the map. It's still one goal, the distance to the objective, but we're introducing this concept of diversity preservation and crossover. The next thing Kenneth showed, and this is the key idea in novelty search, is: what if we do not optimise on the distance to the goal at all? Personally I would argue that using the distance is cheating anyway, because in many situations it's a partially observable problem and you don't actually know what the distance to the goal is. What we do here is still a genetic algorithm, and it still has one objective, but we design the objective in a very clever way: we use a behavioural characterisation, so the objective effectively becomes map coverage. The really clever thing about this behavioural characterisation, which is a bit like an entropy or coverage measure, is that you can monotonically increase it, and if you can monotonically increase it you can trust the gradient, which means a gradient-based method will actually give you better map coverage.
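Here is a minimal sketch of the score at the heart of novelty search, under the assumption that the behavioural characterisation is simply an agent's final (x, y) position in the maze; the real algorithm from the paper wraps this score in NEAT's evolutionary loop, which is omitted here.

# Toy sketch of the novelty score used in novelty search.
# The behavioural characterisation here is just an agent's final (x, y)
# position; real implementations pair this with NEAT and an evolving population.
import numpy as np

def novelty(behaviour, population_behaviours, archive, k=5):
    """Average distance to the k nearest neighbours in behaviour space."""
    others = np.array(population_behaviours + archive)
    dists = np.linalg.norm(others - np.array(behaviour), axis=1)
    dists.sort()
    return dists[1:k + 1].mean()   # skip the zero distance to itself

# Individuals that end up somewhere nobody has been score highest, so the
# "objective" rewards map coverage rather than distance to the goal.
rng = np.random.default_rng(0)
population = [rng.uniform(0, 10, size=2).tolist() for _ in range(20)]
archive = []
scores = [novelty(b, population, archive, k=5) for b in population]
archive.append(population[int(np.argmax(scores))])   # archive the most novel individual
print(max(scores))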
So this is not abandoning objectives; this is basically designing better objectives. The irony is that because it's a better-designed objective, you could still use a gradient-based method: all he's doing is designing an objective whose gradient is trustworthy.

The other thing I wanted to say about genetic algorithms in general is that they are convergent. You might think that by having diversity preservation, by having a kind of entropic behavioural characterisation, you would overcome this convergent behaviour, but you don't. It's still convergent, and the reason it eventually converges is that you have a fixed objective. Even this genetic algorithm with NEAT, using the behavioural entropy as the fitness function, will still converge; it's still an asymptote. This is why we need to move towards meta-learning the problems, looking at the problems as well as the solutions, because if the problem is fixed, and a problem is the same thing as a stepping stone, then you're imputing the knowledge of what the stepping stone is into the algorithm, and by definition it's going to be subject to deception. You need to be learning new problems.

Before we get to POET, Stanley has been doing some other work. You may have heard of Go-Explore, which is a kind of evolution of the behavioural characterisation idea. I would say it's still an incremental step for quality diversity, and it addresses two of the major problems they saw with quality diversity algorithms, namely detachment and derailment. In their detachment example, you might have intrinsic reward distributed throughout the environment: an algorithm might start by exploring a nearby area with intrinsic reward (the purple one), then by chance explore another, equally profitable area, and then exploration fails to rediscover the promising area it has detached from, because that region no longer offers any novelty. The idea is to restart exploration from a region that was discovered before, to get around the detachment problem. The other problem they talk about is derailment, where there's no way to mix between nodes in the phylogenetic tree, which means the agent is derailed from returning to a state it was once in. I see this as very incremental, really; in my opinion all of this is still old hat.

The really interesting thing that's happened is POET. POET was released by Uber; Stanley was working at Uber AI a few years ago, and this is where things really got interesting. In POET you have a meta-objective and you learn objectives through meta-learning. You have agents and you have environments, the environments monotonically increase in difficulty, and you swap the agents over. The environments are basically the same thing as problems, so the idea is that you're focusing on problems, not just solutions, and you're meta-learning a curriculum. There's a really good image showing a curriculum that was learned by POET, and I showed in the video last week an example that Jeff Clune talked about, where there was a really complex environment, and in order to overcome the deception of that environment the agent needed to cross over with one of the simpler environments and then come back again. So the idea is that we are learning a curriculum, but the curriculum is so weird that we wouldn't have been able to come up with it ourselves.
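To give a feel for the structure, and only that, here is an extremely simplified sketch of a POET-style outer loop; the scalar "environments" and "agents" below are stand-ins I made up for illustration, not the obstacle-course environments used in the actual paper.

# Very simplified sketch of the POET outer loop: environments and agents
# co-evolve, and agents are transferred between environments so progress in
# one niche can unlock another. Here an environment is just a scalar target
# and an agent a scalar skill; purely illustrative.
import random
random.seed(0)

def performance(agent, env):
    return -abs(agent - env)                 # closer to the target = better

def optimize(agent, env, steps=20, lr=0.1):
    for _ in range(steps):                   # local hill-climbing stands in for ES
        candidate = agent + random.uniform(-lr, lr)
        if performance(candidate, env) > performance(agent, env):
            agent = candidate
    return agent

pairs = [(0.0, 0.5)]                         # (agent, environment) pairs
for generation in range(10):
    # 1) occasionally spawn a harder child environment (the growing curriculum)
    if generation % 3 == 0:
        parent_agent, parent_env = pairs[-1]
        pairs.append((parent_agent, parent_env + random.uniform(0.5, 1.5)))
    # 2) optimise every agent inside its own environment
    pairs = [(optimize(a, e), e) for a, e in pairs]
    # 3) transfer: for each environment, adopt whichever agent does best in it
    agents = [a for a, _ in pairs]
    pairs = [(max(agents, key=lambda a: performance(a, e)), e) for _, e in pairs]

print([(round(a, 2), round(e, 2)) for a, e in pairs])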
Remember, to do things like artificial general intelligence, or to solve complex problems, we do not know what the stepping stones are; we have to learn or discover them, and that is precisely what POET is doing: it's a stepping-stone collector. The other fascinating thought I had is: why don't they use something like POET for language modelling? You could have a meta-objective, which is your perplexity, and then some kind of increasing curriculum of tasks which could be meta-learned, and that might well overcome the deception in the world of language modelling. Why don't they do that? Maybe they will; I don't know. I also said on the show that this is very similar to François Chollet's conception with the ARC challenge: the whole idea of his program-search conception is that, first of all, it's a search problem, and secondly, you are meta-learning to solve problems you don't yet know about; I think he called it developer-aware generalisation. So actually François Chollet and Kenneth Stanley are going in the same direction: we need to be thinking about learning problems we are not yet aware of, in order to solve the problems we are aware of. Okay, fantastic. I really hope that clears up some of the confusion, or maybe it's just created more, I don't know, but let's move on to Lena Voita.

Thank you very much. Hi there, and welcome to this conversation with Lena Voita. As you know, we at Machine Learning Street Talk are always extremely excited to talk to new guests, and guests from different fields, and sometimes we get a bit too excited, so this time we actually forgot to press record for the first 10 to 15 minutes of the conversation. I'm so sorry about this. Yeah, well, it's our bad as well. Fortunately we caught it in time. This could be a fun machine learning project: Tim, we'll build a taser into your chair. Do you know Michael Reeves from YouTube? You should watch Michael Reeves; he's an engineer who does a lot of stuff with tasers. So we'll build a taser into your chair, and whenever a machine learning model detects that the conversation has started, that you're doing the intro, "oh, we have a guest here", and the record button is not active on Zencastr, it just tases you.

So I'm now the sort of pre-scriptum clown who is supposed to tell you what went on during those first ten minutes. We were introducing the basic notions of how language models can fail, specifically machine translation models and, in general, seq2seq models. In a seq2seq model, let's take the example of machine translation, you usually have some input sentence in one language and some output sentence in another language. Initially you have none of the output sentence, but you build it up, usually token by token. You can do this in an autoregressive manner: you predict the first token, then you feed the input sentence plus the first token back into the system to predict the second token, then you take that and predict the third token, and so on.
This has been giving very good results; however, there is a problem with it. When you train these systems, you have the source and the target sentence readily available, so instead of feeding in the token the model itself predicted, you can feed in the true token of the target sentence, and that is called teacher forcing. Teacher forcing is very attractive, not only because you train your model on actual good data, but also because the samples become independent, meaning you can parallelise training and you don't have to do weird loops. However, at training time you have the target sentence and feed in the correct token at every step, whereas at test time you feed in the token the model itself predicted, so there's a mismatch between the training inputs and the testing inputs.

To give an example, suppose you translate the German sentence "Die Katze isst", which means "the cat eats". You predict the first token by inputting the German sentence plus nothing, and the system might predict "the". Then you input the German sentence plus the word "the", and the system predicts the second token. The correct token would be "cat", but let's say your system is confused and outputs "dog". During training you know it should be "cat", so your best option is to feed in the word "cat" again; during testing you don't have this, you output "dog", so you feed in "dog" again. You can quickly see that if your model makes some mistakes, it builds on those mistakes, and the input to the model gets further and further away from anything it has seen during training.

We also discussed different failure modes of these models. If you think about what a model like this has to do, it has to do two very different things. First, it has to translate the source sentence, which means the tokens it outputs must somehow correspond to the tokens of the source sentence, just in a different language: if there is the token "Katze" in German, there should be a "cat" somewhere in the English sentence. The second objective is to make the output sentence as grammatically correct as possible in the target language. A very easy approach would simply be to translate each token from language A to language B, but that would be a horrible translation, not because the information isn't there, but because it's grammatically incorrect. So there are these two conflicting objectives, and during testing one can overpower the other, which leads to the failure modes. If the translation-information objective wins out, you're going to see really horrible translations that nevertheless contain all the information of the source sentence; these models even tend to repeat themselves, going "cat cat cat cat", simply because that's very probable given the source sentence, and they don't really look at what they've produced so far in the output, so they don't care about grammar. The other failure mode is the opposite, when the model no longer knows how to translate the source sentence, so it says, well, I'm screwed here, I might as well give you a grammatically correct sentence in the target language. A hallucination. At least it gets some low loss that way. So it starts the output sentence and then continues completely disconnected from what it should be translating, as long as the sentence is grammatical, because that is still a low-loss output.
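As a toy illustration of that training/inference mismatch, here is a sketch in which a made-up lookup-table "model" stands in for a real translation decoder (a real NMT model would also condition on the source sentence): with teacher forcing the gold prefix is always fed back in, while at inference the model feeds back its own, possibly wrong, tokens and the errors compound.

# Toy illustration of teacher forcing vs. free-running decoding.
# The "model" is just a lookup table of next-token probabilities given the
# previous token; real NMT systems condition on the source sentence too.
import random
random.seed(1)

reference = ["the", "cat", "eats", "fish", "<eos>"]
model = {                                   # deliberately imperfect model
    "<bos>":  {"the": 0.9, "a": 0.1},
    "a":      {"cat": 0.5, "dog": 0.5},
    "the":    {"cat": 0.6, "dog": 0.4},
    "cat":    {"eats": 0.9, "sleeps": 0.1},
    "dog":    {"barks": 0.8, "eats": 0.2},
    "eats":   {"fish": 0.9, "<eos>": 0.1},
    "fish":   {"<eos>": 1.0},
    "barks":  {"<eos>": 1.0},
    "sleeps": {"<eos>": 1.0},
}

def sample(prev):
    tokens, probs = zip(*model[prev].items())
    return random.choices(tokens, weights=probs)[0]

# Training with teacher forcing: every prediction is conditioned on the *gold*
# prefix, so a wrong guess ("dog") never propagates to the next step.
teacher_forced = [sample(prev) for prev in ["<bos>"] + reference[:-1]]

# Inference: the model conditions on its *own* previous output, so one early
# mistake puts it in a prefix it never saw during training (exposure bias).
free_running, prev = [], "<bos>"
while prev != "<eos>" and len(free_running) < 10:
    prev = sample(prev)
    free_running.append(prev)

print("teacher forced:", teacher_forced)
print("free running: ", free_running)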
So if the model can't satisfy both objectives, it will prefer one over the other, and that gives rise to these different failure modes. That's what we introduced in those first few minutes; we are of course very sorry that we lost that footage, and we will be extra, extra careful and checklisty in the future, which probably means it will happen again, because I just jinxed it. Alright, I hope you have fun, see ya.

Yeah, so I ran a little competition on our Discord server. I was curious whether any of our esteemed members could understand what on earth Yannic was talking about just based on these screenshots, and we had some interesting contenders. Skycore says "no kitten stepping on the White House"; I thought that was quite intriguing, an adroit suggestion. XMaster96 says "I'm so confused right now"; well, it is quite confusing, I'll grant. Henry the Meme Guy says "has he finally overdosed on amphetamines and caffeine and lost his mind?"; well, we all know that's why he wears the glasses now, don't we. Al Beebag says "please don't let the Katze die"; he didn't, honestly, Yannic's really good about stuff like that. Anyway, the show must go on.

Hello Lena. Lena, your blog is incredible: lots of visualisations and animations, and I think it really makes it digestible for people to understand, so it's great that you've done that. Thank you. Do you know Valia Fedorova? Yes, yes. I met her at ICML, probably nearly two years ago now, but it's cool that you know her; she's really nice. Yeah.

Okay, so a quick detour: this is Lena's blog, lena-voita.github.io, and it's absolutely incredible. She has all of her publications on there, of course, but she also has this blog, and it's beautiful; she uses visualisations and graphics in a way you don't often see. It reminds me a bit of Robert Lanz, who we had on the podcast. Let's have a look at some of the articles. The most recent one is "Source and Target Contributions to NMT Predictions". She's written a paper on this as well, but her blog makes it much easier to understand. In it she's talking about what Yannic was describing before: models hallucinate, and there's a kind of continuum of influence between the source, what goes in, and the target, what comes out, because as you rely on predicted tokens in the prefix, the model has a tendency to hallucinate. She talks about this dichotomy between the source and the prefix, and she says that models suffering from exposure bias are more prone to over-relying on the target history and hallucinating; exposure bias is something that bites more and more as you accumulate prefix tokens that were previously predicted rather than teacher-forced, as Yannic was explaining. Models trained with more data rely on the source more confidently, and the training process is non-monotonic, with some distinct stages, which is quite interesting. So what influences a prediction, the source or the target? She says we might expect that some tokens are predicted based mostly on source information, the informative tokens, while others are based mainly on target history, such as determiners. But how do we know which information was actually used? Her TL;DR is that models often fail to properly use these two kinds of information, something that has also been illuminated in other papers on the topic.
For example, if you look at the attention weights in some transformer models, some heads are only paying attention to punctuation or to the EOS token, which means they're clearly not attending to anything meaningful in the input, so the model must be hallucinating, which is interesting. So how do you reason about the contribution from the source? There's a method called layer-wise relevance propagation (LRP). I didn't realise this, but it's very similar to one of the original attribution methods in computer vision; remember Grad-CAM, where you have an image going in and you want to figure out which input pixels were most responsible for a given prediction? It turns out you can do this kind of attribution in pretty much any model, and that's what Lena is advocating we do for transformers, so we can figure out which input tokens were most responsible for a particular prediction given the prefix. She points out that it's unclear how to apply a layer-wise method to something which isn't completely layered, because an encoder-decoder architecture is of course different from traditional CNN-type models, but she asserts that the total relevance is propagated through the decoder and the relevance leaked to the encoder is propagated through the encoder layers, and she demonstrates that with an animation. In her paper she describes how to extend LRP to the transformer.

Anyway, let's look at the experiments. She looked at the total contribution and the entropy of contributions. The total contribution is the contribution of the source, which is one minus the contribution of the prefix, and the entropy of the contributions tells us how focused the contributions are: whether the model is confident in its choice of relevant tokens or spreads its relevance across the entire input. She's interested in general patterns rather than micro-patterns, so she averages over the entire dataset. When she does this she notices a general pattern: as the index of the target (prefix) token increases, the model pays less and less attention to the source, so it increasingly hallucinates, and the graph demonstrates that quite nicely. When you look at the entropy of the contributions there's a really interesting curve shape: by about 10 to 15 tokens the entropy has increased, meaning the model has increasingly less idea where to point to in the source, and then, curiously, the entropy goes down again around token 20 and up again around the 23rd token.

Another way to reason about this is, rather than using the reference prefixes, the reference translations, why not use the beam-search translations? They're more regular and they surprise the model less; you're putting the model in a comfortable place, so to speak. She asserts that beam-search translations are usually simpler than references, and indeed, when she compares them, with the beam-search prefixes the model is more confident about the source, so it hallucinates less as decoding goes on. It's the same story for the entropy: with beam search the entropy is lower, which means the model is more confident about the source. So she hypothesises that the simpler prefixes are more convenient for the model.
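Here is a small sketch of those two summary quantities, computed from made-up relevance numbers; the paper derives the per-token relevances with layer-wise relevance propagation through the trained transformer, which is not reproduced here.

# Sketch of the two summary quantities discussed above, using placeholder
# relevance scores; Lena's paper obtains the real scores with LRP.
import numpy as np

source_relevance = np.array([0.30, 0.25, 0.10, 0.05])   # one value per source token
prefix_relevance = np.array([0.20, 0.10])                # one value per generated token

relevance = np.concatenate([source_relevance, prefix_relevance])
relevance = relevance / relevance.sum()                  # LRP conserves total relevance

source_contribution = relevance[: len(source_relevance)].sum()   # = 1 - prefix contribution
entropy = -(relevance * np.log(relevance)).sum()         # how spread out the relevance is

print(f"source contribution: {source_contribution:.2f}")
print(f"entropy of contributions: {entropy:.2f} nats")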
One way of visualising this is a bit like tributaries, little rivers. If you give the model something that's inside a tributary, inside something it knows about, it'll immediately run with it, and in this case a tributary means it actually understands some relationship between your source and prefix. If you give the model something outside the manifold, outside the tributary, it no longer understands what to do, so it'll just start hallucinating and predict something grammatically correct, because it can't really relate what you've given it to the source. Another thing she tries is putting in completely random prefixes, and as you might imagine the model starts hallucinating sooner, because the prefix doesn't fit any frame of reference the model has. Anyway, I really recommend you check out the article; it's beautiful. Finally, she asserts that the training process is non-monotonic, with several distinct stages: when she plots the source contribution over training, it immediately goes down, then goes up again, and then seems to converge after about 20 or 30 thousand training batches, which is really interesting; perhaps you wouldn't have expected that.

The second article on Lena's website is "Information-Theoretic Probing with MDL". What's MDL? It's the minimum description length. Everyone these days is talking about minimum description length, whether it's François Chollet in his "On the Measure of Intelligence" paper or, actually, it's quite funny, every time you meet an academic with a Russian accent they use the words "Kolmogorov complexity" like every fifth word. I had a PhD supervisor who was like, "oh yes, I was just going for a walk down the road earlier, looking at the sky and considering the Kolmogorov complexity". Anyway, it comes up everywhere; it's basically the minimum description length of anything, and it comes up a lot in compression and information theory.

Lena says that probing classifiers (probing just means measuring, so an accuracy would be a probe) often fail to adequately reflect differences between representations, and they can show different results depending on the hyperparameters; a lot of this is about accuracy not being a very stable measure. She proposes an information-theoretic probing which measures the minimum description length of the labels given the representations, and the really cool thing is that rather than just reporting the final quality of the probe, the accuracy for example, the MDL also shows how hard that quality was to achieve, and it's more stable. So how do we understand whether a model captures a specific linguistic property? What we do now is: we have data, we have representations, and we have labels, and we measure to what extent the representations capture the labels. In standard probing we just use the accuracy of a classifier trained to predict the labels from the representations as the measure of how well the representations encode the property. Looks reasonable and simple, right? Wrong.
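For reference, here is a minimal sketch of that standard probing setup, with random vectors standing in for a model's frozen token representations and random integers standing in for part-of-speech tags; with features from a real model and a real tag set, the number printed at the end is the probe accuracy that usually gets reported.

# Minimal sketch of a standard probing classifier. Random vectors stand in for
# frozen model representations and random labels for part-of-speech tags.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
representations = rng.normal(size=(2000, 768))     # e.g. one layer's token vectors
labels = rng.integers(0, 5, size=2000)             # e.g. 5 part-of-speech classes

X_train, X_test, y_train, y_test = train_test_split(
    representations, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))  # ~chance here, by construction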
She points out that while simple probes are very popular, several sanity checks have shown that differences in accuracy fail to reflect differences in the representations. Houston, we've got a problem: the accuracy of a probe does not always reflect what we want it to reflect. So she introduces an information-theoretic viewpoint. The TL;DR of the article is that regularity in the representations with respect to the labels can be exploited to compress the data. Machine learning and compression are basically the same thing. When we were talking to Sara Hooker she told us this is part of why the lottery ticket hypothesis works: you can prune 90% of the connections in a neural network and it still works, sometimes even better than with all the weights in there. So what's happening with all of those neurons? They're learning all of the challenging or low-frequency attributes, the examples sitting very close to the decision boundary or ambiguous between two classes; we're wasting most of the representational capacity. Clearly, if there is a lot of regularity in the data, we need fewer parameters to encode it.

She has quite a nice way of presenting this. We've got Alice and Bob: Alice has the data and the labels from the dataset, Bob only has the representations, and Alice wants to communicate the labels to Bob. Transmitting the raw data is a lot of work, so surely Alice has a better option, which is to compress it. The formal task is to encode the labels, knowing the representations, in an optimal way, and essentially the cross-entropy of the labels given the representations is the data code length. So learning is compression, and the amount of effort is related to the strength of the regularity in the data: if there is strong regularity, the labels can be transmitted very cheaply, and vice versa. From there you get into variational methods. By the way, when we interviewed Karl Friston the other week, I was learning all about variational methods, inference as optimisation, the KL divergence and so on for my background research, and we didn't properly get into it on the episode, so if you folks are interested in us doing an episode on that, we'd be more than happy; I know Keith would love to do a show on it. Lena has a section in the article that says "I hereby confirm that I am not afraid of scary formulas", which is wonderful: you have to explicitly say that you want to see them, but all of it should be familiar if you've studied variational methods and the KL divergence. The TL;DR is that the MDL-based probes are stable and the accuracy is not, so MDL is a much better probing measure for machine learning models, because it characterises not only the quality of the probe but how difficult that quality was to achieve from the representations.
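A simplified sketch of the online (prequential) flavour of this idea: transmit the labels block by block, each time paying the cross-entropy, in bits, that a probe trained on the previously sent blocks assigns to the next block; the sum is the description length. The random features and labels below are placeholders, so the code length comes out close to the uncompressed cost; with representations that really encode the labels it would be much smaller. (The paper also describes a variational code, which is not sketched here.)

# Simplified sketch of an online (prequential) code for MDL probing.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))              # placeholder representations
y = rng.integers(0, 5, size=2000)            # placeholder labels
n_classes = 5

block_edges = [125, 250, 500, 1000, 2000]
codelength = block_edges[0] * np.log2(n_classes)   # first block sent with a uniform code

for start, end in zip(block_edges[:-1], block_edges[1:]):
    # assumes all 5 classes appear in the first block, so predict_proba's
    # columns line up with the label values 0..4
    probe = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
    probs = probe.predict_proba(X[start:end])
    # cost of transmitting the next block under the current probe, in bits
    codelength += -np.log2(probs[np.arange(end - start), y[start:end]] + 1e-12).sum()

print(f"online code length: {codelength:.0f} bits "
      f"(uniform code: {2000 * np.log2(n_classes):.0f} bits)")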
Finally, the other article we're talking about today is "Evolution of Representations in the Transformer". She looks at the evolution of representations of individual tokens in transformers trained with different objectives: machine translation, language modelling (like GPT-3), and masked language modelling (like BERT, a denoising autoencoder), and she looks at it from the information bottleneck perspective. She shows that language models gradually forget the past when forming predictions about the future; for BERT-style models the evolution proceeds in two stages, context encoding and then token reconstruction; and for machine translation the representations get refined with context, but less processing is happening. She says that instead of measuring the quality of the representations obtained from a model on some auxiliary task, she characterises how the learning objective determines the information flow in the model. So she looks at how the representations of individual tokens evolve between the layers under different learning objectives, and she uses the information bottleneck perspective to do this, which is essentially a clever way of saying she's looking at mutual information.

What are the three main training tasks for these NLP models? The first is machine translation: given a source and a target sentence, predict the words of the target sentence one by one. Then language modelling, which you'll know from GPT-3: estimate the probability of a word given the previous words in the sentence. And masked language modelling, the BERT-style objective, which is essentially filling in the gaps: you mask out a word in the input sentence and the model tells you what that word was. So we're interested in how the representations evolve between the layers of these transformer architectures, and how that changes depending on the task. All three models start out with the same representations of the input tokens, or sub-words, together with their identity and position, but the way the information flows across the layers differs depending on the objective, and that's the main interest here. Her hypothesis is that the token representations undergo changes from layer to layer, the interactions and relationships between the tokens change, and the type of information that gets lost and acquired also changes as you progress through the model. The TL;DR on the information bottleneck view is that the evolution consists of squeezing out irrelevant information about the input while preserving the relevant information. If you click through to the information bottleneck paper, it says that we squeeze the information that X provides about Y through a bottleneck formed by a limited set of codewords, and that this constrained optimisation problem can be seen as a generalisation of rate-distortion theory.
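For reference, the objective that the information bottleneck paper optimises can be written as the following Lagrangian, where $T$ is the compressed representation of the input $X$ and the multiplier $\beta$ trades off compression against keeping the information that is relevant for predicting $Y$:

\[
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
\]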
That's getting a bit heavy going, isn't it? Right, let's come back to where we were. Now, the problem is that we only have the representations of the individual tokens, not of the entire input, but we can view every model as learning a function from input x to output y. The first approach she uses is computing the mutual information between the representation at successive layers and the input, and what she shows is absolutely fascinating: depending on the training task, the BERT-style models, the denoising autoencoders trained with masked language modelling, maintain fairly high mutual information with the input even as you go through the successive layers, whereas machine translation and language modelling progressively forget about the input. If they're forgetting about the input, then what is happening? So she then looks at the mutual information with both what goes in and what comes out, and as you can see there's a commensurate increase in the mutual information with the output as you go through the successive layers, but the pattern of that increase changes depending on the training objective: with masked language modelling it looks like a kind of sigmoidal pattern, whereas with language modelling it looks more like a quadratic curve, which is very interesting. She also talks about the tricks she uses to estimate the mutual information in this setting, because it's quite challenging and she had to use an approximation.

If we get to the rub, the rub is that masked language modelling preserves the token identity more than anything else: when you track just the token identity through the layers of the transformer, with masked language modelling it's significantly preserved. She has a great illustration of this: she gathered representations of the tokens "is", "are", "were" and "was" from lots of different sentences and visualised their t-SNE projections, with the x-axis showing different layers of the model, and what's fascinating is that the masked language model, the BERT-type model, preserves the individual token identities through the successive layers significantly more than the language model and the machine translation model. The other finding is that machine translation best preserves the token position: she took a large number of representations of the same token and, for each occurrence, looked at the positions of the top-k neighbours and evaluated the average position distance, and again in the t-SNE visualisation you can see that machine translation preserves the position of the token significantly more than language modelling and masked language modelling.

She concludes that with the language modelling objective, as you go from the bottom to the top layers, information about the past gets lost and predictions about the future get formed. For BERT-type models, the representations initially acquire information about the context around the token, partially forgetting the token identity and producing a more generalised token representation, and then the token identity gets recreated in the top layers. For machine translation, the representations get refined with context, less processing is happening, and most information about the word type does not get lost. This is absolutely fascinating. It reminds me a little of our conversation with Simon Kornblith a few weeks ago, because he had a paper all about the evolution of representations through, say, a convolutional neural network trained with different loss functions; it's still the same task essentially, just different loss functions, and he demonstrated that the representations are pretty much the same apart from the penultimate layers, the last twenty per cent or so, where there is a divergence. Of course here there's significantly more divergence, because the training tasks themselves are much more divergent. Any approach like this for reasoning about the evolution of representations and the behaviour of neural networks is, I think, absolutely fascinating, and I'm really looking forward to discussing it with Lena today.
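As a very crude stand-in for the estimator used in the paper (which is not reproduced here), one way to get a number like I(layer representation; token identity) is to discretise the continuous vectors, for example with k-means, and compute the discrete mutual information; the random vectors below are placeholders for a real layer's hidden states.

# Crude sketch of one way to estimate I(layer representation; token identity):
# discretise the continuous vectors with k-means and compute the discrete MI.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
token_ids = rng.integers(0, 50, size=5000)                              # which token each vector came from
layer_states = rng.normal(size=(5000, 32)) + token_ids[:, None] * 0.05  # weak dependence on identity

clusters = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(layer_states)
mi_nats = mutual_info_score(token_ids, clusters)
print(f"I(layer state; token identity) ~ {mi_nats:.2f} nats")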
Okay, now the most important thing Lena has on her website, other than her papers and blog, is the NLP course; it says "NLP Course | For You" in the top right-hand corner, and you folks must check this out. Sebastian Ruder actually featured it recently in his newsletter, and Sebastian Ruder is pretty much the god of the NLP world, in case you didn't know. In this course, again in characteristic style, Lena has formulated everything beautifully, with lots of graphics and really nice layout. She has the concept of lecture blogs, seminars and homeworks, research thinking, related papers, and lots else besides. On the left-hand side you can see the structure: there's an entire course on word embeddings, too much for me even to go through, with one-hot vectors, distributional semantics, count-based methods, word2vec, GloVe, research thinking and related papers. She's also got a course on text classification: a general view, features and classifiers, generative versus discriminative, classical methods like Naive Bayes and SVMs, neural networks, recurrent networks, convolutional networks, multi-label classification, practical tips, embeddings, data augmentation, analysis and interpretability, and research thinking again. There's an entire section on language modelling, which is what GPT-3 does: the general framework, text probability, n-gram language models, neural language models, generation strategies (things like top-k sampling, coherence and diversity, and sampling with temperature, all the stuff we talked about in the GPT-3 video), evaluation, practical tips, analysis and interpretability, and research thinking. And she has a course on seq2seq, attention and transfer learning. As I said, it's too much for me to go through now, but honestly, if you want a really good introductory course covering everything in natural language processing, this is the best I've seen. Please go to Lena's website: it's lena-voita.github.io/nlp_course, and of course we'll put the link in the description. It really is incredible, so please do go and check it out. I will need another cup of tea; I operate roughly 2.5 percent better on peppermint tea.

Welcome back to the Machine Learning Street Talk YouTube channel and podcast. I'm here today with my two compadres, Sayak "the neural network pruner" Paul and Yannic "Lightspeed" Kilcher. Today we have an incredible guest: Lena Voita, a PhD student at the University of Edinburgh and the University of Amsterdam. Previously she was a research scientist at Yandex Research, where she worked closely with the Yandex Translate team, and she still teaches NLP at the Yandex School of Data Analysis. Lena has created an exciting new NLP course on her website, lena-voita.github.io, which we'll link in the description, and she has one of the most well-presented blogs I've ever seen, where she discusses her research in an easily digestible manner, using lots of visualisations and animations which I think communicate her ideas in a really innovative way. So, picking up the conversation where we left off: what are hallucination and exposure bias?

Hallucination and exposure bias are both about paying too much attention to the prefix. Hallucination is when you just ignore the source, because you want to produce something that's grammatically correct based on what you've already produced.
It's not necessarily that the model made a mistake in translation; it's just that it starts ignoring the input, the source, and just continues what it was doing. At the beginning of a translation you have no information: all you have is the source, so you pay it a lot of attention, but as you translate you accumulate more and more already-translated tokens, the language model part kicks in, and the model can be like, oh, I know how this sentence is going to finish, I don't even need to consider the other language anymore. It's like people who say "I already know what you're going to ask" and then answer that question, when you were actually going to ask something different; hallucination is a bit like that. Exposure bias is more that the distribution differs between training and inference. In training you always input whatever token the gold standard has: you input the source, then the target up to token k, and you try to predict token k+1, but all those k tokens are perfect, they come from a real, good translation. At inference you put in your own tokens, and that looks a lot different, so when you make a mistake you're suddenly in a situation you've never been in before, you're not trained for it, and anything can go wrong.

So Lena, there seems to be a straightforward fix: during training, can we just also input the tokens the model itself produces, or maybe even incorporate some kind of data augmentation, so to speak?

Yeah, there are a variety of ways to fix exposure bias, and they help to some extent, but hallucinations still happen. In one part of the paper we're going to discuss, we do look at this connection between exposure bias and hallucinations: basically, what a model does when it's trained with different objectives, and whether there's a connection between exposure bias and how frequently the model hallucinates.

Lena, you seem to be doing a lot of investigations into how these models work. On your blog and in your papers you look into probes, and into how sources and targets influence the final predictions, and so on. Is that a general interest of yours? How did you go in this direction? Because a lot of people who go into NLP want to train the models that produce the language and beat the WMT benchmarks, whereas your work focuses a lot on explaining the models, which I absolutely love; I think there is not enough research in this direction.

If I have to explain it, maybe it's because from the very beginning of my PhD I was already working with the Yandex Translate team, and almost instantly I understood that almost all the stuff going on in research claiming improvements doesn't actually deliver them, if we're talking about really high-resource settings and people who are really trying to optimise things. Maybe that's why I didn't have much interest in it; it's not as if you have to report something, and the things which really work are quite different. I did manage to get something for Yandex Translate, we did get some improvements, but it was nothing of the kind the research usually describes.
Lena, you seem to do a lot of investigations into how these models work. On your blog and in your papers you look into probes, into how sources and targets influence the final predictions, and so on. Is that a general interest of yours, and how did you go in this direction? A lot of people who go into NLP want to train the models that produce the language and beat the WMT benchmarks, whereas your work focuses a lot on explaining the models, which I absolutely love — I don't think there is enough research in this direction.

If I have to explain it: from the very beginning of my PhD I was already working with the Yandex Translate team, and almost instantly I understood that most of the research claiming improvements doesn't actually deliver them, if we're talking about really high-resource settings and people who are seriously trying to optimize things. Maybe that's why I didn't have much interest in it — it's not that you have to report something; the things which really work are quite different. I did manage to get some improvements for Yandex Translate, but it was nothing of the kind research usually reports. And maybe the other reason is that I don't really feel comfortable in machine learning yet: it's very practical, very experimental, and I come from a mathematical background, where you can feel things, you can be sure — once you've proved something, no one is going to doubt it. Here it's "is it going to work, is it not going to work": it sounds reasonable but it doesn't work; it doesn't sound reasonable but it works for some reason. It's really different from what I'm used to. I'm used to certainty, to something which makes me calm, and analysis makes me calm because I get to understand, at least a little bit, what's going on. That's the main reason. But it's not what I was supposed to do — I was supposed to do context-aware machine translation. It was a very reasonable, really good plan, useful for production, because currently systems translate sentences individually, while context-aware NMT tries to use larger segments to translate the current sentence, since there can be ambiguity and you may need wider context for one sentence. It was a good research topic and I do have several papers in context-aware NMT, but it's a bit like an arranged marriage: sure, context-aware NMT is nice and I could do it, but I always had the feeling it was something chosen for me rather than my own choice, and that maybe there was something out there I'd like more. And there is.

I find this fascinating. Not only do you investigate methods to evaluate — you have these papers on evaluating representations in Transformers and on evaluating the contributions of source and target to NMT — but there's also this notion of probes, and the paper I find interesting, "Information-Theoretic Probing with MDL", where you don't just go and build a new probe: you first explain really well why the current probing methods for these NLP models don't work, and then you make a better one. Would you care to briefly explain what a probe is and why the classic ones are not really good?

Of course — and I can also explain why I'm so careful in explaining why they don't work, because I really did have to do that. When you're trying to analyze models in NLP, you can go about it in different ways and ask different kinds of questions. In computer vision it's simpler: you have an image of a cat, a classifier says "cat", and an attribution method shows you the parts of the image that contributed — that's your explanation. In NLP such methods don't work very well, and there are whole different kinds of things you can ask. I'll come to probing, but let me first give a general picture of how I see analysis in NLP. When analyzing an NLP model you can be interested, for example, in how your inductive biases work or how different model components work, because when you build in some inductive bias you have an intuition about what is supposed to happen — so you can look at attention, which is quite popular now, you can look at different attention heads, you can look at neurons, and this can tell you something about how your model works. You can also look at model predictions.
A standard thing to do there is to look at subject–verb agreement — there are lots of papers doing that: you have a language model, you look at its predictions, and you evaluate how good it is at some specific phenomenon of language. If you care about language, you can do that. And then there is probing, which is really fun, because here you don't care about predictions, you don't care about what your model is generating, you don't care about the architecture or its components — you only care about the representations. When you feed data to a network, each layer gives you some vector representation of that data, and you can ask what kinds of things those vectors encode: do they encode part-of-speech tags, do they encode dependencies, do they encode, I don't know, coreference — that kind of thing. What you do is take some labelled data — say, a dataset where you already know the part-of-speech tags — feed it to the network, get the representations, and train a classifier (a probing classifier, or probe) to predict the part-of-speech tags from the representations. Then usually you say: if the accuracy is high, these representations probably encode the thing I want them to encode; if the accuracy is not so high, probably they don't. (A small sketch of this standard setup follows below.)

The first probing paper I saw was the ACL 2017 paper "What do Neural Machine Translation Models Learn about Morphology?" by Yonatan Belinkov and colleagues, and I was so excited about it — it was the first analysis paper I'd ever read, and for me it was like, yay, great, we can actually ask these kinds of questions, we can be interested in what's going on inside the network rather than just getting a higher BLEU score. I was at Yandex, on the machine translation team, and they were all interested in higher BLEU scores, so this kind of thing was cool. But then most analysis work started doing this kind of probing. Models like ELMo and BERT appeared, and there were a lot of papers just measuring different kinds of stuff and drawing conclusions like: ELMo's layer one encodes part-of-speech tags with accuracy 96.3, BERT's layer six encodes them with accuracy 96.4, therefore BERT encodes part-of-speech labels better. And it's a whole other question why we trust probing, what kinds of conclusions we can draw from it, and whether we expect these models to be better just because they encode some linguistic phenomenon. I don't think we're going to answer all of those questions, but maybe we can talk about why the standard probe doesn't work and what we can do about it.
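For readers who haven't seen it, the standard probing setup Lena describes fits in a few lines. This is a generic sketch, not code from any of the papers mentioned; `reps` and `pos_tags` are synthetic stand-ins for frozen layer activations and gold part-of-speech labels so the script runs on its own.

```python
# A minimal probing-classifier sketch: train a simple classifier to predict
# labels from frozen representations and report its accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
reps = rng.normal(size=(2000, 768))      # stand-in for activations from, say, layer 6
pos_tags = reps[:, :17].argmax(axis=1)   # synthetic labels derived from the vectors,
                                         # so there is some signal for the probe to find

X_tr, X_te, y_tr, y_te = train_test_split(reps, pos_tags, random_state=0)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)   # the probing classifier
print("probe accuracy:", probe.score(X_te, y_te))
# High accuracy is then usually read as "the representations encode POS" --
# which is exactly the inference questioned in the discussion that follows.
```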
There does seem to be an obsession with accuracy, and it's such a reductionist number. It's the same thing traders have done with the Greek letters: they've featurized market conditions and reduced them down to single numbers like alpha, and they're throwing away so much information. We spoke to Sara Hooker from the Google Brain team the other week, and she was making a similar observation: when you compress models you can preserve headline metrics like accuracy, but it's on the long tail — you could argue it's memorization — that all of the underrepresented classes and information live, and you're effectively throwing them away.

It's a really good point: accuracy is only part of what you want. In an ideal world, if you had an infinite amount of training data for your probe, you'd get a good estimate of mutual information — whether the representations actually contain information about something. But that's not really what you want, because if you take a randomly initialized model and encode things with it, you also get very high accuracy, and very often you cannot even distinguish a trained model from a randomly initialized one by accuracy. That's the problem with accuracy, and with mutual information too: even randomly initialized models contain information about part-of-speech tags, because they see the whole sentence — the information is in there. The difference we usually care about is between a model that merely contains information about something, which you can extract if you put a lot of effort into it, and a model that encodes it by itself, so that it has already caught the information and you can just easily take it.

Another example of why standard probes don't work is the control tasks from John Hewitt and Percy Liang at EMNLP last year. They simply use random labels: instead of part-of-speech tags, you assign each word a random label sampled from the empirical distribution of tags, so if you just look at the text the label distribution stays the same, but the labels are attached to words at random — you're basically saying that "cat" is a preposition and "sat" is a noun, that kind of thing. If you train a probing classifier to predict these random labels, it does very well, and you won't see a huge difference in accuracy. So accuracy can tell you that your representations "encode" some random labels, and that's crazy: they do contain information about those labels, but they certainly don't encode them — and encoding is what we're trying to measure. (A small sketch of this control-task construction follows after this exchange.)

A similar observation was made by Samy Bengio's team in the classic paper "Understanding Deep Learning Requires Rethinking Generalization": neural networks can fit random labels quite easily, and I think you are asserting something very similar. I also buy it because Naftali Tishby's work lays this out from a hardcore information-theoretic perspective, where early in training you fit things and later in training you start to forget things; when you fit your network with random labels, there's nothing meaningful to fit and nothing to forget in the first place. So I think that statement holds perfectly fine.
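Here is a small sketch in the spirit of the control tasks just mentioned (an editor's illustration under toy assumptions, not Hewitt and Liang's code): each word type gets one fixed random tag sampled from the empirical tag distribution, so a probe that scores well on these labels is memorising word identity rather than reading linguistic structure off the representation.

```python
# Control-task style labels: per-word-type random tags drawn from the
# empirical tag distribution of the corpus.
import collections
import random

def control_labels(words, tags, seed=0):
    random.seed(seed)
    tag_counts = collections.Counter(tags)
    tag_pool, weights = zip(*tag_counts.items())
    assignment = {}                                   # one fixed random tag per word type
    for w in set(words):
        assignment[w] = random.choices(tag_pool, weights=weights, k=1)[0]
    return [assignment[w] for w in words]

words = ["the", "cat", "sat", "on", "the", "mat"]
tags  = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
print(control_labels(words, tags))   # e.g. "cat" might come out as ADP, "sat" as DET
```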
One way of conceptualizing it: if I simply input my sentence into one of these models — and these models don't shrink like a CNN, they keep the full sequence layer by layer unless they forget something — then it's probably always possible to extract whatever information you want, as you say, if you put in a lot of work. But that's not the same thing as saying the model pays special attention to part-of-speech tags. It simply says: look, I can reverse-engineer my way to these part-of-speech tags; it says nothing about whether they are important. And your paper suggests a solution for this, where you hit exactly on this notion of how hard it is to extract the information. Would you care to explain that a bit?

Maybe let's start with an example. The problem with accuracy is that you don't know whether the representations encode something or whether you just learned to extract it by putting in a lot of effort — so how can we distinguish these two things? To understand the difference, let me give an example of a model where we cannot possibly deny that it encodes something: the sentiment neuron paper from OpenAI. They trained a language model on lots of Amazon reviews and found a neuron responsible for sentiment — the unsupervised sentiment neuron. In this case, I hope no one can deny that the model encodes sentiment: we see the neuron responsible for this concept, and we can't possibly say "oh no, it's just a probing classifier extracting it" — this is a model that encodes sentiment. So what is the difference between representations from the sentiment-neuron model and representations from which you can merely predict sentiment? First, for the sentiment-neuron representations it's enough to train a very simple classifier: a linear classifier where only one weight is non-zero and everything else is zeros — and you can't do that for a randomly initialized model, for example. The other way of looking at it: to predict sentiment from the sentiment-neuron representations you need only a few examples, because it's really easy using this one neuron — ten examples and you're basically done. That's the difference between a model encoding something and something being merely extractable with effort. This was our intuition when we thought about how to explain and measure this difference, and luckily there is the minimum description length (MDL) framework. Surprisingly, the practical ways of evaluating description length transfer this intuition directly: one is based on model complexity — how complex your model has to be to extract the information — and the other on how hard it is to predict the labels using small amounts of data. (A rough sketch of the second, data-based way follows below.)
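A rough sketch of the data-based ("online", or prequential) way of measuring description length might look like this. It is a simplification for illustration, not the paper's implementation: the block fractions, the probe family, and the handling of the first block are arbitrary choices here.

```python
# Online (prequential) codelength, simplified: pay a uniform code for the first
# small block of labels, then repeatedly train the probe on everything seen so
# far and pay -log2 p for the labels of the next block.
import numpy as np
from sklearn.linear_model import LogisticRegression

def online_codelength(X, y, n_classes, fractions=(0.02, 0.05, 0.1, 0.25, 0.5, 1.0)):
    n = len(y)
    cuts = [int(f * n) for f in fractions]
    bits = cuts[0] * np.log2(n_classes)                    # first block: uniform code
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        probe = LogisticRegression(max_iter=2000).fit(X[:lo], y[:lo])
        log_p = probe.predict_log_proba(X[lo:hi])           # natural-log probabilities
        cols = np.searchsorted(probe.classes_, y[lo:hi])    # assumes all classes seen early on
        bits += -log_p[np.arange(hi - lo), cols].sum() / np.log(2)
    return bits                                             # lower = easier to extract

# Synthetic demo: labels linearly encoded in the representation vs. pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 64))
easy = (X[:, 0] > 0).astype(int)          # "encoded" labels: short code
hard = rng.integers(0, 2, size=4000)      # random labels: no structure to exploit,
                                          # so the code stays near 1 bit per label
print("easy:", online_codelength(X, easy, 2))
print("hard:", online_codelength(X, hard, 2))
```

On this toy data, both label sets can be fit by a probe, but the labels that are actually encoded in the representation cost far fewer bits, which is the intuition the interview is getting at.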
So you're saying it's not only important how well we can predict, say, sentiment from a given representation, but also how much work we need to put into it — which is difficult to quantify. Then you look at the minimum description length, but in practice all our probing classifiers are just linear classifiers of the same size, so the clever thing, I find, is to count the number of examples I need to train that classifier: if I count the number of examples needed to train it, that gives me a real number telling me how hard it is to get this stuff out of a representation. That's pretty cool.

It doesn't literally give you a number of examples; rather, you get to look at the amount of effort. With accuracy you only have the answer to the question "can we extract this information?". With description length you have two parts: one is the final-quality part, which corresponds to accuracy — it basically says whether this kind of information is in the representations at all — and the other is the amount of effort. In the codelength you can say: this part of the code tells you whether the information is present, and this part tells you how hard it is to extract; together they measure the thing we ideally want to measure.

That's pretty cool. With respect to these probes, what people want to do is look and say "this layer does this, and this layer does that", and in general it's a fascinating topic — there seem to be as many opinions as there are researchers in the field; I've read a bunch of these papers and everyone seems to say something different. You have an interesting paper on the evolution of representations in the Transformer, where you bring in a lot of similar ideas, like mutual information across the layers, and you find something very interesting with respect to masked language modeling. What do you find in that paper?

Oh yeah, that was a really fun paper — it also ended up having information theory in it. The main story: we've talked about probing, and there are lots of papers doing probing, measuring different kinds of stuff and reporting results, making hypotheses, maybe reasonable statements — but so far it hasn't been clear what process defines this behaviour. In the evolution paper, what we try to say is that there is a logical explanation for all of it: there is a general process going on in your network, it is defined by your training objective, and everything you're going to see is in accordance with this general process.

To cut a long story short — well, can I put it this way: in this paper you're characterizing how the learning objective determines the information flow in the model, right? You look at the representations of individual tokens in the Transformer and how they evolve across layers under different objectives, and you introduce the concept of the information bottleneck.

Yes. The information bottleneck is a really old method. Originally it was aimed at building a compressed representation of the input which contains as much information as possible about the output. It's laid out with two mutual-information terms: the objective is to minimize mutual information with the input while maximizing mutual information with the output, and depending on the weights of these two terms you get different trade-offs between compression and prediction — how much you forget about the input versus how well you can still predict the output.
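For reference, the information bottleneck objective she is describing is usually written as a trade-off between two mutual-information terms (standard textbook form; the notation is not from the interview). The representation Z of the input X is found by minimizing, over encoders p(z|x),

```latex
\mathcal{L}_{\mathrm{IB}}\big[p(z \mid x)\big] \;=\; I(X; Z) \;-\; \beta\, I(Z; Y)
```

The first term pushes Z to compress (forget) the input, the second pushes it to stay predictive of the output Y, and the weight \beta sets the compression–prediction trade-off Lena mentions.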
Right — and you point out that there are regularities in the data, which is one of the reasons it is compressible, and that in this process of evolution throughout the layers you're squeezing out irrelevant information while preserving relevant information. But since you spoke about that OpenAI sentiment paper: with an unsupervised pre-training objective, how does the model know that sentiment is salient and should be preserved?

I think it was preserved because it was important for the language modeling objective. They trained on Amazon reviews, which are mostly clearly positive or negative, and maybe it was easier for the model to decide the sentiment first and then generate everything else while keeping that idea of sentiment in mind.

But should there be a one-to-one mapping to something quite anthropocentric? With adversarial examples, for instance, neural networks learn features we don't even recognize or care about. Is it just a happy coincidence that it learned sentiment, which is something we recognize as a concept?

Yes and no. No, because every time you think a model is doing something reasonable and you're ready to say "oh look, the model understands these kinds of things" — it doesn't. But also yes: if something is important for your objective, the model puts effort into preserving that information, and sometimes that happens to be something we humans think is important, like sentiment or syntactic dependencies.

So, the information bottleneck is a really old theory, and only recently, in 2015, it was shown that if you look at the evolution across the layers — you have an input, an output, and a sequence of layers — you can think of it as going through these compression/prediction stages: what goes on in the network as you go from bottom to top, from input to output, is that you forget information about the input and preserve the information relevant to the output. The authors showed you can think of the layers as going through this information bottleneck procedure, so in a way your output defines what kinds of information the network forgets about the input: with the same input, depending on the output, you're going to forget different things and preserve different things. But that's a very simple story, because it's about the whole network, the whole input and the whole output. When we're dealing with language models and looking at individual tokens, it's a bit more complicated, because of how Transformers work: we have individual tokens with positions; at each layer the tokens interact with each other and exchange information, and at each step every token gets a new representation. For example, after I've talked to you guys today, I'm going to have a new representation of myself, because we communicated: I know a bit more about myself and about the world, and I can place myself better in this context. That's what these tokens do. And when we look at individual tokens, we can't just talk about the whole input and the whole output: the whole input is the sentence, but for an individual token the input is the current token and its position, and depending on the output for this particular token, its output can be very different. For a left-to-right language model, the output is the next token, so if you look at the representations at this position, they start from the current token and end up predicting the next token. You could say it's losing information about the input while acquiring information about the output, but that's not everything that's going on. Here I am, the current token; I have my goal — I need to grow up and become this next token — but there are these other guys around me who also have to grow up and become their own next tokens, and I need their help, and they also need my help to predict whatever they need to predict. This is where our story differs from the standard information bottleneck setting, because there you have the whole input, the whole output, and that's everything; here we still have an input (the current token) and an output, but we also have to preserve information that the other tokens need. I can't forget everything about myself, because the other guys need that information — and this is where it gets interesting.

If we look at the objectives: the left-to-right language model is quite simple — we start from the current token and get the next token at the end, and in the process we can forget something about the current token (something is still preserved, but if in the end I need to build a representation of some other token, I need to forget about myself a bit; it's not only about me, it's about what comes next). Masked language modeling is very different. How is the MLM model trained? It's BERT's training objective, the main part of BERT's pre-training: you start from some token, which most of the time is a mask token, some of the time a random token, and some of the time even the real token, and you have to go out into the world, communicate with the others, and figure things out: if you're a mask, who you are; if you're a random token, that you're random and need to somehow get back to your true self; and if you're the correct token, that you're in the right place and just need to keep yourself untouched and go straight to the finish line. The tricky part is that on the one hand these tokens have to go out and communicate with the others to figure this out — that's what they were trained for — but after that they have to fulfil their goal and predict the token identity at the end. In the training process, most of the time your input and your output are different tokens: on the input side you have a mask or some random token, on the output you have what actually has to be there. So training again looks like forgetting irrelevant information about the input and building information about the output — very similar to a left-to-right language model, with one token on the input and some other token on the output.
But if you take an already trained model, what you see is that you start with real tokens; they communicate, they build some contextual representation — a generalized representation of the context — and then, as you go up to the highest layers, you start losing this contextual information and building a representation of a token again. So it goes in two stages: first context encoding, where each token forgets a little bit about itself and builds a contextual representation (a representation of the things that could be in this place), and then token reconstruction, where you forget the contextual information and try to reconstruct the token. That's why, when you look at the many papers doing probing tasks — say, predicting the part of speech of the current token from the representations of different layers, for different models — you see that for a machine translation encoder the curve just goes up (you're encoding; the decoding part is elsewhere, you don't see it here); for a left-to-right language model it goes up and then down, which is reasonable because the language model has to build a representation of the next token — it no longer cares about the part of speech of the current token, it cares about the next one, and that's what we show; and for masked language modeling there is also this up-and-down pattern, but for different reasons: first you build a contextual representation, and then you forget it. You see the same thing in other papers too. For example, the BERTScore paper has a graph in the appendix showing how effective BERTScore is depending on the layer you take the representations from. A couple of words about BERTScore: the idea is to build an evaluation metric based on semantics, using representations from BERT because they encode meaning, in contrast to BLEU, which only measures n-gram overlap. Their graph of metric quality against the layer the representations come from again goes up and then down, because the top layers are not about context anymore — they are about reconstructing the token, and they forget the contextual information. It's also consistent with the "BERT rediscovers the classical NLP pipeline" results, and so on and so forth. I could talk about this forever, because basically all these papers show the same pattern, and all the results follow this general process — and it's all explained by this old concept of the information bottleneck. I think it's cool; at least for me it finally makes sense. (A short sketch of how per-layer representations are typically pulled out for this kind of layer-wise analysis follows below.)
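As a practical aside, the per-layer representations behind these layer-wise curves are typically extracted like this. This is generic HuggingFace `transformers` usage shown as an example — the model name is just a placeholder, and this is not the code behind any particular paper discussed here.

```python
# Pull per-layer hidden states out of a pretrained encoder so that a probe
# (or a BERTScore-style similarity) can be run on each layer separately.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("the cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding layer plus one tensor per Transformer
# layer, each of shape (batch, sequence_length, hidden_size).
for layer_idx, layer in enumerate(outputs.hidden_states):
    print(layer_idx, tuple(layer.shape))
```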
In one of the figures in the blog post, we see how the mutual information plateaus for a bit in the deeper layers and then starts to increase again for masked language modeling. Would you attribute that to BERT's bidirectionality? You also mentioned that a single token's representation gets contributions from the other tokens' representations.

Naturally, I think it's important that it has access to all of the context and not just the left context, as with a left-to-right language model. But it's not only about bidirectionality, it's about the training objective: a machine translation encoder is also bidirectional, because it has access to the whole context at the same time. This is about the evolution — losing information about the input and accumulating information about the output. If you want a picture: imagine these tokens being born into this world. They are born as a sentence or a text, just a token identity with a position, and they are born to become something in the end. They already have a goal, and what they do in the sentence, how they interact with each other, which information they take from each other, is all defined by this final objective. Sometimes I wish we had one — it would make things much easier: you would know what you have to do, what to take from people, how to communicate with them, and all your interactions would be defined by this final goal. Unfortunately, unlike language models, we don't have one.

So this in turn depends on the difficulty of the training objective to some extent. Maybe it would even allow us to quantify how difficult a training objective really is, based on the kind of hypothesis you've made in the paper. Really interesting.

That's a really good point, because yes, roughly speaking your target objective defines the kinds of information you lose about the input, and the more complicated the objective, the fewer things you want to lose — the more things you have to encode. So yes, the amount of information you lose about the input could perhaps be used as a measure of how complicated your training objective is.

Definitely. And of all the training objectives you experimented with, masked language modeling is clearly the winner, because of the two qualities it gives us as a byproduct: context encoding and token reconstruction; none of the other objectives do that. At this point, how influential is the dataset in all of this? You said, for example, that with the Amazon reviews there was a clear neuron for sentiment — maybe an extreme example — and presumably that's because the reviews are either good or bad, which is a very strong signal. If we only had good reviews, the model, even with the best objective, would presumably never learn to encode sentiment, because it wouldn't need to. So how important is the dataset, and what do we need to do when we collect datasets?

That's a hard question. In our analysis we controlled for the dataset precisely to exclude its influence: the same data, the same everything, the same random seeds, just different training objectives. But yes, depending on the training set, the things a model finds useful for its objective are going to be different, and how to create a training dataset is a whole different story. It's a huge problem in machine translation too; there are lots of adversarial examples, and we know that in some language understanding tasks models rely on shallow triggers rather than on semantics — sometimes on triggers that aren't really relevant to the task at all. I don't think I have a very good answer to that, because it really depends on the task.
Maybe there's a slightly bigger question in that: after your investigations, what is something that you think most of the community still has a wrong idea about — language models, translation, whatever you've looked into? Where do you feel most of the community is wrong?

Yannic, I'm just a PhD student — imagine me answering that; I wouldn't have a future in this field.

This is the place for strong opinions!

But we clearly have to be honest with ourselves about the results we see in the models and how we interpret them, because that's really important. If we're talking about the evolution paper, the main takeaway is maybe that we need to understand the general process behind these phenomena rather than just measuring a lot of stuff — because once you understand the general process, you don't need to measure everything and say "oh, the patterns are different, we don't know why, but we observe that they are". Of course they're different; they have to be; it was their destiny to behave this way, there was no other way it could have gone. And the same point holds for the description length work: it's important to understand what's going on and how to measure it, but the really hard question is how to interpret the results. Maybe the main thing we need to work on is interpretation. We now have lots of methods, we can measure lots of stuff, we have lots of test sets for probing classifiers and lots of benchmarks, but so far I don't think we know what to take out of these results — what does it mean that one model gives you 0.3 points better accuracy on some benchmark?

I don't want to go too far down this path, because it goes on forever, but we've been speaking to some good old-fashioned AI people, and they've been telling us that natural language processing is not the same thing as natural language understanding. We can get into a philosophical debate about what it means to understand, but if we ask GPT-3 something like "how many feet fit in a shoe", or anything that requires what you might call reasoning, it doesn't seem to work. Do you agree that natural language processing is glorified statistical text processing, or do you think we can actually make something that understands, whatever that means?

No — it's called processing; of course it's processing. I don't think the models understand, because they don't understand things as we do. If you work even a little bit in this field, you see what your models are actually doing: there are statistics in the data, they capture those statistics and give them back to you; they're not going to understand things the way we do. Here's an example from context-aware NMT, which is how I usually illustrate why we need context: imagine the sentence "I'm here telling you this". Without context it's not possible to understand who "I" is, where "here" is, who "you" is, or what "this" is that I'm telling you — and even with context it may not be possible to understand, just from the text, what's going on; you have to be part of it. Maybe we'll get better — there's this whole other line of work with grounding and multimodality coming into play — but I think that as long as we're dealing only with text, it's not possible to really understand things.
Even with these language models being as good as they are, if you play with them you still see that they're just capturing statistics; it's not real understanding, there's no underlying meaning behind it.

When you talk about context, I think you might also mean knowledge, or common knowledge. He was pointing out that there's a lot of missing information in our communication: we share common knowledge, so in a way we compress our communication, and then you need to perform some reasoning to fill in the missing gaps.

To that end, one could argue that our models' entire lifetimes are limited to the training data they're exposed to, whereas we as human beings are exposed to so much else on a day-to-day basis. Our common sense is, of course, a far richer capacity, because we're exposed to a whole lot of things, whereas a model only gets a lifetime of interacting with its training dataset. There are all kinds of arguments along that line.

It does raise the question of what it means to understand. These GOFAI people say there are lots of lexical ambiguities in language, but with natural language understanding you're trying to get to the thought behind the utterance, and there's no ambiguity there — we always seem to understand what we mean. But I don't know how to verbalize that. Just as an example, we've been talking to GPT-3 quite a lot — we managed to get access to it — and it's really difficult to ascertain, if I'm evaluating it on something like pronoun disambiguation, whether it has actually got the right answer, because it just gives me a bunch of tokens and it might contradict itself.

It's too smart for you! It was too smart for all of us — it's just tricking us: "I'm just a stupid statistical model", and meanwhile it's manipulating you to take over the world.

It also raises the point that people try to make: even though GPT-3 and these other language models are only trained on their training data, the emerging thing is something like understanding. But that's hotly debated, and probably no one has a real answer.

It's even hard to define what we mean by "meaning", let alone evaluate it somehow in a model.

Exactly. So, switching topics a little bit: you have an NLP course.

I do, yes. Well, it's an NLP course, and it's not finished yet. It's something I was working on during my isolation this summer while I was waiting to get a UK visa — it's a long story. I had left Yandex and was supposed to come to Edinburgh in the spring; I had everything planned, my suitcase packed, and then corona happened, and I got stuck for six months in Moscow for no reason, with no plans and no apartment there. That's when I created this NLP course. Guys, if you're struggling with lockdown and no communication — this is what came out of me.

Maybe at some point you can share what came out of you!

But this is what I did. It was my attempt to come up with something useful when I didn't know what teaching was going to be like in the autumn.
I teach an NLP course at the Yandex School of Data Analysis, and I was planning to move to Edinburgh in the spring, come back to Moscow to teach, and then return — and then all this corona happened, and I wasn't even sure we were going to have the course or any teaching at all. That's what got me thinking about it. And there were other questions about teaching that I had thought about a lot in previous years, because I had the feeling that the standard format isn't suitable for everyone; it doesn't have the individual approach I would love to have for my students. In a lecture you have to tell something to everyone at once. Some people come just to pass the course — that's reality, because I don't think NLP is the one thing everyone should be interested in; we all have our own interests. For some people the lecture is too much, for others too little; some just want to know what it's about, others really want to do research afterwards and are ready to put in the effort, read papers, all of that. That's the lecture part. The other question is: what's the best way for students to recap material? Clicking through a video or flipping through slides isn't user-friendly. And the other thing: usually you give a lecture and a list of links — "also read these papers" — and when you're starting out it's nearly impossible to read all of them, to understand them properly, to get the idea of where the field is going, because papers by themselves are not user-friendly. As we said when we talked about meaning and common sense: when you're coming into the field and trying to read a research paper, to understand it you still need some common knowledge of the field — at least enough to understand which parts are there just because they have to be (related work, introduction), what the main idea is, and why it's cool. When you're starting out, it's hard to get that main idea quickly; it's impossible for a young student to just skim a paper, get the idea, move on, and build a feeling for the field.

So my NLP course is my version of how I'd like this to be done. First of all, it's a suitable format for students to recap material and look things up, because it's like a lecture in a blog post with lots of visualizations, and they can play with things — for example, for embeddings there's a t-SNE projection and you can basically walk through the embedding space and see what's going on. I put a lot of effort into making the lectures interactive, so each student can go through the slides by themselves, at their own pace, because in a live lecture it's always too slow for someone and too fast for someone else — it's never right for everyone. In this format they can do it as they like: take all the time they need to look at the figures and the interactive parts, at their own speed. That's the lectures. The other thing I have is related papers, but it's not just a list of links: it's little summaries with illustrations, and what I try to do is give students the opportunity to quickly get a general idea of what else is going on in an area. For example, there's a lecture on word embeddings with the standard stuff — the statistical, count-based ones, the neural ones, and so on — and after that: okay, there can be, for example, gender bias — let's see how it can be mitigated; we can look at semantic shift, at multilingual settings, at theoretical explanations of what's going on, at different ways to evaluate embeddings, and so on. So you have a lot of papers, each with a two- or three-sentence summary; it takes about ten minutes to go over all of them and get an idea of what else is out there, instead of spending a week reading them. If there's something you like, you can click to expand it and read the fuller version of the things you care about. And after that, once you've got the main idea — which is nearly impossible to get from the paper right away, but possible given the context of the lecture — if you're interested enough, you can go and read the original paper. I'm not against that, don't get me wrong; but once you have the main idea you're equipped to read the paper and to frame everything you read within that general understanding. That said, related papers isn't even my favourite part, and judging by feedback it's not the students' favourite either. By the way, I was at EMNLP last week and, surprisingly, a lot of students from different countries contacted me saying, "Lena, we've taken your course" — which was great — and they said what they liked most were the research thinking exercises, and that's my favourite part as well.

Yes — my personal favourite part of your course is that you encourage students to ask and formulate research questions, which is super cool to me. I'm a big MOOC fan, by the way, and just as a fun fact for our viewers, I got to know about your course from Sebastian Ruder's newsletter. I jumped straight to it, and when I discovered that section on research questions I found it really cool; it's one of my personal favourites of your course as well. So what are these exercises?

When I came to Yandex — it may sound crazy, but I was a grown-up research scientist with grown-up responsibilities; when I left I was a senior research scientist — one of the things I had to do was supervise interns and students, and we did write some papers together there. The main thing about working with students this way is that it's really hard, even for a super cool student who has learned all the lecture material: there is still a gap between being given information and having to do something yourself — something novel.
Even just thinking about ways to extend what you've learned is hard. Someone comes to you and says: here are word embeddings, this is how they work, these are their properties — now do something with it. And it's like: what am I supposed to do? How am I supposed to extend this? What could possible ideas even look like? That's a really scary point for people — that feeling that you have no idea what to do or how to do it. Even when you start reading papers, you read one and you don't see it straight away (well, if you do, you're super cool and you're going to do great in this field); usually what happens is you read a paper and think, "oh god, this is so cool, some cool people did some great stuff — how did they come up with this? how am I supposed to come up with this?" That's the point where you can get scared away. My purpose here was to give the feeling that it's not just static information you're given and that's it: it's enough to be reasonable and to try reasonable things, and they work; if they don't, you try other reasonable things, and at some point it's going to be okay.

So what is a research thinking exercise? You have a lecture, and then, based on something you already know, I ask things like: we saw this and that — do you think it's good, or what kinds of problems could there be? Or: imagine this example — what do you think about it? The student has to think, at least for a minute, and then they can click a button to expand the answer: what I think about it, or what other papers thought about it. After that I can ask: okay, now that we understand the problem, how can we fix it — for example, based on this idea? So you have some guidance; in a way, you have someone asking you questions. Since I didn't know whether I'd be teaching this course in person, my idea was to do everything possible to replace myself and put something out there instead of me: me asking questions, guiding students along some path of thinking. And studies show that when you put effort into getting a piece of information, you're more likely to learn it than when you're simply given it. Even if you just read the question, think about it for a minute, and then look at the answer, it's very different from me just telling you. And if you're a really good student — if you're a student listening to this, please do it this way — just read the question and sit with it for a day or two, maybe a week; even if you're not thinking about it all the time, something is going on in your brain, and you'll notice it.

Well, when you think about something you're creating a kind of framework in your brain where you have concepts and you can hang things off them. Sometimes when I prepare for Street Talk episodes I have to understand something in order to explain it, and once I do understand it, even if I forget it, I can get it back very quickly, because I understood it at some point in the past — and that understanding arose from a need, because I had to prepare it for the ML Street Talk episode. The need gave birth to that kind of effectiveness.

To some extent, yes — it's not just that someone told you about it; you put in your own effort, and that's what counts. And if you're willing to spend at least some time on it, you'll see, first of all, that most of the time it's enough: just being reasonable, based on one lecture of the common material, already gets you quite a lot. You'll think, "okay, maybe I'll do it this way", and you click and see: here's the paper that did exactly the same thing. In the first lecture I also have examples — for instance, starting from a 1995 paper on count-based embeddings and asking how to make them better, all the way to the modern ones — and you see that the process of thinking, the research process, is the same. You have some understanding of the current state of things, you start by just being reasonable, and you already get some results. Of course, in the end you can end up very far from where you started, and you can get a paper which looks really cool without even realizing how that happened — at least that's how it happens to me. I wanted to help people get this feeling that it's okay, it's not that hard; it's okay that you read a paper and don't understand how they got there — this is how you can get to know the process, not just take the knowledge but put in your own effort to expand it. I think — I hope — it's going to be useful. It's not finished yet; there will be more papers and more exercises.

I was going to ask: what tools do you use? Do you use cloud providers? Are you a MATLAB hacker, a Python person? Sketch out your tool stack — what's a day in the life of Lena hacking on all of this?

A day in the life after I already knew HTML and CSS, or before? Because when it started it was: okay, I have this huge idea and I don't know how to do it — what even is HTML, how do I do all this cool stuff to make it look good? So I spent about a month in June just trying to understand how HTML works and how CSS works, because I had a personal page, but it was from some template where I had just inserted my name. The course was a different thing: I had to design the whole thing, how it would all look, and for me that's important — it has to give me the feeling I have inside, and if it doesn't, it's not right. So, coming back to the course: it's HTML, CSS, a little bit of JavaScript for simple things like pressing a button and making something happen; and for the pictures — you'd be surprised — it's PowerPoint: PowerPoint, print screen and Paint.

It's amazing what you can do in PowerPoint. And for your experiments, when you write your papers?

For experiments it's Python, and it's TensorFlow — not because I particularly love it, but because we had to use it at Yandex: at Yandex I used the internal framework, which is built on TensorFlow, so I use TensorFlow. I'm not very happy about it, and now it's a strange situation: I'm not happy with TensorFlow, but I don't have much experience with PyTorch, so I'm stuck. And to be honest, I don't really like coding — it's my bottleneck. I like writing papers, I like thinking, I like drawing fancy pictures, but I don't like coding.
So you're more of a scientist than an engineer?

Yeah. My background is in mathematics, and I chose it because you don't have to memorize anything. What I liked about mathematics — at least how it worked when I started — is that you don't have to know anything: you can just derive everything you want. With history, and even with physics, you have to learn things, learn formulas, learn facts; with mathematics you don't. Maybe I'm just lazy, I don't know — you need to know nothing, you can derive everything, and it's so beautiful. You come to the first lecture, you have the definition of the real numbers, you have these axioms, and that's it: you don't need anything after that, you can get everything out of them.

We'll quote you on that, you know. There we go — we'll just put this clip in: "In mathematics you don't need to know anything, you don't need to do anything", by Elena. I think that's going to be the intro bit: "you don't need to do anything".

But that was the way it was for me. And now I have to run experiments, because without an experiment I don't know whether something works or not. You don't need that in mathematics: you just prove things, you're sure, everyone is sure, everyone is happy, you're done.

And Python is the complete diametric opposite of that, because with statically typed languages you at least have some idea, when it compiles, whether it will work, whereas with Python you're stabbing in the dark, aren't you?

True. At some point I used C++ — that was a funny story. When I first interviewed at Yandex for that production position, they warned me there would be coding interviews, and I was a mathematician: I'd never really coded — I'd written some code at university, but it was done and forgotten — so for the interviews I read the first several chapters of Stroustrup, just to understand how C++ works. The algorithmic side was fine, because it's basically solving mathematical tasks and understanding the complexity, and that's what they interviewed: a C++ coding interview, an algorithms interview, statistics and probability theory, and machine learning, for that position. Later, when I needed something from the production teams, the guy who had interviewed me was there, and we were like, "guys, we're data scientists, we need you to code this and that", and they said, "you have Lena, she's so good at C++, just give it to her — why are you coming to us?" — and all this when I had never seen real production C++ code in my life before that.

That's the problem with interviews!

Writing code on paper is my meditation — I can do that. But if you give me a computer and ask me to actually open some framework, open something in C++ and write, I won't be able to do it.

Amazing. Lena Voita, it's been an absolute honor, and thank you so much for joining us today.

Thank you, it's been fun.

Anyway, I really hope you've enjoyed the episode today — as always, we've had so much fun making it. Remember to like, comment and subscribe; we love reading your comments, and we'll see you back next week.
Info
Channel: Machine Learning Street Talk
Views: 6,634
Rating: 4.964602 out of 5
Id: Q0kN_ZHHDQY
Length: 118min 21sec (7101 seconds)
Published: Sat Jan 23 2021