Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Captions
The attention mechanism is just a way to route information between parts of the sequence, and that's the genius behind a transformer: layer by layer it routes information to the correct place in the sequence so that it understands the entire thing at once.

Do you remember when GPT-2 was released, or rather wasn't released, because it was "too dangerous"? It had about 1.5 billion parameters, it was a transformer language model, and it was trained on a corpus of about 40 gigabytes of highly curated text. Fast forward one year and Microsoft has just released pretty much a new version of the same model with 17 billion parameters - 18 billion, it doesn't matter, it's a lot. In the olden days we had these static, brittle knowledge graphs. They were handcrafted and incredibly onerous to create, but if we wanted to do something like question answering it was the only way we could do it: if we wanted to ask "what year did we win the Second World War?", we couldn't capture that without some kind of ontology of information. These transformer language models that we have now, you can pre-train them on a large corpus of text - say Wikipedia - using a self-supervision task and then fine-tune them on a downstream task, and the incredible thing is that they just accrue information, they accrue knowledge about the world. You can ask "how old is Obama?" and it will know the answer. It's incredible.

The current revolution in neural representations for natural language originated with Mikolov's word2vec, the skip-gram model, in 2013, and it was such a simple architecture. It used a contrastive loss, which we spoke about last week, and it just asked the simple question: does this word belong in the context of this other word? Just using a fully connected architecture and a negative sampling strategy with a contrastive loss, you could project words into an embedding space that reflected their linguistic context. Okay, fine, but the problem is the words were atomic; the embeddings didn't really take the structure into account, and language has this incredible structure - words can modify each other from a long distance, and those relationships can be multi-scale and complex. So the first evolution along the road was language models like ELMo, which started to use bidirectional LSTMs to model the sequential dependencies between adjacent tokens, and this was starting to capture some of the meaning encoded in language. Then ULMFiT came along, and this is when things really started to get interesting, because ULMFiT, by Jeremy Howard and Sebastian Ruder, was the first model that essentially did transfer learning. You could build a language model that was just predicting the next word using an LSTM, but then, exactly as happens in computer vision, you can do transfer learning: you take off the last few layers, put a classifier on top, and get it to do something completely different. So pre-train a language model on Wikipedia, take the last layer off, turn it into a classifier, and do sentiment analysis. You've got that transfer of knowledge from the self-supervised pre-training task, which essentially means we've got more training data, and the language model already knows loads and loads of concepts which will be useful for our downstream tasks.
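To make the skip-gram idea concrete, here is a minimal NumPy sketch of the negative-sampling objective described above: does this word belong in the context of this other word? The vocabulary size, embedding dimension, learning rate and word indices are all illustrative, not taken from the original word2vec setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: vocabulary size and embedding dimension (made-up numbers).
V, D = 1000, 64
W_in = rng.normal(scale=0.1, size=(V, D))   # "input" (center word) embeddings
W_out = rng.normal(scale=0.1, size=(V, D))  # "output" (context word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_step(center, context, k=5, lr=0.05):
    """One SGD step on the negative-sampling loss: pull the observed
    (center, context) pair together, push k randomly sampled words apart."""
    v_c = W_in[center].copy()
    negatives = rng.integers(0, V, size=k)

    # Positive pair: loss term -log sigmoid(v_c . u_o)
    u_o = W_out[context].copy()
    g_pos = sigmoid(v_c @ u_o) - 1.0
    grad_c = g_pos * u_o
    W_out[context] -= lr * g_pos * v_c

    # Negative pairs: loss terms -log sigmoid(-v_c . u_n)
    for n in negatives:
        u_n = W_out[n].copy()
        g_neg = sigmoid(v_c @ u_n)
        grad_c += g_neg * u_n
        W_out[n] -= lr * g_neg * v_c

    W_in[center] -= lr * grad_c

# e.g. word 12 appeared in the context of word 7 somewhere in the corpus:
skipgram_step(center=12, context=7)
```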
Now, 2017 was an interesting year. This is when a chap called Vaswani came along from Google with a paper called "Attention Is All You Need", and this introduced transformers. Transformers are an entirely new architecture in neural network design: we have densely connected networks, we have recurrent neural networks, we have convolutional neural networks, and transformers are an entirely new paradigm, an entirely new type of network architecture. RNNs and CNNs had been the flavour of the day for building language models, and they have some interesting properties: they could support variable-length sequences, CNNs could easily be parallelized, and they had linear or logarithmic interaction lengths between tokens. The main issue was that CNNs only modelled local dependencies between tokens; long-term dependencies required many layers and were lossy. Even GANs are now using self-attention, due to its ability to model long-range, multi-scale, complex dependencies, allowing us to capture more symmetries and patterns in the data within the input size limit. Mind you, myth 6 in the "Seven Myths of ML Research" post in early 2019 tried to dispel this idea that transformers were more parameter-efficient, pointing to lightweight and dynamic convolutions as an alternative with similar results on machine translation tasks. Transformers have their own architectural challenges: they can only work within a bounded-length input, their cost scales nonlinearly (quadratically) with sequence length, and they are memory-intensive as hell.

See, people used to process sequences of text using recurrent neural networks, but that had a problem: if you have a long sentence such as "Dave shared a beer with his friend", then to understand what "his friend" means we need to know that Dave was at the beginning of the sentence. A recurrent neural network needs to store the fact that Dave was there until it processes the end of the sentence, and that can be very difficult to learn. A transformer can do this in one step; it can go from anywhere to anywhere using an attention mechanism.

Fast forward to November 2018, and a chap called Jacob Devlin released a paper called "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". This was a game changer, not only because there's been a tsunami of NLP papers recently which modify or modulate certain aspects of it, but even more so because many of the titles of those papers pun on the word BERT, and I think that's testament, if nothing else, to the impact this paper has had. It was a bidirectional transformer encoder with 12 layers in the base version, its pre-training combined a masked language modelling loss with next-sentence prediction, and it was then fine-tuned on downstream tasks. The larger version of the model had about 340 million parameters, which was quite large at the time. Because of the bidirectional context it did particularly well on certain downstream tasks, so it was state of the art on question answering. Since BERT there's been a tsunami of papers modulating some of the key findings in BERT; these papers even mention BERT or pun on the name in the title, so we had things like RoBERTa and ALBERT and DistilBERT and BERT on STILTs. But the problem, for the average consumer like ourselves, is that what we really want is to understand the relative contribution of all of these different papers and to be able to categorize where their performance gains come from.
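For reference, the "anywhere to anywhere" routing described above is scaled dot-product self-attention. Here is a minimal single-head NumPy sketch; all sizes are illustrative, and the multi-head and output projections are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model). Every position builds a query, compares it
    against every position's key, and pulls in a weighted mix of values --
    one step of 'anywhere to anywhere' information routing."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) affinities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # routed information per position

# Illustrative sizes only.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 32, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)   # shape (8, 16)
```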
Is it from introducing a new self-supervised pre-training objective? From increasing the dataset size, from having a more in-domain data composition, or from increasing the size of the model itself? That's why this paper is so important: it takes apart all these different factors that can lead to performance gains and isolates them in a large-scale study of transfer learning in the pre-training and fine-tuning pipeline for natural language processing.

So today we're talking about "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" by Raffel and his colleagues from Google. This paper is basically a huge ablation study: it asks what happens if we change all of these different variables, and how those changes affect the performance of the language model across a suite of tasks. Now, one of the cool innovations in this paper is that they created a text-to-text architecture, and what that means is they didn't need to change anything even though they were testing against many different types of downstream tasks. You might think this is a form of multitask learning, and it kind of is, but it isn't: they still fine-tune on each of the downstream tasks, although they did do a version where they trained uniformly across all of the tasks, and after some tweaking I was surprised to see that the performance wasn't actually that bad. What I mean by text-to-text is that in the input text you actually define what the task is. So you could say "translate this from English to German:" and then "hello, my name is Tim", and the language model will pick up which task you want to do and give you the result as text as well. And incredibly there's no leakage between the tasks; having this text-to-text architecture doesn't seem to degrade the performance in any way. Remember to like, comment and subscribe, and we will see you next week.

Okay, so here we are, welcome back to Machine Learning Street Talk. Today we are going to be talking about a really, really cool paper that came out of Google: Exploring the Limits of Transfer... in fact, I'm just going to share my screen... okay. So welcome back to the Machine Learning Street Talk YouTube channel. I'm here with my friends Yannic and Connor and we're going to talk about a really interesting paper today that came out of Google. I'm just going to share my screen to give you guys a bit of a flavour. It's called "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", and it has four primary authors: it's Colin Raffel et al. from Google. I think the main thing Raffel wanted to demonstrate here is that since the BERT paper in particular there's been an onslaught, a barrage of different papers with different variations: changing the pre-training objective, changing the amount of time we train, changing the size of the dataset, and so on. What these guys have really tried to do is make sense of all of that, categorize the different types of tweaks that all these papers have introduced, and do a huge ablation study to try to understand where most of the benefit is coming from. Now, if we scroll down to the takeaway section, I think this is quite interesting. The first thing is that this paper introduces a new type of language model, the so-called T5 model, and it's a text-to-text model, so I'll just read this bit out here: "our text-to-text framework provides a simple way to
train a single model on a wide variety of text tasks using the same loss function and decoding procedure. We showed how this approach can be successfully applied to generative tasks like abstractive summarization, classification tasks like natural language inference, and even regression tasks like STS-B. In spite of its simplicity, we found that the text-to-text framework obtained comparable performance to task-specific architectures and ultimately produced state-of-the-art results when combined with scale." So what this means is there are lots of different tasks in language processing, like translation and question answering, and normally you have a different input architecture for each, but now they've homogenized that: they have a unified architecture, so you can have one model that does many, many different types of downstream tasks.

Yeah, one thing that I think is so interesting about it is with respect to transfer learning. When you're a developer trying to take this model for your application, it makes it so much easier to figure out how to fit your input and output into this kind of model for the sake of the pre-training and fine-tuning pipeline. So I'm curious what you think about the text-to-text input-output for a specific application, compared to, say, BERT or these other setups where you have to be more careful about how you define your input and output, and where the generality of the pre-trained representation is decreased by having to be more specific about it. For that reason I think transfer learning is the key benefit of the text-to-text framework: it makes transfer learning easier.

Yeah, and they say as well that it depends on how you do the pre-training, even in this text-to-text framework. Maybe, since surely some people haven't read the paper: in this framework, if you want to do, say, a translation from English to French, you would not train a model for that specifically; you would simply say "translate English to French:" and then put the text after it, all of it as text. The instruction itself is also text. Then they define various pre-training tasks where the model is basically supposed to fill in parts of this text. I think the performance really depends on how you do this pre-training, and that's one part they show in the paper, but what I love about it is the fact that everything is text now. It is so cool.

Yeah, I can also see it in the multimodal setting, like the image-BERT models trying to combine images and text. Having this representation where everything is text is super interesting to me, because it means you don't have to be so specific about how you define even the pre-training. Even just going from BERT, which has this masked language modelling objective that fits nicely into sentiment classification because you do that next-sentence prediction throughout pre-training, I think here you don't have to be as careful about the pre-training and how it fits into the downstream task. Yeah, though, I mean, we've gone through the takeaways, and a couple of the takeaways go like "we can do this and this and this, and then we combine it with scale, and then it becomes state of the art again". So it is unclear to me how much the scale influences this and how much the actual objective influences it.
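To make the text-to-text idea concrete, here is an illustrative sketch of how different tasks collapse into plain (input, target) string pairs. The prefixes follow the style of the examples in the paper and in the discussion above, but treat the exact strings and labels as illustrative rather than the released preprocessing code.

```python
# How different tasks collapse into plain (input text, target text) pairs
# in the text-to-text framework. Prefixes follow the style used in the
# paper; the concrete examples are illustrative.
examples = [
    # translation
    ("translate English to German: Hello, my name is Tim.",
     "Hallo, mein Name ist Tim."),
    # acceptability classification (label emitted as text)
    ("cola sentence: The course is jumping well.",
     "not acceptable"),
    # STS-B semantic similarity: even the regression target is emitted as text
    ("stsb sentence1: A man is playing a guitar. sentence2: A man plays guitar.",
     "4.8"),
    # summarization
    ("summarize: state authorities dispatched emergency crews on tuesday to ...",
     "emergency crews were sent out on tuesday ..."),
]

for source, target in examples:
    # One model, one maximum-likelihood objective, one decoding procedure
    # for all of these -- the task is identified purely by the input text.
    print(f"INPUT : {source}\nTARGET: {target}\n")
```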
There's a lot of uncertainty with respect to this whole transformer family. For example, this next-sentence prediction task: in the BERT paper it was specifically motivated, but then in subsequent papers some people have claimed it isn't necessary, that you can just do masked language modelling and not worry about next-sentence prediction. So to me it is still somewhat unclear where the magic is in these things. I don't know how you see that.

Yeah, I mean, I definitely think the ELECTRA objective with the generator is an interesting way of doing this. I hadn't read SpanBERT before reading this paper, and just that little modification they make did seem to show more benefit. So, is there a lot of opportunity left for exploration in the self-supervised objectives? It seems pretty well explored to me: they've even tried the shuffling, and XLNet's scrambling of sentences; it seems like they've pretty much tried it all with the pre-training objective. But it was definitely interesting to see all these little things combined together and compared apples to apples in one framework. I don't think anything else has done that at this level.

One thing that interests me, because I must admit that every time I look at transformers there's always a small part of me that worries I don't completely grasp what's going on: Vaswani's original paper on transformers was an encoder-decoder, and then BERT apparently is an encoder, and the GPT variants and, you know, the NVIDIA one... what was the large NVIDIA one? Megatron. Megatron, that's a good name. And Microsoft of course have just brought out Project Turing, which is about a 17 billion parameter model, and those ones are causal, or autoregressive, language models, so those are only decoders. So can you guys put a bit of detail on the difference between them?

Yeah. In an encoder-decoder framework you do what people originally did with LSTMs: you have an input and you encode it fully into a hidden representation, and then you decode from that hidden representation. You can have things like attention going across, but basically you have two separate modules: one is the encoder and one is the decoder. With BERT, it is, as you said, just the encoder: you put in a sequence and you get out a sequence of exactly the same length. With pure BERT you can't really do something like produce language - technically people do it, but it's kind of shaky how to do that - but what you can do is, for example, classify each location, do span prediction, and so on. But if you modify this BERT architecture so that the encoder is wider, such that you can put the input and part of the output in as the input, and then you do this kind of masked attention where a token in the output can only ever look at its predecessors, then all of a sudden you can do language generation or sequence-to-sequence tasks with it. So I think the big thing here is replacing the two-module setup with the funky masked attention - causally masked attention - such that you can do this decoding.
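A small NumPy sketch of the causal masking being described: the same attention computation, with and without the mask that stops a position from looking at its successors. Sizes are illustrative.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """scores: (seq_len, seq_len) raw attention logits.
    With causal=True, position i may only attend to positions <= i,
    which is what lets a single stack of layers be used autoregressively
    (GPT-style) instead of as a separate encoder and decoder."""
    if causal:
        seq_len = scores.shape[0]
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)   # block attention to the future
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 5))
print(attention_weights(logits, causal=False))  # encoder-style: every row uses all 5 positions
print(attention_weights(logits, causal=True))   # decoder-style: row i is ~zero after column i
```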
Well, aside from translation models, when do you think it's interesting to encode only the inputs and then decode only from the encoder's representation? I guess maybe summarization: you could take the original document, encode it, and only decode from what the encoder produces. But the way I've been thinking about decoder-only models is that it doesn't make much sense to have explicit encoders and decoders for most of these tasks, beyond summarization I guess.

Yeah, and I want to say that even the original transformer in "Attention Is All You Need" was autoregressive, in the sense that it had attention over what the decoder had produced; it was just that there were two separate modules for that, and now there aren't - it's all basically one set of parameters. So I think the main benefit might come just from the fact that there's parameter sharing between all of these things; it generally seems to help to make your matrices larger but share them across the modules.

Yeah, what do you think about parameter sharing? It sounds like such an odd thing to me. Do they both get gradient updates? Is it like when you have a quantized codebook, where you take all of the updates that came from every parameter that shares that weight, or is it just copying? I mean, parameter sharing definitely seems like a way to save memory with respect to storing the model parameters, but it seems odd to share the encoder and decoder parameters; I don't quite get the idea behind that. It's strange, isn't it, because it's kind of similar to the way Siamese networks work, but rather than happening in parallel it's happening in series. So why would doing the same thing many times be beneficial, and why could you not shortcut it? I suppose a neural network is like a computer program: the idea is that you are successively making transformations that move data from one manifold to another and then another, but why would you need to do that the same way in many intermediate steps? Yeah, it doesn't make sense - it's the same exact transformation applied again - but I guess it's got to have something to do with the fact that it's such a complex series of transformations that if you copy and paste it eight layers deep it will still be making a significant difference. That parameter-sharing idea definitely seems like just an efficiency thing. I don't know if I've read anything that does a big study on parameter sharing, other than neural architecture search.

Well, what I meant originally when I said parameter sharing was more like: why should I have 10 weights in my encoder and 10 weights in my decoder when I can have 20 weights in a module that does both at the same time? That seems to give more benefit. With respect to sharing across layers, the way you can make sense of it is if you think of these layers as computing subroutines of a program - common subroutines like aggregating features in a certain way, or sorting them, or something like that - and a subroutine in a program can be called from anywhere, and it's kind of the same thing. But that is speculation.
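As a toy illustration of the "layers as reusable subroutines" intuition above, here is what cross-depth parameter sharing amounts to in a sketch; a plain ReLU layer stands in for a transformer block, and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# One block's parameters...
W = rng.normal(scale=0.1, size=(d, d))
b = np.zeros(d)

def block(x):
    # ...acting like a reusable "subroutine" (ReLU layer standing in
    # for a full transformer block).
    return np.maximum(0.0, x @ W + b)

x = rng.normal(size=(4, d))
h = x
for _ in range(6):        # apply the *same* parameters at 6 depths:
    h = block(h)          # 6 layers of compute, 1 layer's worth of weights

# During training, gradients from every depth accumulate into the single
# shared W and b -- that is all "weight tying" across layers means here.
```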
Yeah, I like this concept that there's over-parameterization in the models as well, because several of the architectures that have come out recently - in particular I'm looking at DistilBERT - have a kind of teacher-and-student concept. It's all about knowledge distillation, and the idea is that you train a normal-size model, but then you come up with a student model which is 40% smaller and 60% faster at inference, and you just get the student to clone the teacher. You train them against each other, and the student model retains about 97% of the performance. It's a little bit like the lottery ticket hypothesis, where we ask the question: why do we need all of this over-parameterization in the training phase?

Yeah, so I think this idea has been in production in many places; I think it's common in industry that you train big models and then you push them down because you want to serve them. The idea itself is not super new, but it's pretty cool to see it applied to transformers, because of course they are very large and demanding, and it is still, let's say, a mystery why any of this works. What is surprising is that we get this small model with such high performance, but if we just train the small model from scratch on the training data, we can never get that performance. We need to train the big model on the same training data and then distil it into the small model in order to get the small model to be that good. So there is a weird interaction there: something about the over-parameterization of these large models makes the optimization procedure happy to accept the training data into the model, but once it's in the model, it seems like you can somehow transfer the features it learned by distilling it like this. I think this is still one of the weirder aspects of all of this, the fact that distillation works.

I think you did a video on the lottery ticket hypothesis - just a plug for that. Yeah, and you said it perfectly: in a way it's quite depressing that this is the way it works. It's a bit like you have a block of clay and you're chipping away at it with SGD, because most of these network architectures are so over-parameterized that they actually contain n-choose-p sub-networks. There's this combinatorial explosion of different sub-networks, and, at least probabilistically, at least one of those sub-networks is likely to provide a good inductive prior for the problem - but it's a horrible way to go about it. How does that pruning scale to transformers? Does it involve taking out attention heads? How do you find a lottery ticket in a transformer? Well, one comment we can make is that Raffel, in this paper, tried over-parameterization as one of the tweaks that we'll come to later: he increased the model size, and his comment was that he only did it in the feed-forward section; he didn't touch the attention. That's really interesting, because with a transformer it's not obvious how you would prune it: attention is like an operation on a set, and you can't just kill off the attention in an obvious way like you could with a feed-forward layer.
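For concreteness, here is a minimal sketch of the kind of teacher-student distillation loss described above: soft targets from the teacher mixed with the hard labels. The temperature, mixing weight and toy logits are illustrative choices, not DistilBERT's exact recipe.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix of (a) cross-entropy against the hard labels and
    (b) cross-entropy against the teacher's temperature-softened
    distribution -- the 'student clones the teacher' signal."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student_T).sum(axis=-1).mean()

    log_p_student = np.log(softmax(student_logits) + 1e-12)
    hard = -log_p_student[np.arange(len(labels)), labels].mean()

    # The T**2 factor keeps the soft-target gradients on a comparable scale.
    return alpha * hard + (1 - alpha) * (T ** 2) * soft

# Illustrative batch of 3 examples over 4 classes.
rng = np.random.default_rng(0)
student, teacher = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
labels = np.array([0, 2, 1])
print(distillation_loss(student, teacher, labels))
```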
Yeah, I think distillation is more of a testimony to the power of label smoothing and label regularization. Even with naive label smoothing - say 90 percent probability on the correct class and uniform probability everywhere else - I don't think distillation is really transferring the features from the teacher model; it's just a more clever way of doing this label regularization. Meta Pseudo Labels is a recent paper that also shows this: a more clever way of dynamically adjusting the labels seems to be a really good way to train these neural networks.

Yeah, that makes sense. Though, is it known how distillation works across architectures? Because what I can imagine is that by doing this - by outputting not the labels but the probabilities over the labels - you implicitly transfer the features, because the student has the same architecture: it's a transformer and you're a transformer, so by outputting the exact probabilities you're getting on each class, you sort of implicitly transfer the features, because the same architecture should learn the same features given the same outputs. Is it known how distillation works across architectures, say if I distil this into an LSTM? Oh, I know there's a paper that does that, that takes the transformer's labels and puts them into an LSTM. But on the feature transfer, I think it's more like transferring a sense of embedding distances, so it's almost a metric-learning thing: it's showing you, hey, this thing is kind of like a dog but it's very far from a ship, whereas this other cat is closer to a ship. I guess it's more a transfer of the distances between the examples in the dataset than of the features, but it's definitely a super complex thing to analyse. That's a smart way of looking at it - could there maybe be a distillation procedure where you use this notion of a distance metric and do some sort of contrastive learning, where you say these two things should be closer together than this other thing? Some combination like that seems like it should be possible. Yeah, it kind of reminds me of how neural style transfer uses the intermediate features. Definitely a tricky thing, but the idea of distillation as transferring distances is interesting to think about.

Okay, well, let's carry on reading a little bit of the paper then. For the second part, we started to talk a little bit about the encoder-decoder architecture, and I didn't completely understand that, so let's read this section. There's a real contribution around architectures, and it says: "While some work on transfer learning for natural language processing has considered architectural variants of the Transformer, we found the original encoder-decoder form worked best in our text-to-text framework. Though an encoder-decoder model uses twice as many parameters as an encoder-only (for example BERT) or decoder-only (so, for example, GPT) language model, it has a similar computational cost. We also showed that sharing the parameters in the encoder and decoder did not result in a substantial performance drop while halving the total parameter count."
So, just to confirm: the original Vaswani model was an encoder-decoder model? Or is it referring, as Yannic was saying earlier, to the original recurrent seq2seq-type models? I think the original transformer was an encoder-decoder; I just made the connection because the original seq2seq frameworks also always talked about encoders and decoders. Yeah, and it makes intuitive sense, because the RNNs were learning a kind of sequential representation, and quite clearly there was an encoder phase, where you went from a sequence into this flat internal representation, and then a decoder phase. But transformers are a little bit different, aren't they? They don't flatten it into a representation in the same way; the data still seems to remain in the original time domain. Yeah, though the differences get washed out once you introduce attention into LSTMs and things like that. I think it's mainly about how you would program the thing, rather than actual ideological or mathematical differences.

Well, what do you think about the position embeddings? When I first read the "Attention Is All You Need" paper, it was such a bizarre-looking way of doing positional embeddings, and I think now they're doing these relative embeddings, like in Transformer-XL. I guess that's how they give the model a sense of time, but it's a very weird way. I think you just know that they tried it without and it failed hard, and then they were like, ah crap - because a transformer by itself has no notion of position in the sequence. It doesn't know where the information is coming from; it can just query for a kind of information, it can address content by its content, but it cannot ask where it's from. And I guess they just realized that the information that something is at the beginning, or at the end, or two away from this other thing, matters. It seems weird to have to put it in, but that's how you know you need it: the original paper wouldn't have put it in if it hadn't been needed. But I find the way of doing the positional encoding using these sine waves quite smart, because what you're doing is overlaying sine waves: one is very slow, the next one is faster, and by indexing a point in there you can measure distance at each of the scales. The large sine wave gives you a large-scale distance, and it's an encoding of your position where the model doesn't need to do arithmetic: it can ask "what is my close neighbourhood?" and go to the fastest sine wave, and it can ask "what is around me in a large sense?" and go to the slow sine wave.

Yeah, so I guess the way I think about it is that it's giving the model a ruler. Because, as Connor was saying, Transformer-XL introduced recurrence into the model, and then of course you need a relative positional encoding for all of the chunks. But just coming back to transformers for a sec: in contrast to a recurrent neural network, where you're learning a sequential representation so that it has some concept of time built in - I suppose that's actually more similar to the way reinforcement learning algorithms work, where everything about the past is compressed into "now".
But transformers have this wonderful capability: essentially you can attend from any token to any other token, so you can learn these long-range representations. Does it then use those positional encodings as a kind of ruler to measure how far away those things are? Yeah, I would guess that like 95% of why these things help has a much simpler explanation: it's just "oh, this thing is at the start". That's just a hard attention to the start of the sentence; if the start of the sentence says "what" or "who", that is an extreme bias towards understanding what the sentence means. So I think a lot of what these positional encodings give you is very simple, but then yes, some of it has to do with which words are my neighbours, which words are far away, which words are after me or before me, because that is very important.

Yeah, one thing this brings to mind is that I'm really interested in the concept of symmetries. Of course a CNN can capture a translational symmetry, but can a transformer capture the same symmetry? Because I don't like the notion that this positional embedding might hard-code the pattern such that a particular pattern of text has to start at the beginning of the sentence, or at the beginning of the input. Wouldn't it be good if it could learn the pattern in some sense that is, let's say, translationally invariant, or even invariant across scales - a pattern that happens at different scales of language? Well, in some sense it already can, if it just pays attention to the correct magnitude of the sine wave. If you think of learning a phrase like "My Little Pony": at one point on the sine wave it is here, and then here, and then in another part of the sentence, because the sine wave repeats - I don't have good words for it - you realize you have the same local pattern of the sine wave in both locations, and thereby you can learn: if two words after "my" follows "Pony", then I assign this meaning. And at a higher level, if you go up the sine wave magnitudes, you can learn something like: if there is a relative clause after a person's name, then I assign the following meaning. So you can learn these big patterns at larger scales up in the layers. I think this encoding with the overlapping waves is very cool exactly for that; it is beautiful.

Just to riff on that for a minute: I don't know whether they did an ablation study where they took away the positional encodings, but if you think about it, the model has no other concept of the relationships between token positions, so it can only be using the positional encoding. And it raises all sorts of interesting questions in my mind, because does that mean it's only looking at the relative distance between the tokens, and not actually looking at the context of the immediate words surrounding the token it's attending to? Well, who knows. I guess if you introduce these relative encodings you have very direct access to your immediate context. But first of all, on the ablation: I can almost guarantee you it does not work without the positional encodings, because that's why they're there from the beginning.
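The overlaid sine waves being discussed are the sinusoidal positional encodings from "Attention Is All You Need"; a short NumPy sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    Low dimensions oscillate fast ('who is my immediate neighbour'),
    high dimensions oscillate slowly ('which part of the sequence am I in')."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=128, d_model=64)  # added to the token embeddings
```

As noted next, T5 itself moves away from these fixed sinusoids to learned relative-position biases added to the attention logits.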
I think in T5 they actually use a learnable scalar encoding, so they get away from the sinusoidal encoding. Okay, that's cool. Though I just don't quite get the sinusoid, because it seems to say that position 8 is more similar to position 15 than to some other position - you know what I mean? It sort of relates positions arbitrarily. In a way, yes. How I see it: in my bachelor's there used to be a group that did 3D face modelling, and they had a machine that would 3D-scan your face. How it worked was quite cool: it projected patterns onto your skin - first just black here and white here, then black-white-black-white, then black-white-black-white-black-white-black-white - so at the end every little pixel of your face had its unique encoding of black and white across these different scales, like a binary pattern through time, and each pixel was unique in that. But it's not like neighbouring pixels always share the biggest part of their code. So I think of it like this: you could capture the two cheeks, because they are equal in their bit encodings except for the very first bit, which is black on one side and white on the other, but other than that there is a symmetry in the higher bits. You're right that, in a way, it will say these two positions are more similar than those two, but if you do this across the layers, where the lower layers can say "my neighbours are important" and the higher layers can say "my far-aways are important", that might do the trick. Cool.

Okay, well, let's read a little more of the take-home messages. The next one is talking about unsupervised objectives. Of course, in these self-supervised language models the general modus operandi is that there's a pre-training task, which might be next-sentence prediction or a masked language model or something else, and then you do the fine-tuning later. It says: "Overall, we found that most denoising objectives, which train the model to reconstruct randomly corrupted text, performed similarly in the text-to-text setup. As a result, we suggest using objectives that produce short target sequences so that unsupervised pre-training is more computationally efficient." Yeah, I thought that was really cool too: instead of reconstructing the entire original text, you only reconstruct what's been masked out, and I think that probably speeds up the training enormously - just emitting "for inviting ... last", I bet that doubles the speed of doing this. Yeah, they're saying that because the output space is smaller, it's more efficient to train. I find it fascinating, by the way, that that would even work: you give it an input of "Thank you ... me to your party ... week" and the target is "for inviting ... last", and somehow it manages to do that. That blows my mind. Yeah, me too, especially because there are so many different ways to fill that out. I can't believe that one works.
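Here is a rough sketch of this kind of span-corruption denoising, in the spirit of the "Thank you ... me to your party ... week" example: drop contiguous spans, replace each with a sentinel, and make the target only the dropped spans plus their sentinels. The span-sampling details here are simplified relative to the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def span_corrupt(tokens, corrupt_rate=0.15, span_len=3):
    """Return (input_tokens, target_tokens): corrupted spans are replaced by
    sentinels <X>, <Y>, ... and the target lists only the dropped spans --
    so the decoder's output is much shorter than the full original text."""
    n = len(tokens)
    inputs, targets, i, sentinel = [], [], 0, 0
    while i < n:
        # Simplified, probabilistic span choice (at most 2 spans here).
        if sentinel < 2 and rng.random() < corrupt_rate:
            span = tokens[i:i + span_len]
            mark = f"<{'XYZ'[sentinel]}>"
            inputs.append(mark)
            targets.extend([mark] + span)
            i += len(span)
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<{'XYZ'[sentinel]}>")   # final sentinel terminates the target
    return inputs, targets

text = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(text)
print(" ".join(inp))   # e.g. Thank you <X> me to your party <Y> week .
print(" ".join(tgt))   # e.g. <X> for inviting <Y> last <Z>
```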
It's fascinating to look at the evolution of the different pre-training objectives on these self-supervised tasks. In the BERT-style setup you had a random set of corrupted tokens and the target was the original text, so that's quite a large output space. Then there's shuffling the words, which I don't think got particularly good results, and then there's this concept of masking the tokens - is that the same as BERT? What's the difference? I think the key difference is the span. And in BERT they also had two things going on: they swapped some of the input tokens randomly - that's why there's "apple" in there, "to your party apple week" - and the BERT model was asked to also correct for that, whereas here, in the "just mask tokens" variant, it's not. I think that one is MASS - there's a paper called MASS, and that was the objective in that paper. Do you think switching out "last" for "apple" is useful? I guess it looks like no.

Well, as they say, most of these objectives that people come up with tend to do about equally well, but it seems important that you do something like this, because if you just take a BERT model and train it directly on question answering, or sentiment classification - maybe sentiment works, everything works on sentiment - on question answering it just doesn't work. So even though they say most of these things perform similarly, it still seems important that you do them, that you do this kind of pre-training at all. But I was also pretty impressed by the range of tasks they came up with; I think that's cool. Yeah, and it actually looks like a pretty big gain to go from 83 to 84.4 on these GLUE tasks, so I don't know if the conclusion that they're all similar is quite right - pretty similar, but that still seems like a decent boost.

Also, we've been thinking about the ELECTRA pre-training objective, where they have the generator, as a different way of doing this, and wondering: if they had put that into the pipeline, what do you predict? Would it be better, or would it also perform about the same as the others? Well, you'll have to refresh me on the ELECTRA objective - the adversarial version. Go ahead. Oh yeah, so you have a generator that corrupts the text, so it's a discriminator objective rather than predicting the masked token: you have a low-parameter generator model filling in the masked tokens, and then the larger model just sees what this lower-parameter generator thought it should fill the tokens in with. It's a discriminator objective rather than reproducing the sequence - it gives you 1/0 predictions of real or fake, kind of like the GAN framework. Yeah, I don't know what to think of that. I know that GANs for text are still pretty far away, so something like this might work; it just seems like a GAN-style objective is a much lower-information task than actually producing language. But you have to weigh that against the fact that for any given gap in a sentence you could fill in 20 different words and they would all make perfect sense, and here you are with your training corpus telling your transformer: no, this one is the one that I want. That's where something like a discriminator, which just assigns a high probability to all those 20 words, can come in and really benefit you.
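For reference, a toy sketch of the ELECTRA-style replaced-token-detection setup just described, with a stand-in "generator" that simply samples random tokens (the real thing uses a small masked language model as the generator).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]

def fake_generator_fill(token):
    # Stand-in for a small masked-LM generator: propose some plausible token.
    return rng.choice(vocab)

def make_discriminator_example(tokens, mask_rate=0.3):
    """Replace a subset of tokens with generator samples; the discriminator
    target is 1 where the token was replaced (and differs from the original),
    0 where it is the original -- a binary decision per position instead of
    a prediction over the whole vocabulary."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            new = fake_generator_fill(tok)
            corrupted.append(new)
            labels.append(int(new != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

x, y = make_discriminator_example("the cat sat on the mat".split())
print(x)  # mix of original and generator-substituted tokens
print(y)  # 1 marks a substituted position
```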
So I have no guesses on whether ultimately that's going to be good or bad, and I mostly do not believe any numbers in any papers, because ultimately what happens is people come up with their new idea and then throw more resources at it, or a bigger corpus, or a slightly different trick here; it is just so hard to compare these things, even when they go to the extreme pains of doing all these ablations.

What do you think about the difference when they use competitions to evaluate approaches? Do you think maybe that does a better job of decoupling these easy things to bolt on, because in a Kaggle competition everyone is going to be augmenting the data like crazy, ensembling the models, and using as much compute as they've got? I think maybe competitions are a better way to tease apart algorithms, but I guess the benchmarks don't seem to do that. Yeah, this is one of the key things Chollet talks about in his "On the Measure of Intelligence" paper: he says we've centred on this notion that benchmarking and competitions are the best way to get the best result, and Kaggle can sometimes produce perverse results, because we become so fixated on the benchmark that the winning solution doesn't even generalize to the real-world version of the problem.

Yeah, definitely. There are, let's say, two modes here: one is that a benchmark or competition is the way we evaluate things, and the other is that someone reads all the papers, implements them carefully, uses the same hyperparameters and the same number of CPU cycles for all the models, and then reports the numbers - and both are kind of crap, respectively. I'm actually on the side of the competition, because if you say "whatever gets the highest number, whatever you have to do for it, do it", then maybe you can't interpret where exactly your gains come from, but at least you get an absolute number for how far you can push this with everything you've got. Whereas if someone just implements all the things - we've seen this with that kind of famous "Are GANs Created Equal?" paper: they implemented all these GANs, and then the authors of the Wasserstein GAN came along and said, we clearly stated in our paper that our algorithm only works with, I don't remember, RMSprop and not with Adam, or something like that, and the comparison paper had used a hyperparameter grid that the original paper had said wouldn't work. So it seems to me that all options are bad, but if push comes to shove, I am more on the side of the competitions. Ultimately, though, as a practitioner the best thing you can do is look at all of these different papers and try to crystallize things out: not really believe anyone individually, but if you see the same tricks over and over again, the same architecture types over and over again, that's probably what's going to work.

It does make me think of some of the comments Chollet makes in his paper. He used the example that in the 1980s people like Marvin Minsky had quite narrow definitions of AI, and they seemed to centre on how well you can perform a task.
But that's a different metric for humans and for algorithms, because if an algorithm can beat the world chess champion, that doesn't seem particularly impressive, but if a human can, that's very impressive - and that's because the human is learning how to become that highly skilled. It reminds me of when we talked about Uber's POET, the Paired Open-Ended Trailblazer, because that was all about a system that learns and produces its own curriculum, and I think these competitions and benchmarks are so narrowly focused that they stop us from making progress on what artificial intelligence should be, which is being able to learn things dynamically. Well, is the goal even to make progress in artificial intelligence? Because for, I don't know, a company like Google, if their translation model is better than everyone else's, they get money. But the other thing, with regard to intelligence: I think if you give a human the same amount of raw compute time - neuron firings versus GPU cycles - and the same amount of memory, where it can write everything down, I would not be surprised if it becomes much better at chess. The impression that humans are so good partly comes from the limited time, and we get a feeling for just how hard it is to do that. But yeah, I have a feeling we'll be doing an "On the Measure of Intelligence" episode very soon.

What do you think about DAWNBench, the idea of limiting the computational resources for a task in a competition? Because it seems to me like you can't really have fair computation for everyone and then still do something interesting with the models - interesting at the scale it would reach if it were unbounded. I definitely think the best effort for something like machine translation wouldn't even be to play with the algorithm - look at the Bitter Lesson - but to figure out how we can train a hundred-billion-parameter transformer, or a trillion-parameter one. Whatever the difference is between the TPU and the GPU and how that enables these larger-scale matrix multiplies, it seems like the chip engineering is a more promising direction for this kind of thing. Yeah, and the data collection. I had an interesting conversation with a friend once, where I was basically arguing that you should get equal resources and things like that, and his point, which I found very good, was that there are effects that only come out at scale: there are effects these models can only capture because they have a billion parameters and 50 terabytes of training data. And it is true that the only objective, real number you have is: with unbounded computation, with unbounded everything, how well can you perform? It's an entirely different task to say you get this many CPU cycles - if I want to engineer something for a mobile phone I care about that, but it's a narrower question, let's say.

That's quite a nice segue into the datasets section of the paper, because some of the papers in this area, like RoBERTa, discovered that BERT simply wasn't trained for long enough - it was under-trained - and, as I think we'll find out in this T5 paper, training for longer is one of the most effective strategies.
But just to read the datasets section of the take-homes: "We introduced the Colossal Clean Crawled Corpus (C4)" - they released this, by the way - "which comprises heuristically cleaned text from the Common Crawl web dump. When comparing C4 to datasets that use additional filtering, we found that training on in-domain unlabeled data could boost performance in a few downstream tasks. However, constraining to a single domain typically results in a smaller dataset. We separately showed that performance can degrade when an unlabeled dataset is small enough that it is repeated many times over the course of pre-training. This motivates the use of a large and diverse dataset like C4 for generic language understanding tasks." So they started off with this C4, which is a huge dataset: it's 745 gigabytes - no, sorry, that's after filtering; it was 6.1 terabytes, and after they filtered it, it got down to 745 gigabytes. By filtering they removed all of the JavaScript, I think they only kept lines that ended in a period, and I think they did some deduplication - it's surprising how much they could filter away.

Yeah, another thing I thought was kind of interesting: even though the current pipeline assumes there's tons of unlabeled text available, and natural language processing seems to be the field that benefits most from this unsupervised pre-training then fine-tuning, they were able to get 750 gigabytes of text from the entire internet, while Wikipedia plus the books corpus is something like 20 gigabytes. Another interesting thing in the paper, if you go to Table 12, is that the supervised multi-task pre-training is only about three points behind on GLUE. So what will scale? People are coming out with supervised natural language processing datasets every day, so I would actually predict that supervised multi-task pre-training is going to be the more scalable thing in the future, because you can't get that much more unsupervised text - you can get a lot now, but as people build out and label these datasets, the paradigm might shift, and images might become the domain we associate with the unsupervised pre-training pipeline, because you can drive a self-driving car around, get more images, and store them in a dataset. It seems like maybe the supervised tasks are what will scale with language.

That's interesting, because I think the take-home from this paper was that it is still better to do unsupervised pre-training, but it does seem to indicate that the magic of unsupervised pre-training is simply that you get more data: you train for longer with more data, and that allows you to learn better. But they did some interesting experiments with this: they tried not doing the unsupervised pre-training at all, and they did something similar to Microsoft's MT-DNN, which is where you do multi-task learning - you train uniformly over all of the tasks and then separately fine-tune for each of the downstream tasks - and that seems to significantly improve the performance.
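Going back to the C4 filtering mentioned above, here is a sketch of what that kind of heuristic cleaning looks like in practice; the markers, thresholds and rules below are illustrative, not the released C4 pipeline.

```python
BAD_MARKERS = ("javascript", "lorem ipsum", "{", "}")   # illustrative blocklist

def clean_page(text, seen_lines):
    """Keep only 'sentence-like' lines and drop duplicates across pages."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith((".", "!", "?", '"')):      # keep lines ending in terminal punctuation
            continue
        if len(line.split()) < 4:                         # drop very short fragments
            continue
        if any(m in line.lower() for m in BAD_MARKERS):   # drop code/boilerplate-looking lines
            continue
        if line in seen_lines:                            # crude deduplication
            continue
        seen_lines.add(line)
        kept.append(line)
    return "\n".join(kept)

seen = set()
page = ("Enable JavaScript to view this page.\n"
        "The cat sat on the mat.\n"
        "Buy now!\n"
        "The cat sat on the mat.")
print(clean_page(page, seen))   # -> "The cat sat on the mat."
```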
Yeah, I definitely think that multi-stage approach helps: do your unsupervised masked language modelling, and then, even if you just do the same self-supervised task, do it on in-domain data. I think that's another big takeaway from the ablation of the datasets: if you do your pre-training on the Wikipedia data, you'll do better on SQuAD and worse on CoLA, because it's never seen an unacceptable sentence in Wikipedia, so it has no sense of that. Yeah, maybe a large part of the effect is just going to come from whether you have ever seen text like what you're going to get in these tasks. Whereas with images it's going to be natural images all the way through; it's not like a visual question answering task suddenly has a big arrow on the image - "what is this thing called?" - that you've never seen during training. So I think that is the real distinction here: the fact that they now frame everything as text, and there are structures, and unacceptable sentences and so on, that the model has never seen during training, so it can't make sense of them. This in-domain pre-training might be doing exactly that: providing the model with a sense of the type of data it's going to get during the actual fine-tuning stage. And they kind of shoot themselves in the foot a bit by saying that, because it puts into question how much you can gain from all this huge pre-training, if ultimately the explanation is: well, you still need to see the kind of data your task is on, and if you do, then you're not that much worse off.

Yeah, how interested are you in meta-learning, where the definition is something like domain adaptation, or rapid adaptation to a new kind of environment? Things like MAML, where they try to find a good set of initial parameters such that when the domain shifts, the model quickly adapts - going from, say, Wikipedia to, I don't know, the books corpus or something. What do you think about meta-learning and in-domain data? Do you think it's possible to really have an unsupervised representation that works for every domain, or do you think you necessarily need some other mechanism that does the domain adaptation? I think one way of describing it is that there are more similarities between language corpora than there are dissimilarities, so by doing that unsupervised training you are capturing as much useful, generalizable information as possible. But, as you say, there are many words which mean different things in different domains: in the financial world the word "default" is a very, very bad word. With regard to meta-learning, I always find these meta-learning papers a bit awkward, because we just don't have that many tasks, even in NLP, and even though people are coming out with new datasets all the time, there's a limited number of tasks and limited ways in which things are diverse. To then truly claim that you now have an initialization that works on general language seems a bit of an over-claim in all of these papers. I've seen the same thing with the Visual Task Adaptation Benchmark by Google, where again you want to pre-train on something like ImageNet, but then you only have a thousand samples for each of the other tasks.
With regard to meta-learning, I always find these meta-learning papers a bit difficult, because we just don't have that many tasks, even in NLP. Even though people are coming out with new datasets all the time, there seems to be a limited number of tasks and limited ways in which things are diverse, and then to claim you now have an initialization that works on general language seems a bit of an over-claim in all of these papers. I've seen the same thing with the Visual Task Adaptation Benchmark by Google, where again you want to pre-train on something like ImageNet but then you only have a thousand samples for each of these other tasks. These tasks just seem all too similar, such that you can't really claim you've found something general, so I'm a bit skeptical in that way. But I think it's a very cool field, and if it's possible, then especially for practitioners it will be game-changing.

I have to say, with respect to the dataset, it's called "Colossal", and I think the community might be running out of words that describe large things. There's Megatron, and then there's XLNet, and Transformer-XL, and so on, so I wonder how high they can push this word game. And they released it, which means that someone at OpenAI must have been having a heart attack, because how can you release something so dangerous? This is like ten times as dangerous as the OpenAI corpus would have been.

Exactly. One comment I did want to make is that it's actually not that big. Their dataset which they call RealNews-like, or WebText-like — I think that's their equivalent of what they did with GPT-2 — they filtered the corpus down to high-quality articles, and I think that meant things that were linked from Reddit and had been upvoted at least three times. That is only about thirty-five gigabytes, but the resulting score is significantly higher than with the filtered C4. You have to ask: how the heck is that? All the internet text is only about twenty times larger than that. Isn't that interesting?
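A rough sketch of the WebText-style filtering heuristic just described — keep only documents whose source URL was submitted to Reddit with a score of at least 3. The record fields and the karma table here are made up; the actual pipeline from the paper isn't shown in the discussion:

```python
# Rough sketch of a WebText-style quality filter: keep only documents whose
# source URL was submitted to Reddit with a score of at least 3.
# The record fields ("url", "text") and the karma table are invented here.

MIN_KARMA = 3

def build_karma_index(reddit_submissions):
    """Map each submitted URL to its best observed score."""
    index = {}
    for sub in reddit_submissions:
        url, score = sub["url"], sub["score"]
        index[url] = max(score, index.get(url, 0))
    return index

def filter_corpus(documents, karma_index, min_karma=MIN_KARMA):
    """Yield only documents that clear the karma threshold."""
    for doc in documents:
        if karma_index.get(doc["url"], 0) >= min_karma:
            yield doc

# Toy usage
submissions = [{"url": "https://example.com/a", "score": 5},
               {"url": "https://example.com/b", "score": 1}]
docs = [{"url": "https://example.com/a", "text": "kept"},
        {"url": "https://example.com/b", "text": "dropped"}]
kept = list(filter_corpus(docs, build_karma_index(submissions)))
print([d["text"] for d in kept])  # ['kept']
```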
I think this is also where the privacy thing gets funny. If you were to go through everybody's Facebook messages, or their Twitter direct messages and conversations like that, I feel like you could get so much more good text data, but you can't, because of the privacy thing. And how enticing...

Well, I think these things are especially worrying, because it seems like these transformers are getting so large that in essence they're just remembering the training data and interpolating between it. I've always had the impression that to really make the next step in something like NLP you need a grounding of this knowledge. I always think of Chollet's example with the ARC challenge, where you have a picture and the pixels are black, and then there are three blue pixels and two orange pixels going away, and it's clearly something bouncing off a wall, but we can only understand it because we live in the physical world where things bounce off things. I think the next step needs to happen in terms of grounding, and these models don't have it. So all they can really do, and all they do, is interpolate patterns they see in the training data, and these models are getting so large that I'm pretty sure they just remember the entire training corpus and figure out how to mix and match: here's a noun, and the sentence starts with these words, so I'm going to finish it with this example from the training data, or with a mix of these two. I'm pretty sure they don't learn very sophisticated things; they just learn how to mix and match the training data they hold in their mind. And that's where the privacy issue comes in again, so it's even worse — there's this xkcd comic where they start writing an email and it autocompletes with "the next meeting of the rebellion leaders".

Yeah. So my question on the idea of language grounding is: what do you ground it in? You could have these text-based adventure games, where you get a text prompt like "you're in a dungeon, you see this thing on the table, what do you want to do?", and you navigate the world with text in that way. Chatbots are kind of grounded in a similar way, where you get a reaction to what you're saying. Then there are things like point-goal navigation, or the Allen Institute's RoboTHOR, where you're navigating a visual world based on language cues. There's also something called Read to Fight Monsters, where you get a description of the game you're about to play in language and then you go and play the game. So in order to ground the language you have to have the context around it, I guess, and it's unclear how you would define that.

Well, these text adventure games and all of these things are a nice idea, but if you want to make this work with machine learning you need a lot of them. You need a lot of text adventures, and you're only going to get that by someone writing a procedural generator of the text, and that means you're not dealing with language, you're dealing with a procedurally generated grammar, where you basically just always look for the same patterns of words that the simulator's author put in. What I mean by grounding is more like: if I say "there is a car around the corner", I live in the physical world, I know cars are objects, they're on roads, "corner" means a house corner in this particular case, so there's a road here and a road there and a house in between, and I can't see the car, but it's around the corner.

I love Chollet's concept, though, that intelligence is very much linked to the world and the environment that we live in, and that that's his major argument for why there can be no explosion of intelligence. But I really like some of the stuff you were saying before as well, about how neural networks are basically just memorizing machines, and you could argue that we are the same. Philosophically speaking, if we could remember everything, this conversation now would just be us regurgitating things that we've picked up from random places, and we wouldn't really be thinking fluidly and intelligently. Of course that's not true. What Chollet would say is that we have this problem in deep learning that there's minimal generalization, so in order to learn anything we have to do memorization. There was this wonderful paper a few years ago which said that deep learning algorithms don't extrapolate, they just interpolate, and the argument it gave was: let's try to learn the scalar identity function, f(x) = x, the simplest function in the whole world to learn, with a real neural network. I think they trained it between -5 and 5 and then tested it at inference time outside the training range, with a whole bunch of different activation functions, and you know what? None of them could extrapolate. Isn't that tragic?
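The identity-function experiment they are recalling is easy to reproduce in spirit: train a small network on f(x) = x over [-5, 5], then query it far outside that range. A sketch — the activation, architecture and hyperparameters are arbitrary choices here, and the original paper's exact setup may differ:

```python
import torch
import torch.nn as nn

# Sketch of the "neural nets don't extrapolate" experiment: learn f(x) = x on
# [-5, 5], then query far outside the training range. The paper being recalled
# tried a whole range of activation functions; tanh is used here.

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                      nn.Linear(64, 64), nn.Tanh(),
                      nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x_train = torch.linspace(-5, 5, 512).unsqueeze(1)
for _ in range(3000):
    loss = nn.functional.mse_loss(model(x_train), x_train)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    for x in [0.0, 4.0, 10.0, 50.0]:
        y = model(torch.tensor([[x]])).item()
        print(f"f({x:5.1f}) ~ {y:8.2f}")
# Typically near-perfect inside [-5, 5]; far outside, the tanh units saturate
# and the prediction flattens out instead of following f(x) = x.
```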
Yeah, but how important is that, I guess? It's such a hard concept to define. Back to this idea of language grounding and "go look at the car that's around the corner": Google just released this Street View images dataset where you're going through Street View and navigating based on language cues. With enough data, you can probably interpolate between vision and language and navigate very well with language. So is extrapolating necessary, and can we even define it? Because then you say, okay, I can do vision and language, but it's still not intelligence, so what else do you want to add? Audio?

Right, at some point it comes down to embodied cognition, which is basically what I'm talking about: you live in the world, and we only experience the world via our senses, so there are our senses and then our brain that has been optimized around them, but it's all just input signals through which we interact with the world. One can argue that if you give a neural network enough of these different modalities, it might be able to make the same connections.

That is true, yes, absolutely, and in a superhuman way, because, like Tim said, it can recall everything it has ever experienced, instantly.

Yeah, that would be a curse; that wouldn't be a good thing.

Sorry, going back to this notion of generalization, I also don't know. If you think of this identity function, who's to say that we don't want the network to learn a function that goes up and then comes back to the mean? It's only because the human kind of does this extrapolation — but based on what? That's my question: do we have an intrinsic notion that the pattern must go on with the identity function, and how are we doing that? I think that's the big question to answer.

Yeah, it's interesting, because as you say, in the real world we have some kind of useful inductive prior which allows us to reason about how things in the world relate to each other, but in mathematics, if it's such a simple function, why would it be so difficult to extrapolate?

Well, one thing is, this is sort of why I got really interested in generative adversarial networks: I think studying a generator that produces high-dimensional output, rather than classification labels, is a much more interesting way to look at generalization, because you can give it a dataset of cats and ask, "did you design a new cat, according to my sense of what a cat is?" Models that create things, rather than just labeling them, are a more interesting way of thinking about generalization.

That's quite interesting, though, because is that generalization or not?

Yeah. Disentangled representations are another thing I really want to get into: when you can learn how to take the hat off the guy, put sunglasses on him, take apart the factors — that's super interesting.

Oh yeah. So, I'm going to continue going down the paper, if that's all right; we keep digressing, but it's no bad thing — I think that's what this whole channel is about; it's called Machine Learning Street Talk for a reason. Now they talk about training strategies, and what struck me is: if it truly was a multi-task model, that would be wonderful. They got their best results by fine-tuning at the end on the respective tasks, but imagine if you did just have this one language model which was multi-task. It's almost like something you get in Star Trek: you say to the model, "I want you to translate something from English to German, and it's 'hello' something-or-other", and the model, without any leakage between the tasks, just gives you the result.
They got pretty good results on that, and that just blows my mind.

Though, again, you have to say that the model is probably not really learning the meaning of the sentence "I want you to translate this". They probably just prepend that exact sentence to all the translation examples and then fine-tune, and the model just needs to learn that that particular combination of tokens now indicates the language pair. It's not understanding the "I want you to do something".

Oh no, absolutely. But I suppose what it is doing is: this huge model is a computer program, and it's kind of partitioning itself, delegating responsibility, so some part of the model is now responsible for doing translation. What's interesting is the interconnected and shared nature of it: it's not that the top part of the model does the translation; it's actually reusing representations. It's interesting to bring up the MT-DNN paper from Microsoft here, because that does multi-task learning as part of the pre-training, before fine-tuning, and the take-home message seemed to be that training on multiple tasks at the same time gives you a kind of regularization, because you're learning representations that are useful and have utility across all of the tasks.

Yes, that's very believable. It would be extremely interesting if we had better introspection into these text models, to see what's really going on: how far up are these things shared? Or is it just the case that maybe only the word embeddings are shared and the model then divides itself up, like "okay, you two attention heads, whenever there's translation you do that, and all you others do the other thing"? It would be very interesting to perform introspective analyses on exactly these models — how much the representations are shared, where subroutines are reused across tasks — and maybe you'd find some super interesting things.

There was this paper a while ago that did zero-shot translation, I don't know if you've seen something like it, where I think what you do is first pre-train on, say, English, and then you just fine-tune the word embeddings for French, and then you can input English and output French. I don't exactly remember how it went, but you can do zero-shot translation without ever having trained on that particular language pair.

Is that because some of the recent neural translation models, when trained in a multilingual way, will essentially translate all of these different languages into a common intermediate representation? In that sense it could translate between almost anything.

It wasn't quite the same thing, but I'm going to find it, because it was pretty interesting, and it basically showed that there is something going on like an abstract language understanding — I'm not saying grounding or intelligence, just an abstract notion of language, where English and French go together, but there's something going on that's beyond a single task. And again, it would be super interesting to see how these multiple tasks and objectives are handled in the representations throughout the layers of the transformer.
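To make the prefix trick they describe concrete: the text-to-text framing just serializes every task as an (input string, target string) pair with a task prefix prepended. A sketch — the prefixes below follow my recollection of the ones used in the T5 paper and may not be verbatim:

```python
# Sketch of the text-to-text framing: every task becomes (input string, target string).
# Prefixes here follow my recollection of the T5 paper and may not be verbatim.

def to_text_to_text(task, example):
    if task == "translation_en_de":
        return (f"translate English to German: {example['en']}", example["de"])
    if task == "cola":
        return (f"cola sentence: {example['sentence']}",
                "acceptable" if example["label"] == 1 else "unacceptable")
    if task == "summarization":
        return (f"summarize: {example['document']}", example["summary"])
    raise ValueError(f"unknown task: {task}")

inp, tgt = to_text_to_text("translation_en_de",
                           {"en": "That is good.", "de": "Das ist gut."})
print(inp)   # translate English to German: That is good.
print(tgt)   # Das ist gut.
```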
I wonder whether someone's done a paper on that, because when you put your hypothesis on the table about it being partitioned rather neatly across the attention heads, that was quite convincing to me. Maybe it does do that, because if it were just a horrible fully-connected neural network then of course it would be very difficult to understand what's going on, but clean partitioning across the attention heads seems like something that would be more likely to happen.

Yeah, it kind of is like a conditional computation model, with how it routes things to the attention heads a little bit. Well, not really. So it's multi-task learning — and I actually hadn't understood that. I thought it actually understood "summarize", like it understood that instruction and then summarized. I thought one of the cool things about the text-to-text tasks was that maybe, if you had a new task, like "who is the pronoun referring to?", you could prefix it with that and the sentence, and it would learn it a little faster than if you had, say, embedded "0 0 0 1 0 0" to indicate pronoun disambiguation or something like that.

Yeah, and it goes to show, as well — I love this concept that neural networks are basically computer programs, albeit differentiable ones — because as you say there is actually no difference in just using a piece of text; it's still just processing it in exactly the same way, with almost no leakage.

But do you think it picks up new tasks faster? This is, to me, what's so interesting about text-to-text as something that can enable transfer learning. Say you come up with some prefix task in natural language and you give it the task after the colon: do you think it will learn that faster than if it was totally zero-shot, with, say, an embedding vector or some extra normalization parameters that tell it this is the new task?

Well, I suppose it would be hard to give it a one-hot vector, because the tokens are encoded using word vectors, so it's in a much higher-dimensional space. But assuming you replicated something similar, I don't see why it would make any difference. You could just put some garbage there, but always the same garbage — something that does not, in natural language, mean the task you want it to mean. So instead of saying "translate from English to French", you just say, I don't know, "sandwiches are nice", but consistently: whenever there is a translate-from-English-to-French example, you always say "sandwiches are nice". That would maybe be a cool experiment, to see whether it actually helps that the model has probably seen the phrase "translate" in that context somewhere in the training data, and that there might be a bias towards there now being multiple languages. I don't know, but it's a cool experiment, cool idea.

Yeah, maybe the training curve is a little faster, because it is still taking that "summarize", hitting the embedding table with it, and getting a sense of what it means; hopefully the embedding has a sense of "summarize". Yeah, but that's less exciting to me.
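For concreteness, the two conditioning schemes being contrasted — a natural-language prefix versus an opaque learned task code — look roughly like this. A toy sketch; the vocabulary, dimensions and "task id" are all invented:

```python
import torch
import torch.nn as nn

# Toy contrast between two ways of telling the model which task it's doing.
# Everything here (vocab, dimensions, task ids) is invented for illustration.

d_model, vocab_size, num_tasks = 64, 1000, 4
token_emb = nn.Embedding(vocab_size, d_model)
task_emb = nn.Embedding(num_tasks, d_model)   # opaque task id, e.g. 2 = "translation"

tokens = torch.randint(0, vocab_size, (1, 16))           # a fake tokenized sentence

# (a) Text prefix: the task description is itself tokens, so it shares the
#     token embedding table and whatever the model already learned about those words.
prefix_tokens = torch.randint(0, vocab_size, (1, 4))      # stand-in for "translate English to French:"
inputs_text_prefix = token_emb(torch.cat([prefix_tokens, tokens], dim=1))

# (b) Task embedding: prepend a single learned vector that carries no prior
#     meaning -- the model must learn what it indicates from scratch.
task_id = torch.tensor([2])
inputs_task_emb = torch.cat([task_emb(task_id).unsqueeze(1), token_emb(tokens)], dim=1)

print(inputs_text_prefix.shape, inputs_task_emb.shape)    # (1, 20, 64) (1, 17, 64)
```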
Unfortunately, it's never as exciting as it sounds at first sight.

Just to talk about the things that they've really played with in the paper: it's almost depressing that it's the scale and the amount of training that made the largest difference to the results. The baseline on GLUE — and by the way, there's another thing called SuperGLUE, and we can pull up the paper for that, but that's a more sophisticated set of GLUE-like challenges, because we were getting too good at the GLUE benchmark — the baseline was 83.28, and when they did four times as many training steps it went up to about 85.33, so that's about two points up. The biggest improvement was when they increased the size of the network, the number of parameters, by a factor of two — and they did that not by changing anything in the attention heads but by increasing the number of parameters in the feed-forward layers — and they also did two times as many training steps, and that gave them essentially three points of improvement, which is huge.

Yeah. We were talking earlier about how you only get 750 gigabytes by taking all of the text from the internet, and then with the more cleaned datasets you get something like 20 or 30, so I think what will happen is that you'll train this really large model and it'll just do one pass through the data. Size is going to scale better than training steps, because it seems to me they'll figure out how to train a hundred-trillion-parameter transformer before they figure out how to get the corresponding dataset size, and the two have to go together with that scaling.

It's quite interesting as well, because there are some weird things that happen when you increase the training size. When they did the multi-task learning, for example, with just a uniform prior over the tasks, some of them did quite badly, and of course it's a balancing thing: you're under-training on some tasks and over-training on others. They came up with a temperature parameter to weight the relevant tasks as a function of how much contribution they had. And it's not just the difference between the tasks; it's also the unsupervised training objective dominating the downstream tasks, so there's quite a balance. I think the most important thing is the kind of transfer rate between all of the different training objectives.
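On that temperature-weighted task mixing: as I understand the scheme, you clip each dataset's example count at a cap K, raise the resulting proportions to 1/T, and renormalize. A sketch with placeholder values for K and T (not the paper's):

```python
# Temperature-scaled mixing rates for multi-task sampling (my reading of the
# T5-style scheme): r_m is proportional to min(e_m, K)**(1/T), where e_m is
# dataset m's example count. K and T below are placeholders, not the paper's.

def mixing_rates(example_counts, K=2**19, T=2.0):
    clipped = {name: min(n, K) for name, n in example_counts.items()}
    scaled = {name: c ** (1.0 / T) for name, c in clipped.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

counts = {"c4_span_corruption": 10**9, "translation": 5 * 10**6,
          "cola": 8_500, "squad": 88_000}
for name, rate in mixing_rates(counts).items():
    print(f"{name:22s} {rate:.3f}")
# With T=1 this is (capped) examples-proportional mixing; larger T flattens the
# distribution towards uniform, which is the knob being tuned.
```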
Yeah, you see that same phenomenon of class imbalance — where you have way more dogs than cats — showing up here at the task level, which is definitely interesting. You would think the text prefix would help with that, and I guess that's partly why the conditional computation idea is so interesting: if you can have special sparse routing that comes from the task embedding, that might be the best way to overcome class imbalance, or task imbalance. Or maybe it's also a bit like catastrophic forgetting and continual learning, and I think the latest approach to that, from the Jeff Clune paper, is neuromodulation, another kind of sparse routing. I'm not exactly sure what happens in that paper, but it's definitely interesting to think about the composition of the multi-task datasets and their sizes.

Absolutely. That's pretty much the paper, to be honest, but I think it was really good just to explore: if I change this, and if I change that, what makes the biggest difference? Clearly, when you look at the history of the papers in, say, question answering over the last few years, it was the BERT paper that really gave us the significant boost, and since then we've probably got another ten points' worth of improvement, but it's flattening out. On a two-year time horizon it was this transfer learning — and maybe all transfer learning means is training for longer on more data — that gave us the biggest improvement.

Another comment worth making is just how democratizing this is. It's so easy now to download a pre-trained model, this unified T5 model we're talking about — they've published it online — or you can train it again yourself, and you can do really sophisticated translation. The point I'm trying to make is that a few years ago, if you wanted to do really good speech recognition or translation, you couldn't do it yourself; you had to use the Microsoft or Google cloud platform — it's called Cognitive Services on Azure — because it was just accepted that you couldn't do it yourself: you didn't have enough data, you didn't have a good enough model, you had to pay for it. Now it's really becoming quite democratized.
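On the democratization point: assuming the Hugging Face transformers library is installed (with sentencepiece) and the public t5-base checkpoint is still hosted under that name, downloading and running the released model is only a few lines:

```python
# Minimal sketch of downloading and running a released T5 checkpoint via the
# Hugging Face transformers library (assumes `pip install transformers sentencepiece`
# and that the public "t5-base" checkpoint is available under that name).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

text = "translate English to German: The house is wonderful."
input_ids = tokenizer(text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```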
Yeah, and you can just form your own dataset and then say "answer these questions about it". I also think it's important to remember the MT-DNN-style multi-stage thing, too: if you want to take, say, all of the deep learning papers and make a dataset of that, then before you start labeling it you can also do this self-supervised learning on that data and have that as part of the pipeline.

Yeah, and maybe another way to think about it is: why is language different? A high-quality language dataset is apparently thirty-five gigabytes, which is nothing, whereas facial recognition seems different — you know the Google FaceNet paper, they trained it on about twenty-six million faces, and that's why it works so well, and that has incredible value. I wouldn't be able to download that from Google, and if I did, it would be very, very valuable. So it's almost as if language is now becoming more democratized than vision.

Yeah, it's just a more compressed representation. I think we've just not figured out yet, for images, how to compress them in a way that we can reconstruct, but ultimately all you care about is what ends up in your brain. For language we've found this good notion that we can encapsulate the sound waves in just words; for images we haven't done that yet, and maybe in the future, once we figure it out, the datasets might become smaller. The question is how much informative content there is.

Maybe one way of looking at it, though, is that language and vision are both unstructured data, but if you think about it, vision is much more unstructured than language. Language is actually very structured, isn't it? You have atomic tokens, you can model very clear relationships between where they appear, and it's one-dimensional. So maybe that's why you need much less pre-training data for language than you do for vision.

Yeah, possibly. You only have to learn the grammar of a language to make sentences, at least, but in order to learn what natural images look like you have to learn correlations between all the pixels, and no one tells you what's good and what's not. I'm not sure. Maybe it's also because vision is just a 2D image of the actual 3D world with objects, so it's a projection of something. Yeah, I'm not sure.

I have another question: how far do you think this "we encode everything as text" approach will go? Because I'm pretty sure, if I were to build something like a chatbot or an assistant — you know, like an Alexa — I would just hard-code all the individual commands, and that would work much better for now.

That's actually a really interesting point, because there is that famous chatbot from Microsoft, Tay, and you should be incredibly careful releasing anything into the world which learns and which isn't explicitly programmed.

And not even talking about learning: it's just, do you build it with this notion of "I'm going to really understand what the human wants", in a deep-learning kind of way? If I were to do this, I would try to understand what it means and then just regex-match to my predefined list of tasks that it can do.

Well, I think the regex list would be so challenging to define. If we're both on Lyft customer service, and you and I are both frustrated because the map has made an error, I think we're going to describe our problems so differently in natural language, and that's the problem. So then it comes back to this idea of fast adaptation: it needs to quickly adapt to how Connor describes his frustration with the map, and then quickly adapt to how Yannic describes his frustration with the customer service. So I think that domain adaptation — getting quickly into the new domain — is going to be the key thing that makes this applicable.

There's something interesting here, though, which is that, first of all, people think chatbots are all about AI, and they're really not: they are just a user interface, and they are best constructed with static rules. The only natural language processing you get is things like intent recognition — what do I want to do? Because if you think about it, if you did have some kind of unstructured output — imagine you plugged the GPT model into your chatbot and you could say to the chatbot, "how are you feeling today?", and the GPT model would say "I'm feeling great, thanks" — what value does that have? It's completely useless, and it's also unquantifiable. If I'm releasing this thing in my production system and I want to write tests and understand what it's doing, it's completely useless; I've got no idea what it will do.
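The "regex match to a predefined list of tasks" approach described above is essentially classic intent recognition; a toy sketch, with the intents and patterns invented for illustration:

```python
import re

# Toy intent-recognition chatbot of the "predefined list of tasks" flavour the
# discussion describes. Intents and patterns are invented for illustration.
INTENTS = [
    ("refund",        re.compile(r"\b(refund|charge[ds]? twice|money back)\b", re.I)),
    ("map_error",     re.compile(r"\b(map|route|wrong (street|address|turn))\b", re.I)),
    ("driver_rating", re.compile(r"\b(rate|rating|driver was)\b", re.I)),
]

def recognize_intent(utterance):
    for name, pattern in INTENTS:
        if pattern.search(utterance):
            return name
    return "fallback"   # hand off to a human / ask a clarifying question

print(recognize_intent("Your map sent my driver to the wrong street again"))  # map_error
print(recognize_intent("I got charged twice for this ride"))                  # refund
```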
Yeah, but what if it was pre-trained as a chatbot and then you could fine-tune it somehow on, say, all the deep learning papers, and then you could ask it questions about that? That would be sort of interesting.

That would be sort of interesting, yes. I just keep coming back to this notion of formulating the task as text. What this paper doesn't do is show how you can then do something like the meta-learning Connor mentioned: if you do this enough, can you throw a task at it that it has never seen in that form, given the textual instructions? I'm thinking back to a presentation by Richard Socher of Salesforce a while ago, where he pretty much proposed the same thing, except he said all of NLP could be question answering: you always have a question and a context, and he also had this notion that you can output text to answer the question, so you can frame something like next-sentence prediction, or do sentiment analysis, by simply outputting text. But he also had this notion of pointer networks, and those, I find, have just faded — I don't know if they don't work or if they're just not hip — where you can either output text or point to something in the context or the question and say "here is the answer". Because they had already formulated all of NLP into this one task structure, very similar to here, maybe we're ready for a return, a resurrection, of the pointer network. Just interesting parallels.

It reminds me of multi-hop question answering, that kind of idea, where you need to answer your question in such a way that you then go and ask two more questions to better answer the original one. And that context-free question answering is one of the key things they claim this T5 model does well: you can ask "when was World War Two?", or any question, and it just knows it, because it has memorized it.

Yeah, of course, but that was also GPT-2's thing, right? They didn't fine-tune; they just pre-trained the language model, posed the question, and magically the answer came out — just because it was in the training data somewhere.

Yeah. I love this parallel: in the olden days we used to have knowledge graphs that were explicitly computed, and we would say in that knowledge graph that Barack Obama is 45 years old. The thing is, it was so onerous to build up the knowledge in the knowledge graphs, and the information was often conflicting, so people used to write unit tests which check for logical consistency. And actually that's happening in language models as well. My friend Marco Ribeiro has this concept of coming up with semantically equivalent adversarial examples, or logical tests. So he might ask the question "how many boots are there in this image?", and the answer would be 3, and then the logical-consistency question would be "are there any boots in this image?", and you would expect the answer to be yes. You can possibly use all of these things to generate adversarial examples, which you can then use to pre-train or fine-tune the model, or you could potentially use them as a way to say "no, I don't like this model, I'm not going to put it into production". I'm quite interested in this engineering reality: we have these unstructured models, and coming up with some non-interactive way to reason about their behaviour is really interesting.
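Those logical-consistency checks can be written as ordinary unit tests around whatever QA model you have; a sketch with a stubbed-in model so the example runs (the real check would call your actual (V)QA system):

```python
# Sketch of a logical-consistency test in the spirit of the "how many boots /
# are there any boots" example. `dummy_qa` is a stand-in for a real (V)QA model.

def dummy_qa(question, context=""):
    # Stand-in model: pretend it answers from a fixed fact table.
    facts = {"boots": 3, "hats": 0}
    for obj, count in facts.items():
        if obj in question:
            if question.lower().startswith("how many"):
                return str(count)
            if question.lower().startswith("are there any"):
                return "yes" if count > 0 else "no"
    return "unknown"

def check_count_consistency(qa, obj, context=""):
    count = int(qa(f"how many {obj} are there in this image", context))
    exists = qa(f"are there any {obj} in this image", context)
    return (count > 0) == (exists == "yes")

print(check_count_consistency(dummy_qa, "boots"))  # True
print(check_count_consistency(dummy_qa, "hats"))   # True
```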
But even humans are terrible at this; humans would fail these unit tests so hard. There is this thing in personality science called the Big Five personality test — it's statistically derived and it's a self-report test — and when you fill it out they always ask you the same thing in two different ways: it'll be "I often feel sad", and then later "I don't often feel not sad", or they do it more cleverly, but still. And the number of times the one answer isn't the negative of the other is just astounding. It's like, "hey human, here you said you always feel sad, but here you said you're mostly happy — what's going on?"

You know what's even worse? When I applied for my current job they gave me these silly psychometric profiles, and they said I was only an X out of X on this dimension, and I asked why, and they said, "you took ten milliseconds too long to answer the question". [Laughter] Crazy.

I see where these kinds of tests are going, but at some point it becomes more a question of the tester — of how well these things really are logically equivalent — because language is just so subtle, and the boundaries to things like sarcasm and hyperbole are sometimes so narrow, that it is super hard to write this procedurally. Then again, if we could write these things procedurally, we could solve the problem.

Yeah, but again, this is somewhere we have potential, because the reason WordNet, for example — that was also a knowledge graph — was so brittle is that it didn't get updated very often and it didn't know colloquialisms; it didn't know that a word means something different where we are. But possibly the composability and distributed nature of deep learning architectures would allow us to start with a kind of pre-trained base model and then have a high-velocity model on top which learns the colloquialisms that are relevant to me.

Yeah, definitely. Well, guys, I think we've come to the end of the show, but it's been emotional.

I just want to say to the viewers that we're trying different things. Last week we had an author on, which was more of an interview style; the week before we had many different people; this week we're in a field where none of us is really well versed, so we just chipped in whenever someone had something to say. We don't know ourselves what is best and most interesting, so we're looking very much for feedback on what people like and what is interesting and what isn't: should we be super prepared and very technical, for people who know everything, or are most people more like, "well, you talk about things I don't even understand, you should introduce the concepts first"? So this is an appeal to the listeners and viewers to leave us comments and give us a poke. And remember to hit that subscribe button and the notification bell, because if you don't, we will be seriously unimpressed. All right...
Info
Channel: Machine Learning Street Talk
Views: 3,362
Rating: 4.93 out of 5
Id: N-7rdJK4xlE
Length: 100min 9sec (6009 seconds)
Published: Thu Apr 23 2020