Kaggle Reading Group: XLNet | Kaggle

Captions
Okay, waiting for the streams to catch up with me. Audio sounds good; quickly checking audio on the other end... that sounds good. Okay, now we're just waiting for YouTube, which is always the slowest for reasons that are unknown to me, so I should make weird sounds. All right, looks like we are go.

Good morning, everybody — well, morning for me; I suppose it could be a different time of day for y'all. We are starting a new paper in the Kaggle Reading Group, and as usual I opened it up to a vote. I remember the last vote: it was EfficientNet versus the one that we actually read, and I definitely remember the name of that paper. Hi Nicholas — yeah, this is a good time to make it to the live stream, because we're starting a new paper. Whatever paper we read last time won by one vote — Sparse Transformers, that's the one? Ah, no, it was the one after Sparse Transformers. Anyway, we've read so many papers that they start to blur together a little bit. This time there was no contest: XLNet handily beat out all of the other papers.

So, a little bit of information about XLNet. I don't actually know that much about it, except that it's autoregressive — obviously, from the title — and I'm mostly familiar with autoregression from time series. The basic idea there is that you do regression on time point n based on time point n minus 1, etc., so you're regressing on previous points in the time series. I think autoregressive models can be used in other ways, but that's the only place I've run into them personally.

This is quite an interesting assortment of folks. Some of them are from Carnegie Mellon (CMU), which I think is in Pittsburgh; it's well known as a machine learning, computational-science sort of university, and I think they also have a really good robotics program. There's also Google Brain, and unless I'm mistaken, at least one of these researchers is at Apple AI now, or has transferred to Apple AI, because I was looking up all of the authors to tweet about the paper and one of them had a bio that said they were at Apple. I'm just going to pull up the tweet here... yeah, so Ruslan Salakhutdinov — I probably got that pronunciation wrong, and that's on me, I'm sorry — is also at Apple, which is interesting, because you don't generally see Apple publishing a lot of papers, and this may be a bit of a sea change there, which would be exciting. And then Quoc Le, who I actually interviewed a little while ago — we talked about AutoML and neural architecture search. So some people I'm more or less familiar with; I don't think I know Zhilin Yang or Yiming Yang, and I'm a little bit familiar with Jaime Carbonell. Anyway, those are the things I know about the paper — that's pretty much it.

Oh, the other thing I know about the paper, which I think is the important thing people have probably heard about, is that it is currently state of the art. In this paper they lay out results that are better than the ones laid out by BERT, which we read like three papers ago. BERT is an architecture that does language modeling using bidirectional transformers, and if you remember, the basic idea is that it's trained on a task where you have a sentence, you mask one word in the sentence, and you guess that word — instead of modeling the conditional probabilities between every word pair or triplet or whatever, which was the more usual way of doing language modeling.

All right — hello, hello, hello, I'm getting so many people in the chat:
Teondra, David, Andrew — "Andrew," that's how that name is usually said — and Ralph made it to the live stream. Welcome, hello, hello, everybody. And with that out of the way, let's jump in. I did put the link in the bio this time — not the bio, the description — so if you want to read along you're very welcome, and I will also make it bigger.

All right, abstract: "With the capability of modeling bidirectional contexts, denoising autoencoder based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling." Denoising — okay, presumably we will talk about all of these things later. "However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy." Okay, so the masking is the thing where you look at the sentence except for one of the words, which is hidden. "BERT neglects dependency between the masked positions" — so they're saying that because there's a mask, you can't model the relationship of the masked things to each other, which means that when you take the existing model and apply it to a new problem you get different outputs, I guess.

Let's see — sorry, I saw some comments go by. "Is this model launched by Microsoft better than BERT?" Oh, I know there have been some Microsoft models; I think Big Bird was a Microsoft model. Excuse me — sorry, this one is out of Carnegie Mellon and Google. Yeah, Andrew asks about the two masks — I'm assuming what they mean here is that when you are training over successive sentences you will see different things being masked, but I think that generally during training they were only masking one thing per sentence. This is the abstract, so hopefully everything they say here they'll go into in deeper detail a little bit later on.

"In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method" — so autoregressive rather than autoencoding — "that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order." So: enables learning bidirectional contexts — we're looking both ways — by maximizing the expected likelihood — we want the MLE, which is very common in statistics — over all permutations of the factorization order, so all possible orderings. What's the factorization order? I don't know what the factorization order is. "And (2) overcomes the limitations of BERT thanks to its autoregressive formulation." So I think they're claiming here that the model works equally well on its training data and on unseen data. "Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks, including question answering, natural language inference, sentiment analysis, and document ranking."

Andrew asks, "do they mean dependency between two masks in the training case?" I think they must, yeah; I'm hoping they'll go into more detail there, because I am a little confused. Renato says the new model from Microsoft is called MT-DNN — yeah, I think that was actually an option in the voting at some point in the past; I haven't read it, or if I have, I've forgotten all about it, which are both equally likely. And me and Gustav are both doing very well — Gustav is my hedgehog, and he has a vet visit next week, so hopefully that goes well.
Okay, so they're saying that, as opposed to BERT — one second, sorry, I want to make sure you guys don't get a little bit of echo — so, as opposed to BERT, they can handle... no, I think they can handle the same data; sorry, I'm trying to pick out the differences between the models. They're saying that they're autoregressive rather than autoencoding, and that this means performance is better matched between pretraining and fine-tuning — those two are more similar to each other. That's what I'm getting so far from this. And then of course they get state of the art, blah blah blah — otherwise people would not be talking about the paper.

Andrew says, "perhaps this is an example of generative output and not an abstract." No, I think it is an abstract. This is also a fairly machine-learning-y group of folks; I don't know how many of these people predominantly work in natural language processing and with language data. I'm not super familiar with any of them except Quoc Le, and I know that he predominantly works on very core machine learning, AutoML kinds of things. Mine says we read MT-DNN — yeah, "we" the channel... oh, was it the one with the gated — not gated RNNs, but gated CNNs? I vaguely remember... multitask learning — oh yeah, we definitely read that. Okay, so that's the one where you're learning multiple tasks and the errors on each of those tasks propagate back to the same model (there's a tiny sketch of that idea just below). It's all coming back to me. This is not coffee and it's not caffeinated, so we'll see how I do. Thanks, Ryan — yes, we did read that.

Okay, introduction. Oh, this is nice, it doesn't have two columns. Generally ACL papers tend to have two columns, and this is much more of an ICML type of format, which is another giveaway that maybe these are machine learning researchers and not so much NLP researchers. Which is fine — it's nice to have cross-pollination. Alecin says, "so the autoregressive bit introduces some kind of correlation between the mask positions?" We'll find out.

Introduction: "Unsupervised representation learning has been highly successful in the domain of natural language processing." I hate this citation style. I hate it. I want to see the names and the years in the text — that is important information that I need for my scholarship. I hate it with a fiery passion; the only place I think it's acceptable is on a poster, because then you're on the same page and you can just look down in the corner. That is my personal huge pet peeve, but I think it is, again, the standard style for this kind of venue. All right, well, I'm assuming that they're going to cite the important ones here, and I can't see what they are unless I click over. So: "Deep contextualized word representations" — I don't know if we read that one; it's out of AllenNLP — and I'm going to have to scroll back up. Oh, if you're going to do this, at least alphabetize your citations so that I can look at a bunch of them in one go — they're not even in order. Okay, well, you guys can look at those citations on your own time.

"Typically, these methods first pretrain neural networks on large-scale unlabeled text corpora, and then fine-tune the models or representations on downstream tasks." Yeah, we've seen a lot of that. Sorry for ranting — I'll try and keep it a little more... no, I'm not going to; if something bothers me that much, I'm just going to let you know.
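Since MT-DNN came up: here is a minimal sketch of the multitask idea described above — several task-specific heads sharing one encoder, so each task's loss backpropagates into the same shared weights. Everything here (module names, sizes, tasks) is invented for illustration; this is not MT-DNN's actual architecture or code.

```python
import torch
import torch.nn as nn

# Toy multitask setup: two task heads share one encoder, so gradients
# from both losses flow back into the same shared parameters.
shared_encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
sentiment_head = nn.Linear(64, 2)    # task A: binary classification
similarity_head = nn.Linear(64, 1)   # task B: regression

x_a, y_a = torch.randn(8, 128), torch.randint(0, 2, (8,))
x_b, y_b = torch.randn(8, 128), torch.randn(8, 1)

loss = nn.functional.cross_entropy(sentiment_head(shared_encoder(x_a)), y_a) \
     + nn.functional.mse_loss(similarity_head(shared_encoder(x_b)), y_b)
loss.backward()  # errors from both tasks update the shared encoder
```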
Okay: "Under the shared high-level idea, different unsupervised pretraining objectives have been explored in literature." "Explored in literature" rather than "in the literature" says to me that they're talking about this being explored in literary fiction, which I think is not what they mean. So: there are different ways to do the pretraining, or language modeling. Back in the day it was all statistical, n-gram based stuff; recently people have been trying different things. "Among them, autoregressive (AR) language modeling and autoencoding (AE) have been the two most successful pretraining objectives. AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model. Specifically, given a text sequence x = (x1, ..., xT), AR language modeling factorizes the likelihood into a forward product" — factorizes the likelihood into a forward product, okay: p(x) is the product of the probabilities of x_t given x less than t — "or a backward one." Okay, so we are looking at each word in our sequence and calculating its probability given the previous words in the sequence, or given the following words in the sequence. Oh, this is just normal language modeling! I don't think I've ever heard it called autoregressive language modeling, but this is exactly what I think of when people say language modeling. My feet are under me again; I understand things; I'm good to go. Product of probabilities — yeah, sorry, did I say sum? Definitely a product, thank you for the correction. (I'll write the clean version of this factorization out just below.)

"A parametric model (e.g., a neural network)" — that's a weird thing to call a parametric model, given that we're in probability space — "is trained to model each conditional distribution." Okay, so each conditional distribution would be a distribution across the vocabulary, for each word in our vocabulary. "Since an AR language model is only trained to encode a unidirectional context (either forward or backward), it is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information. This results in a gap between AR language modeling and effective pretraining."

Yar says the autoregressive term is more frequent in time series analysis — yeah, that's where I've heard it before, but I guess it makes sense here, because you're modeling the probability of a word at time point t given the words at time points before t. I was going to try and do that with numbers, but I didn't start at one, so it's a little hard to say with words. Okay, and they're saying that this is really good at going in one direction or the other, but it's more difficult to go in both directions — which is why people have done things like ELMo, where you train one model forward and one model backward and look at them together, or things like BERT, where you're looking forward and backward at the same time given the point that you've masked — the word that you've masked, sorry, where words are points, I guess. Okay — so this makes a lot more sense; I think this is just vocabulary that I wouldn't use to describe these particular things, but now that I'm on the same page it makes a lot of sense.
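To keep myself honest, here's the factorization we just read written out cleanly, as I understand it — a product, not a sum, as chat pointed out — for a sequence x = (x_1, ..., x_T):

```latex
% Forward autoregressive factorization
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_{<t})

% Backward autoregressive factorization
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_{>t})
```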
"In comparison, AE (autoencoding) based pretraining does not perform explicit density estimation, but instead aims to reconstruct the original data from corrupted input. A notable example is BERT, which has been the state-of-the-art pretraining approach. Given the input token sequence, a certain portion of tokens are replaced by a special symbol, [MASK], and the model is trained to recover the original tokens from the corrupted version. Since density estimation is not part of the objective, BERT is allowed to utilize bidirectional contexts for reconstruction." Sorry, I'm trying to think about whether the fact that BERT isn't doing density estimation is what allows it to be bidirectional — I guess so, because you don't need to be looking at change over time. Santhosh says, "I think autoregressive models are used to convert a joint distribution into conditional distributions" — oh, that makes sense as well; so it's not just co-occurrence, it's more linear, though it loses sense when you use it in both directions. Someone says this is a rewrite of ELMo — hm.

And Nicola asks, "what does density estimation mean?" So — and it's been a while since I've had statistics — I think of it as a probability distribution over a set of things. Feel free to jump into the chat and correct me here, but I think of it like: if it's sunny today and it was sunny yesterday, can I use that to determine how likely different weather patterns are tomorrow? Will it be sunny, cloudy, rainy — given the information that I have, where am I going to put the bulk of my probability points, if that makes sense. In a language modeling context, that would be: if I say "I'm going to the ___" and I have my whole vocabulary, I'm going to assign probability points to things like "store", "park", "school", "ball game" — all the vocabulary words that have occurred in similar positions in my training corpus. (I'll drop a tiny toy version of this below.) Nicola asks, "is it like conditional probabilities?" Yes, but over a set of contexts, I think — that's my understanding right now. Again, I'm not a statistics professor, so if you all have different understandings I would appreciate the collaboration. Yar says it sounds pretty right — okay, good. Also, all of my understanding of statistics is basically those toys that are a bunch of pins between Plexiglas, where you put your hand into it and the handprint shows up, you put your face into it and your face shows up — it's like that, but with numbers and things possibly happening. I don't know why that's how I conceptualize statistics, but it is. Andrew says, "I think you got it right — anything that approaches it as: what's the most probable unknown thing given what I do know." Thank you, helpful — yeah, that also sounds right to me.

All right. So we were talking about BERT — everything's available online, which I think the Kaggle community knew, because a lot of people used it in the competition that recently closed. "Equal contribution" — okay, so the first two authors contributed equally, which is nice; they're both first authors, "order determined by swapping the one in" some other paper — okay, so there must be another paper where they both contributed equally and they're just using the opposite order here.

"As an immediate benefit, this closes the aforementioned bidirectional information gap in AR language modeling, leading to improved performance." So this is: if you have a sequence and you hide one thing in the sequence, you can model it — you can guess it — given both the preceding and the, um, "postceding" information? Preceding and postceding — is that a word? Before and after. Okay.
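Going back to the density estimation question for a second, here's a tiny toy version of the "I'm going to the ___" example from above: a conditional distribution over next words given a context, estimated from raw counts. The counts and vocabulary are invented purely for illustration; a real language model would of course parameterize this with a neural network rather than a lookup table.

```python
from collections import Counter

# Toy observations of context -> next word (made-up data, purely illustrative)
observations = {
    ("i", "am", "going", "to", "the"): ["store", "park", "store", "school", "store", "park"],
}

def next_word_distribution(context):
    """Estimate p(next word | context) from raw counts: a crude density estimate."""
    counts = Counter(observations.get(context, []))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_distribution(("i", "am", "going", "to", "the")))
# {'store': 0.5, 'park': 0.333..., 'school': 0.166...}
```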
Hu says, "oh my god, I'm learning" — thanks; we're all learning, we're all learning together. Every morning I wake up and I'm like, I've never heard of so many things, and that's a gift, I guess — I get to explore them. Yar says all probability concepts became clearer to him once he started studying the PyMC3 library — oh, that's helpful. And Haresh asks, "are you a PhD?" I am; it's in linguistics, with a focus on natural language processing and sociolinguistics, and I've talked about my dissertation before on the channel, so I'm not going to do it again or I'll get sidetracked.

"However, the artificial symbols like [MASK] used by BERT during pretraining are absent from real data at fine-tuning time, resulting in a pretrain-finetune discrepancy." Okay, so this is what they mean: the token that you are using to mask never shows up in your real vocabulary. Excellent — and that answers the question that I think Andrew had earlier; helpful. "Moreover, since the predicted tokens are masked in the input, BERT is not able to model the joint probability using the product rule as in AR language modeling." Okay — so because you're guessing the masked word, you don't know what it is, so you can't learn about its co-occurrence with the other masked words, is what they're saying. But if I remember correctly from the BERT paper, they had some tokens that were masked and some that were just left as-is — I think only like thirty percent of the selected ones were actually masked? Does that sound right? It was a minute ago when we read it.

Schroth says, "if we see the next word after the word, then we want to predict the word; it seems to be learning with next words and previous words — I think that's why bidirectional settings won't work as language models. BERT is not a language model, it's more like a denoising autoencoder." I think it is a denoising autoencoder, but it is also used the way a language model would be, so I am comfortable calling it a language model even though it's not modeling conditional probabilities. For me, what makes a language model a language model is extremely utilitarian — can I plug it into the place where I need a language model in my processing pipeline — rather than, specifically, does it have conditional probabilities of words. That's me. And, "is that overcome by multiple rounds of random sampling during training?" Yeah, I would think so, but I guess their point that you never see a [MASK] token in real life is reasonable.

"Moreover, since the predicted tokens are masked in the input" — okay, we did that — "in other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified, as high-order, long-range dependency is prevalent in natural language." BERT assumes predicted tokens are independent of each other given the unmasked tokens — okay, so say we've got eight training samples, each of them has one masked word; we predict the masked word for each of them, but we don't use the predicted word in the rest of our training, so we don't learn the relationships between the predicted words that were masked during training. I think that's what they're saying. (There's a toy illustration of the masking and the independence assumption just below.) Schroth says a language model has to have generation capabilities, which BERT lacks — I guess I could generate with BERT? I don't know that you could generate right-to-left, in the normal language generation way. Oh, that's a good question — if I weren't already working on something for my live coding, I think that could be a fun live-coding project.
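To make the masking discussion concrete, here's a toy version of BERT-style corruption. This is a simplification of what BERT actually does (real BERT selects about 15% of positions at random, and sometimes keeps or randomly replaces them instead of masking); the sentence and the chosen positions here are my own invention.

```python
MASK = "[MASK]"

def corrupt(tokens, positions):
    """Replace the chosen positions with [MASK] (simplified BERT-style corruption)."""
    corrupted = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}   # what the model must reconstruct
    return corrupted, targets

tokens = "new york is a city".split()
print(corrupt(tokens, positions={0, 1}))
# (['[MASK]', '[MASK]', 'is', 'a', 'city'], {0: 'new', 1: 'york'})
# A BERT-style objective predicts 'new' and 'york' separately, each conditioned
# only on the unmasked tokens -- that is the independence assumption being
# discussed -- and [MASK] itself never appears in downstream fine-tuning data.
```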
Santhosh asks, "what is denoising in the paper — what does denoising mean?" Let me find the citation... ICML 2008 — ooh, that is vintage. Denoising autoencoders... oh, I see, okay. So the "denoising" means that we've corrupted the input — it's the same terminology they use in this paper, which says to me that this is in the same academic tradition. So the masks are the introduced noise, and by guessing the words to replace them, you're removing the noise. That's a good question, thank you.

Nicola asks, "assuming BERT doesn't use the predicted word for further training, isn't that a good thing?" I don't know — I think that's an empirical question. One thing that could happen if you did use the predicted word for further training is the thing where existing biases get amplified. If you have an unbalanced class and then you train models on that unbalanced class, one thing we know about language models and distributional semantics models is that the class imbalance will be exaggerated — which is part of the reason why debiasing word embeddings is such a big deal: if you don't do some sort of debiasing, any existing bias will be worse when you apply your model. I'm trying to remember the citation for that and I can't off the top of my head, but I do have a kernel on Kaggle that reproduces that paper, if you're interested. Scott says, "I think you have to feed BERT embeddings through a sequence-to-sequence layer to get language generation" — yeah, that makes sense. Okay, good questions — I have forgotten where we are.

Okay, so here's where they're talking about modeling the joint probability — I guess that's true. "In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified, as high-order, long-range dependency is prevalent in natural language." I'd have to go back and review the BERT architecture; I don't actually know if they're independent, but I guess I trust these researchers to know what they're talking about. "Do I have an updated version of BioBERT in XLNet?" I don't — somebody in the comments might. Sira says they're talking to you. Okay.

"Faced with the pros and cons of existing language pretraining objectives, in this work we propose XLNet, a generalized autoregressive method that leverages the best of both AR language modeling and AE while avoiding their limitations." Oh — so many people don't call autoencoders language models; I do. That's me, again, being extremely utilitarian. "Firstly, instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log likelihood of a sequence with respect to all possible permutations of the factorization order." Are they shuffling words? What does that mean? "Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context." Okay — okay, so in the product there's no effect of order; the effect of order is just that you're looking at the word at position t given the words before that position, and it doesn't matter what order the words before position t are in. So I think what they're doing is shuffling all the words, training a language model on the words in a shuffled order, shuffling again, training, shuffling again — except they're probably not shuffling; they're stepping through all the possible permutations.
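Here's a toy illustration of what I think "all permutations of the factorization order" means: the tokens stay at their original positions, but the order in which the model predicts and conditions on them is permuted. This is just my reading of the abstract at this point, not the paper's implementation — and in practice you would sample factorization orders rather than enumerate all of them. The example sentence is mine.

```python
from itertools import permutations

tokens = ["I", "like", "cats"]            # positions 0, 1, 2 never move
for order in permutations(range(len(tokens))):
    # One factorization order: predict each token conditioned on the tokens
    # that come earlier in this *order*, not earlier in the sentence.
    steps = []
    for i, pos in enumerate(order):
        context = sorted(order[:i])        # shown in sentence order for readability
        steps.append(f"p({tokens[pos]} | {[tokens[c] for c in context]})")
    print(" * ".join(steps))
# e.g. the order (2, 0, 1) gives: p(cats | []) * p(I | ['cats']) * p(like | ['I', 'cats'])
```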
Oh, that makes me tired just to think about — that's a lot of work. Is that what they're doing? "The context for each position can consist of tokens from both left and right" — yeah, it would, if you're doing that. "In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context." Mobis says XLNet is time-consuming — yeah, sounds like it; I think someone did the math and figured out that it took something like two hundred eighty-five thousand dollars' worth of compute to train, which — yikes, that's so much.

"Secondly, as a generalized AR language model, XLNet does not rely on data corruption." Okay, so you're not removing data — but in language, randomly shuffling words is something that I personally would call corruption. I know "corruption" probably has a specific technical definition here, but if they're doing what I think they're doing, they're still messing with the input; they're still doing stuff. "Hence, XLNet does not suffer from the pretrain-finetune discrepancy that BERT is subject to. Meanwhile, the autoregressive objective also provides a natural way to use the product rule for factorizing the joint probability of the predicted tokens, eliminating the independence assumption made in BERT." Okay, that makes sense, because for every token you see all the other tokens in the sequence that you're modeling. Andrew says, "like looking at all possible orderings from a bag of words" — yes, that's what I assume they're doing. Surratt says: assume the seed is "I like to ___"; whatever comes after should be something the model expects after the seed, and that's what the model needs to predict — so with BERT you would have to give it "I like to [MASK]" and then ask BERT to predict the thing that came after. It is a kind of noise, yeah — Yar says they're adding noise, corrupting the input. I get what they're doing, I understand what they're going for; it just feels weird to me, it feels like it's ignoring syntax. Okay, let's make sure they're doing what I think they're doing first.

"In addition to a novel pretraining objective, XLNet improves architectural designs for pretraining. Inspired by the latest advances in AR language modeling, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL" — have not read that paper — "into pretraining, which empirically improves the performance, especially for tasks involving a longer text sequence." Okay, so they're doing something more efficient with their model architecture that we'll presumably talk about soon. "Naively applying a Transformer-XL architecture to permutation-based language modeling does not work because the factorization order is arbitrary and the target is ambiguous. As a solution, we propose to reparameterize the Transformer-XL network to remove the ambiguity." Okay — they are definitely assuming that I know what Transformer-XL is; I know that it exists, which is about as far as I've gotten with it. So you can't just take your existing architecture and use it again, because the factorization order is arbitrary — random — and the target is ambiguous; I don't know what they mean by that. Presumably they'll talk more about it when we get there.
"Empirically, XLNet achieves state-of-the-art results on eighteen tasks: seven GLUE language understanding tasks, three reading comprehension tasks including SQuAD and RACE, seven text classification tasks including Yelp and IMDB, and the ClueWeb09-B document ranking task. Under a set of fair comparison experiments, XLNet consistently outperforms BERT on multiple benchmarks." Okay. I will point out that I wrote a blog post a while ago about a paper looking at baselines — it was for recommender systems — where the authors fine-tuned some baselines from like ten years ago and achieved state of the art just by continuing to fine-tune them. So I'm a little more leery of model comparisons these days, but I'm assuming they're going to compare against the results reported by the BERT authors, which I think is probably fair.

Right — related work. "The idea of permutation-based AR modeling has been explored in [32, 11], but there are several key differences." All right: [32] is JMLR, "Neural Autoregressive Distribution Estimation" — very much not an NLP venue — and [11] is "Masked Autoencoder for Distribution Estimation," ICML. Okay, so is there an ACL paper in here? I'm just going to skim the references, I'm just curious: preprint, preprint, preprint, NeurIPS, a dataset (Common Crawl), NeurIPS, ACM, ICML, a conference I'm not familiar with, arXiv preprint, preprint, ICML, another one I've never heard of, preprint — I'm sure these are all fine conferences, I'm just not familiar with them — preprint, ACL (okay, one ACL paper so far out of fourteen), preprint, preprint, preprint, NeurIPS, preprint, preprint, preprint, LDC (that's going to be a dataset), preprint, some weird preprints, preprint, preprint, preprint, VLDB Endowment (another thing I've never heard of), something wrong with that citation, preprint, JMLR, NeurIPS, ICLR, preprint, preprint, preprint, NeurIPS, IEEE — okay. So this work is not specifically situated in the NLP literature, and I think the one ACL paper is probably BERT — oh, it's not. Okay — the BERT paper was actually published at NAACL, which is an ACL-family conference. Also, I do not like it when people cite preprints when the published paper is available, and I will get you on it if I'm reviewing — I'm not going to be nasty about it, but I'm going to write a little note like, hey, this is not the correct citation for this paper.

All right, so people have been doing permutation-based AR modeling, but those paper titles did not seem to me to be focused on language specifically. "Previous models are orderless" — oh, interesting — "while XLNet is essentially order-aware with positional encodings." Okay, so it IS order-aware; I take back what I said about them ignoring order entirely. BERT also used positional encodings — they had a weird part of the paper where they were like, we compute it with a function instead of counting. "This is important for language understanding, because an orderless model is degenerated to a bag-of-words model, lacking basic expressivity." Yes — ten out of ten, correct. Okay, I'm much more on board with this paper now. "The above difference results from the fundamental difference in motivation: previous models aimed to improve density estimation by baking an orderless inductive bias into the model, while XLNet is motivated by enabling AR language models to learn bidirectional contexts." So I think the difference here is that XLNet is for language, and language is not random in terms of word order — as all of you know, as language users.
Oh okay, let's get our math hats on. "In this section, we first review and compare the conventional AR language modeling and BERT for language pretraining. Given a text sequence x = (x1, ..., xT), AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization" — this one — "where h_θ(x_{1:t−1}) is a context representation produced by neural models, such as RNNs or Transformers, and e(x) denotes the embedding of x." Andrew asks, "aren't they effectively finding an order that produces the maximum likelihood, which may not be the actual ordering in the training case?" I don't know — hopefully I will after we get through this formula.

Okay, so we're maximizing over theta the log likelihood — the log probability — of x: we want the text sequence to be as likely as possible, and we get that by summing the log probabilities over t from 1 to T, where each term is the probability of the token at position t given everything before it, going through the whole sequence. So if our sequence is "I like cars", we do: the probability of "I" given everything before it, the probability of "like" given everything before it, the probability of "cars" given everything before it. This is still basic autoregressive modeling. And we get each of those probabilities by taking the exponent of the context representation for all the previous words, against — fancy T... Ross says BERT is not an autoregressive model, XLNet is — yeah, BERT is an autoencoding model, not autoregressive. Fancy T — e(x_t), the embedding of x_t. I don't know what this symbol is. Energy says, "I'm scared of this big formula" — you and me both, buddy; I would almost always rather have things as code. These ones I know; I'm looking for the weird-looking T... okay, chat says it's a transpose — thank you, very helpful. The worst thing about reading formulas is when there's a symbol you don't know, because they're extremely hard to look up. I used to send screenshots to my friend who was in a math PhD program, like, "what does this squiggle mean?" — and then he left the math PhD program and didn't really want to talk about math anymore (completely understandable), and I didn't want to bother him too much. Oh, it's called tau? Fascinating, thank you.

Okay. So we have this matrix, which is the model's representation of the sequence before the token we're looking at, and then the embedding of that token, and we — flip-a-roo — transpose them onto each other, I guess, and take the exponent. That's for the given token, the one we're looking at as we step through; and then this other one is x′. William asks, "why are they taking the log of the exponent?" They're taking the log here because they want to maximize the log probability, and that's because it's easier to optimize the log than the raw probabilities, I believe. Neglect says, "this is really stressful to read" — that's okay, we're all here together. And so we're doing the same thing given x′ — what does that mean, where are we getting this from? Summing over — so this must be everything in our vocabulary, yeah. Okay, so this one is just an English "t", except it's not serif for some reason, unlike the rest of the fonts — so that's helpful. Andrew says it's easier to sum logs than to multiply probabilities — oh, that makes sense. (I'll write the whole objective out cleanly just below.)
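Here is my attempt at writing out, in one place, the AR objective we just walked through — this is how I'm reading equation (1), with the softmax over the vocabulary in the denominator doing the normalizing that chat pointed out:

```latex
\max_{\theta} \; \log p_{\theta}(\mathbf{x})
  = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})
  = \sum_{t=1}^{T} \log
    \frac{\exp\!\big(h_{\theta}(\mathbf{x}_{1:t-1})^{\top} e(x_t)\big)}
         {\sum_{x'} \exp\!\big(h_{\theta}(\mathbf{x}_{1:t-1})^{\top} e(x')\big)}
```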
Santosh says autoencoding models means you will use a transformer network — so, to be clear, this is not an autoencoding model; this paper is about an autoregressive model, but yes, the autoencoding model (BERT) did use transformers. Okay. Oh, Andrew says "normalizing" — okay, so the denominator is normalizing over all of the words in our vocabulary, I think. Yar says the log helps make distances comparable between numerator and denominator when you're working with fractions — helpful. I know somebody definitely stood in front of a whiteboard once and told me the reason why we use the log when we do probabilities and maximum likelihood estimation, and I can even remember the class: it was at 11:30, right before lunch, I was super hungry, and it went very quickly over my head. Oh, Andrew says the denominator is normalizing the sum to one — okay, I got you. So that means we're making sure it's a probability; that's what I'm getting there. Howdy asks, "converting something to log makes it more Gaussian, doesn't it — does it matter here?" It can, and I don't think that's the reason we use it here. Nicholas asks, "so what does the theta function mean?" This whole bit is the context representation that the neural model produces — the transformer or the RNN or whatever it is you're using — so it's just a matrix that you pop out of your model. And William asks, "is that over all the words that came before?" Yeah, basically, yep. Okay, I think I'm picking up what they're putting down. I don't know why there's a comma here — is that a misprint? It's not a typo? I don't know; there's a comma at the end. Andrew says if you have a lot of small probabilities you lose resolution, and converting to log helps — I think that also makes sense.

"In comparison, BERT is based on denoising autoencoding. Specifically, for a text sequence x, BERT first constructs a corrupted version x̂ by randomly setting a portion (e.g., 15%) of tokens in x to a special symbol [MASK]. Let the masked tokens be x̄." Is that how you say it? Alecin has a comment about the comma — okay, helpful. "The training objective is to reconstruct x̄" — I didn't know what that symbol was called — "from x̂ ... where H_θ, etc., etc." Oh, okay — the comma is punctuation before you continue; you're supposed to read the formula as part of the sentence. I don't think I could read that as a single sentence — maybe in five years, after reading similar formulas for a really long time. Very helpful. x-hat, x-caret — so x-hat, and then... okay, we knew that. Subhan says log is a monotonic function: taking logs turns multiplicative expressions into additions, and then taking derivatives is easy, unlike the usual product (uv) rule — I got it, right, ha. I'm going to call this one "x flat hat" (it's an x-bar, really) and this one "x hat", and that's just what I'm going to call them for right now; if that's confusing, I apologize.

Okay, so we're maximizing the log likelihood of x-bar given x-hat, which is approximately equal to a sum — and, using their notation, m_t = 1 indicates that x_t is masked. (I'd never seen that convention before; while they were all studying math notation, I was studying syntax, as in human languages.) So m_t is 1 when the token is masked, and otherwise it's 0, I'm guessing. So we're looking at the log probability of x_t given x̂, where x̂ is the corrupted version — we're given the corrupted version and we're trying to find the parameters that maximize it — and then we multiply each log probability by 1 when the token is masked. And that's the same as summing over the logs — oh, if it weren't a log we couldn't sum, we'd have to take the product; it makes sense now. And for each term we take the exponent of — "H_θ is a transformer that maps a length-T text sequence x into a sequence of hidden vectors H_θ(x) = [H_θ(x)_1, H_θ(x)_2, ...]" — transposed onto the embedding, and then divide by the same thing summed over all of the vocabulary terms, so, normalizing. So the difference between these two objectives is that instead of trying to guess the next token, we're trying to guess the masked tokens when they are masked — and if a token is not masked, we ignore it. That's what I'm getting here.
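And here's the corresponding BERT-style objective we just read, written out the same way: the ≈ is the independence assumption, m_t = 1 when position t is masked, x̂ is the corrupted sequence, and x̄ is the set of masked-out tokens.

```latex
\max_{\theta} \; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})
  \approx \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{\mathbf{x}})
  = \sum_{t=1}^{T} m_t \log
    \frac{\exp\!\big(H_{\theta}(\hat{\mathbf{x}})_t^{\top} e(x_t)\big)}
         {\sum_{x'} \exp\!\big(H_{\theta}(\hat{\mathbf{x}})_t^{\top} e(x')\big)}
```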
Ryan says he's been driving so he couldn't chime in earlier: the Microsoft model is called MASS, and he tried XLNet, BERT base and large, and GPT-2 small on Jigsaw — XLNet wasn't that good and was bigger than BERT-large. Interesting. Andrew says he knows a lot more of the math stuff and is still struggling — so I commend you all. William says he was studying creative writing — hi, I'm out of my comfort zone too; fun fact, my undergraduate degree was in linguistics and English literature. Ask me about Beowulf — actually don't, it's been a minute. And Siraj mentions Elements of Statistical Learning, which I hear is a very good textbook; when I tried to read it there were a lot of symbols whose meaning I didn't know, and I was confused and scared and stopped reading. Alecin asks, "what is x′?" I'm assuming x′ is each of the words in the vocabulary — I'm going to have to wait a while for the stream to catch up with me, but that's what I'm assuming.

"The pros and cons of the two pretraining objectives are compared in the following aspects. Independence assumption: as emphasized by the approximately-equal sign in equation (2), BERT factorizes the joint conditional probability p(x̄ | x̂) based on an independence assumption that all masked tokens x̄ are separately reconstructed. In comparison, the AR language modeling objective factorizes p_θ(x)" — which was the probability of the thing we're trying to get — "using the product rule that holds universally, without such an independence assumption." And some clarification from chat: it just means look at all the things and then sum, for the denominator. So x′ here is standing in for all of the x's — in the text sequence that we're modeling, or in our total vocabulary, I don't know which. See, this would be trivial if I had the code in front of me, and I know they have the code, but people don't usually publish the code as the paper.

"Input noise: the input to BERT contains artificial symbols like [MASK] that never occur in downstream tasks, which creates a pretrain-finetune discrepancy. Replacing [MASK] with original tokens as in [10] does not solve the problem, because original tokens can only be used with a small probability — otherwise equation (2) would be trivial to optimize. In comparison, AR language modeling does not rely on any input corruption and does not suffer from this issue." And more clarification from chat: it is summing over the total vocabulary — thank you, helpful. And Mohamed asks, "what does mask mean exactly?" So, the way BERT works is that you have a text sequence, you replace some of those words with — literally — this token, [MASK], and you're trying to guess what those words are given the [MASK] token and the rest of the words in the sequence. So [MASK] is something you feed into the model as you're training it, and then it never sees it later on, because it isn't a token in English — if that makes sense.
Ryan says, "not sure if they mention this somewhere, but I'm thinking if they use the sine-wave positional encoding, that might cause some ambiguity in the order of words." They do mention, very briefly, that they use positional encodings and that they're not an orderless model, even though the previous permutation models were orderless — but I'm not entirely sure how that's going to work; presumably they'll talk about it more when they get to their — oh, very pretty — their very nice model architecture diagrams.

"Context dependency: the AR representation is only conditioned on the tokens up to position t, while the BERT representation has access to the contextual information on both sides. As a result, the BERT objective allows the model to be pretrained to better capture bidirectional context." So they're saying the problems with BERT are that it assumes all of the masked tokens are independent of each other, and also that there are [MASK] tokens you're training on that never occur later; but the benefit is that instead of only looking backwards, you have information about the entire text sequence.

And that's where I think I'm going to wrap it up, because — yeah, that's time — I don't think I have time to keep going; I've got a lot to do today. I am really interested to see how they're going to combine permuting the factorization order with positional encoding in a way that makes sense to me, but I think it's an interesting paper — very much more of a machine learning paper than an NLP paper, even though they're working on language — and I'm glad that they were like, "hey, sometimes in language, order is important," because they're correct, it is, and it's good to hear machine learning researchers talking about that.

All right, well, thank you for joining, everybody; that's going to be it for this week. George asks, "what does ambiguity mean in this context?" I know they mentioned it — I think we'll get to it when they talk about it a little more later on; I don't know precisely what they mean by it in this context either, so we're together in that. If you're in the U.S., enjoy your holiday; I know Canada Day was also this week, so happy Canada Day. I will not be back on Friday — I'll be out in the world doing things that aren't work — but we will be back next Wednesday, and then also the Friday after that. We do have another SQL Summer Camp coming up, July 16th through 18th, so if you come on the 17th for the reading group you'll instead learn about SQL — just be warned — and we'll pick the paper back up the week after that. Someone asks what's happening on Friday: the 4th and 5th are holidays in the United States, so I'm going on holiday. Yeah — it would have been a lot more work to get through this without y'all's help; I really needed it, because, again, looking up mathematical symbols is really difficult. Okay, I'm going to see you guys next Wednesday — I'll talk to you then, bye!
Info
Channel: Kaggle
Views: 4,131
Rating: 4.8113208 out of 5
Keywords: data science, deep learning, nlp, neural networks, nlu, natural language, python, programming, coding, machine learning, ai, artificial intelligence, kaggle, research, technology, reading group
Id: Ubh03Sh45b0
Length: 62min 10sec (3730 seconds)
Published: Wed Jul 03 2019