Kaggle Reading Group: Universal Sentence Encoder | Kaggle

Captions
good morning I'm gonna do my customary chit chatting for a second while I wait for the stream to catch up with me and I'm actually gonna take these out so I'm not monitoring not ones act like actually make sure yeah yeah okay that's all good excellent oh good morning it is 9 o'clock and I've already had a busy morning I'm working on a lot of little projects right now it's very uh lots of moving parts but the thing that we're doing today I've gotta clean my glasses the thing that we're doing today is starting a new paper and I left you guys on tenterhooks it was a little bit of a surprise so it came down to a one-vote difference between YouTube and Twitter the two places I was having the voting and it was 93 votes for a fishing net 94 votes for universal sentence encoders so it was a super duper close vote and it was actually really interesting Twitter had a really strong preference for efficient net but YouTube had a really strong preference for universal sentence encoders so that's what we're doing and the really cool paper that I would love to read might just read on my own about multilingual translation with low resource languages Kamath third oh it's a cool paper I would recommend it if you are interested before we get started on this paper it's out of the let me make sure I have all my chats on it's out of Google research so some folks were in Mountain View some folks were in New York some folks were in Cambridge not at Google research I guess probably in a different team this has not been published as far as I can tell so it's like the the GPT 2 paper and the last paper we read on sparse transformers so be warned caveat emptor epitope buyer beware whatever Latin for buyer beware is we're gonna have to have to be a little bit more skeptical here yeah and the the paper link is on archive since again it's not published and also they have their modeling code on github in the tensorflow repo and he says vote is ironic since I'm done hello everybody oh and he says I'm using use to analyze Twitter Matthias says first time a question how is this working so we are going to read this paper together and we're gonna start at the top and I was just sort of introducing the paper and we've talked about this a little bit on the Friday live stream where I'm working on an NLP project and it's one of the things that was recommended to me for text encoding so let's get started and I'm literally just gonna read it and if you have questions or things that you find confusing pop them in the chat and I'll try to get to them and if I don't know hopefully someone will the chats generally very helpful on these streams all right universal sentence encoder and this is by Daniel sir care maybe yun feiyang Cheng Ying Chun nanhua Nicole Lim Lim choco Lim taco taco taco Oh probably say John Noah constant Mario Cart Oh sis patties Steve one yuan Chryst are unison sung by Ryan stroke stropping Stroupe probably that's true she passed she rose and Ray Kurzweil and if one of those were your names when I did it bad I'm sorry ELISA is can you send the link of this paper I will do that right after the stream I'll put it in the description of the YouTube universal sentence encoder for me at least it was the top result so hopefully you'll be able to find it Newton says what about the efficient net paper this one got one more vote I maybe I'll put the efficient net paper in the next voting round as well and see if people are more excited about it next week or probably in two or three weeks I will I usually let people vote for 
the paper because if it's me it's gonna be like my super news favors that I love but yeah I will definitely add it to the vote all right let's get started abstract we present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks so embedding vectors are super popular they're based on the what's it called the co-occurrence hypothesis I think so there's this quote you shall know a word by the company it keeps and basically the idea is you look at the co-occurrence of words in some text span you use that to create a co-occurrence matrix and then you project that matrix down into a smaller space using some sort of dimensionality reduction technique and then that can sort of tell you what words are likely to be used together and the idea is that this tells you something about semantics so words about animals are more likely to be used in the same context words about finance are more likely to be used in the same context that's sort of the general idea and this is sentences in particular so I may actually zoom in a little bit so y'all can see you a little bit better I do one more I can so often people talk about word embeddings and those are individual words so you would take a whole whole lexicon a whole like a dictionary for language you're working in I need to have an embedding for each word and then this is an embedding for a sentence and the idea is that you get sort of like the meaning of a sentence with these these embeddings AJ says how close was it one vote it was I haven't written down here 94 sentencing coders 93 efficient net so very very close link to the paper actually since a couple people asked I'm gonna do that on my phone right now and I will try to read and find links at the same time I don't think this is gonna go well universal I've also definitely been saying universal sentence embeddings I think it's universal sentence encoder yeah so if I say embeddings that's not that's not the paper title give me the paper okay it looks like it's also the PDF is also on what is this the Google research pubs archive interesting on my phone it is not actually the top result to the archive paper well that's loading the models are efficient and result in accurate performance on diverse transfer tasks so you don't have to train separate models for everything you can train one model and then use it on a bunch of different tasks which is good and like good for the environment too which is nice let me copy this go to YouTube I'm getting there you'll have to link in the chat Oh somebody already sent it okay in that case I just need to make sure that I approve it because I think by default there we go by default links won't show so you guys should have the link now Thank You Ellie yeah and I mean the accurate performance presumably people don't publish papers about models that don't work which is in and of itself kind of a problem it's called the file drawer problem where you don't know what doesn't work so you might try it because nobody published a paper saying it doesn't work and then it doesn't work and you've wasted your time and then the next person also does tries it and find out doesn't work because you also didn't publish the paper saying that didn't work so sort of a problem in science in general two variants of the encoding models allow for trade-offs between accuracy and compute resources so I'm expecting a big model and a small model for both variants we investigate and report the relationship between model complexity 
resource consumption the availability of transfer tasks training data and task performance okay so there if you have like more training data maybe you can do this to get better results if you have less compute how can you get good results with that constraint ah hello comparisons are made with baselines that use word level transfer learning via pre-trained word embedding so that's why I mentioned you have a different embedding for every word in your dictionary as well as baselines that do not use any transfer learning so training custom and betting's usually is the is the thing that people would do there and we talked about that in the reading club a really long time ago it's a video that's called like neural methods speech and language processing I think your speaking language processing chapter five so if you're curious that's sort of what they're comparing against we find that transfer learning using sentences sentence embeddings tends to outperform word level transfer hmm I feel like that is probably gonna work better if you have training data that's from a similar domain as your test data so if you're doing something in a language where the only large text data set you have is like a translation of the Bible which is super common because there's a there's a big nonprofit organization that goes around and translates the Bible into a bunch of different languages and they produce a lot of linguistic data regardless of how you feel about mission work in general it's you know a good resource but if your test data is not the Bible I'd be interested how well this works for really low resource languages with transfer learning via sentence embeddings we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task okay so I'm guessing this will be sort of like Bert we trained a big model we have a little like wafer layer on top that's like hey this is a specific task we're doing and then we get good results maybe we obtain encouraging results on the word embedding Association tests weit targeted detecting model bias Oh interesting our pre trained sentence encoding models are made freely available for download and on TF hub so here's the tensorflow repo and it's also on TF hub but that one said that there was a new updated version and I couldn't click on the link to the new updated version maybe because I my fingers are bad I don't know but that's also available okay Christopher says what are they testing when they mention a baseline to test against so based on is abstract I'm guessing that their baselines are pre-trained word embeddings rather than sentence embeddings and custom embedding so training your embeddings at the same time as you're doing your modeling for whatever task it is that you're using and because they're doing transfer learning probably also comparing against tasks specific models that don't have that access so I guess that would be the same as like training custom embeddings he's my guess introduction limited amounts of training data are available for many NLP trust tasks true especially if you're not working in English and based on this it looks like they're working in English I would be interested to see if they say that words Oh in the paper somewhere given the high cost of annotating supervised training data very large training sets are usually not available for most research or Industry and LP tasks again true many model many models address the problem by implicitly performing limited transfer learning through use of 
pre-trained word embeddings such as those produced by word Tyvek ones like let me give me my text century there we go word to back or glove or fast text that's another popular embedding choice however recent work has demonstrated strong trapped transferred task performance using pre-trained sentence-level embeddings con you at all and I'm pretty sure that's just the sentence embedding paper sorondo says good resources for the Spanish language um I would maybe check out the Proceedings of what is the conference called it happens two years Elric lr ec language resource evaluation conference see and they have a lot of multilingual datasets that are presented there and also on english so that might be a good place to check su bomb says would this be good for an unsupervised learning tasks of seeing sentence similarities presumably i don't know if they're gonna talk about it or not but my guess would be yes sorry i got dry lips i had a busy morning this morning i didn't have time to complete my moisturizing routine okay andy says anecdotally I've been surprised at how little additional benefit I get from using use embeddings to identify gender of twitter users interesting are you identifying gender from names or from the texts that they produce because I would first of all identifying gender sorry this is a catcher first of all identifying gender is a tricky problem to begin with because the only good baseline data is people's self-reported gender and gender is very complex and secondly ah yeah that's my soapbox if you wanted to know people's gender the best way to do it is to just just ask them Hector sighs polygons for Python sorry Hector has nice basic sediment beddings for different tasks in Spanish not sure of the dialect though and then Andy sighs from text hmm this is one of those papers where if somebody sent me the paper I'd be like why are you doing this do you need to do this is this important for reasons that are good ones there's a lot of sort of like inferring demographic work that makes me feel extremely uncomfortable all right I'm trying try not to get too sidetracked in this paper we present two models for producing sentence embeddings that demonstrate good transfer to a number of other NLP tasks we include experiments with varying amounts of transfer task training data to illustrate the relationship between transfer task performance and training set size okay so they're going to give us a big model in a small model I'm assuming much like Bert Bert big and Burke small I think also transformers there was like transformer large transformer extra large I believe and then they're going to change the amounts of specific training data for each task so presumably they're gonna train their their basic you know sentence embeddings on an enormous data set that you might not have for all languages and then they're going to fine-tune it and then they're going to look at how different amounts of specific labelled training data for different tasks affect performance right Christopher says one size over fits all we find that our sentence embeddings can be used to obtain surprisingly good test performance with remarkably little task specific training data the sentence encoding models are made publicly available on TF hub excellent code available trained models available always happy to see that engineering characteristics of models used for transfer learning are an important consideration we discussed modeling trade-offs regarding memory requirements as well as compute time on CPU and GPU 
"Resource consumption comparisons are made for sentences of varying lengths." I do like that. This is a Google paper, and Google researchers have access to a lot of compute power, time, and money that most people don't. Even I — Kaggle is owned by Google, and I work at Google — can't just say "I need 40 years of GPU time for my fun little chatbot that I'm designing"; I would need an extremely strong business justification for that. So it's very realistic to have limited compute resources, and I appreciate that they're thinking about that rather than just going "bigger model, more compute, more power, state of the art." All right. And then — oh, that's really nice — they have a little bit of sample code in here for how to actually use the model: import tensorflow_hub as hub, then embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1") — I believe they're up to version 2 now, or possibly 3, so there have been some additional releases — and then embedding = embed([...]) with a list of the text strings you're interested in embedding. So that's extremely straightforward, and if you were going to use this on Kaggle, I'm pretty sure you'd want to make sure your internet connection was on, or it wouldn't be able to fetch the model. AJ says, "The big question: is it better than BERT? It seems like that's something the authors would have mentioned in the abstract/introduction — or is there something fundamentally different between the two?" That's an excellent question. I believe this was actually published before the BERT paper, so let's go back to arXiv. BERT came out in 2018 and was officially published in 2019, this year, at NAACL, which was last week, and this paper was submitted in March of 2018, so I believe it predates BERT. I don't remember all the authors on the BERT paper, but the first author of the BERT paper is not on this one, so they may be separate research projects. Is it better than BERT? I don't know — maybe something we could investigate; that seems like an interesting question. And William says, "Could this be used to create a short story creator?" Probably — pretty much any sort of embedding could help with that. My gut tells me yes, but I don't know enough about the model to say whether it would be easy. All right, Model Toolkit — and that's the right amount of zoom. "We make available two new models for encoding sentences into embedding vectors. One makes use of the transformer (Vaswani et al.) architecture" — we read that paper, if you want to go back and brush up on it — "while the other is formulated as a deep averaging network (Iyyer et al., 2015)." I've never heard of this — what's a deep averaging network? Are you going to tell us about it? Okay, yeah, there's a whole section on it; I'm not going to go off on a tangent now. "Both are implemented in TensorFlow and are available from TensorFlow Hub" — and here is the link, the one I couldn't click earlier for whatever reason, maybe I'm double clicking. Anyway, they're on to version 2 at this point — okay, you just change the version number in the URL, so the most recent version is available. "The models take as input English strings" — okay, English only, these are English models. There are other languages in the world, but I get why you would work with English: there's a lot of training data available — "and produce as output a fixed dimensional embedding representation of the string. Listing 1 provides a minimal code snippet to convert a sentence into a tensor containing its sentence embedding."
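So roughly, the Listing 1 snippet read out above would look something like this — a minimal sketch, assuming the TF1-era hub.Module API and the version-1 module URL from the paper (newer releases use hub.load and a later version number), with example sentences of my own:

import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TensorFlow Hub (version 1, as in Listing 1).
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")

# Embed a list of text strings; each sentence becomes a 512-dimensional vector.
embeddings = embed([
    "The quick brown fox jumps over the lazy dog.",
    "I am a sentence for which I would like to get its embedding.",
])

# With the TF1 API, hub.Module adds ops to a graph, so variables and lookup
# tables have to be initialized and the embeddings fetched inside a session.
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = session.run(embeddings)
    print(vectors.shape)  # (2, 512)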
"The embedding tensor can be used directly or incorporated into larger model graphs for specific tasks." So instead of having the output of your embedding be a matrix of dimensions sentence length by embedding length, you have just one embedding for each sentence — a more compact way of representing the information. It does not — that's my chair behind me, it's fine, I've got a hoodie on it — it's not going to contain all of the information you would get from a sentence; it's a compression of the semantic information. It won't help you understand things like negation, so "I walked the dog" and "I didn't walk the dog" would presumably have very close sentence embeddings. So it's important to know that there are limitations to this type of model. Someone says, "This is a vector? I thought embeddings are a matrix." It says tensor here — anyway, the idea is that there are fewer dimensions if you don't have to have one vector for every single word and can instead have a single object for the whole sentence. I think it says tensor, which would be three dimensions, maybe. "As illustrated in Figure 1, the sentence embeddings can be trivially used to compute sentence-level semantic similarity" — which somebody asked about — "achieving excellent performance on the Semantic Textual Similarity (STS) benchmark (Cer et al., 2017)." That's close to the thing I've been working on. It's their own benchmark, which is fine — if you make a resource, presumably you're making it for your own use in addition to everybody else's. Someone asks, "What would a word-based approach look like — how do you combine embeddings?" Usually, if you have word vectors, you can bind them into a matrix, and then you either have a sliding window over your text — looking at, say, four words at a time for something like an RNN or an LSTM — or you have a set maximum sentence length and pad beyond the actual words. So say my maximum sentence length is 50 and I have a sentence 20 words long: I'd have a word embedding in each of the first 20 slots, and the last 30 slots would be empty, if that makes sense. Those are the two common approaches to using word embeddings, and then you just stick that into your model and push it through. I kind of think of neural nets like a potato ricer: you squish your data in and the riced potato comes out. This might be a regional thing — it looks like this, you put potato in the top and it comes out of all these little holes, nice and fluffy, good for mashed potatoes. Anyway, that's the metaphor, for those of you who don't use potato ricers. "As illustrated in" — still talking about Figure 1 — "when included within larger models, the sentence encoding models can be fine-tuned for specific tasks using gradient based updates." So as you're training your model you can say, hey, we want to update our encodings too. It's sort of similar to the multitask paper we read, where you're updating the embeddings with each pass, except which task updates the embeddings is different — here you're peeling apart each model and then doing fine-tuning on your embeddings for
that model you're not updating all of your embeddings with each different model you're training I think that was it wasn't the last papers the paper before last was the multitask learning so let's look at figure one so the point of this figure is that we've got two sets of sentence we got three pairs of sentences and they want us to see that the similar sentences have a high semantic similarity to each other and the dissimilar sentences have a low semantic similarity on a scale of 0 to 1 where 0 is presumably no similarity and 1 is it's the same sentence so the sentence is I like my phone and your cell phone looks great are very closely related to each other it looks like obviously they have a relationship of one with themselves and then about five point five with each other will it snow tomorrow and hurricanes have hit the US have a much have a lower similarity but still higher than with two other random sentences and it looks like that's a round a point for maybe 0.3 and then how old are you and what is your age is very high semantic similarity so maybe around looks like 8.5 ish it's kind of hard to read the the gradient it's very common to to show heat maps with this type of thing Christopher says it was able to detect an interrogative interrogative mood no so why we did not be able to encode in negation I don't think it's actually detecting mood so mood is a it's a grammatical term to distinguish with things between like questions and commands it's possible that they are including punctuation in their model I don't know that that is the case but if they are then it might just be there's a question mark and we think that that question marks are very highly related but you and you are very close we've got two forms of to be that presumably are gonna have very similar embeddings how and what are presumably going to be pretty close to each other because those are both sort of in the same grammatical family and then old and aged I'm imagining are very close to each other so even if these weren't questions so if I said what is your age and I don't know I think I think very similar that's not a question I am old I would expect those things to be pretty closely related yeah I think it's these are probably exceptionally good examples that they have fixed to show off their model um I would be very surprised if they were doing any sort of parsing or mood as part of this pipeline yeah I'm guessing it's probably getting the question marks and also those these sentences are just very similar in meaning so we'll find out though okay encoders so encoders you may remember from the the last paper we read the transformer paper only used the decoder part of the transformer so the transformer was designed for it's a neural network architecture and it was designed specifically for machine translation and the way that most machine translation and other text input to text output models work is that you have an encoder which takes your original text input and turns it into a bunch of numbers and then you have a decoder that takes that bunch of numbers and turns it into a text output so traditionally in the middle you'd have something that people would call like thought vectors sometimes which i think is sort of a weird weird terminology that is over representing the power of the model so the encoder takes word number the decoder takes number to word and I'm assuming here they're only going to be using an encoder as opposed to the last paper where they only use the decoder so that's sort of what the the encoder 
means here so they have two encoding models where do oh I missed I completely blanked out on these foot footnote so one the encoding model for the Dan based encoder is already available the transformer based encoder will be made available at a later point and pretty sure it's already available and to visit collab dot researched at google.com to try the code snippet in listing one example code and documentation is available on the universal encoder website provided above cool and this would also work on kaggle if you had internet access on your kernel turned on I'm assuming okay I don't want to take a detour and play around with the code I'm gonna get we're going to get through the favorite our two encoders have different design goals one based on the transformer architecture targets high accuracy at the cost of greater model complexity and resource consumption that makes sense encoders have a lot of weights sorry encoders transformers have a lot of weights but you don't have to train them sequentially you can train them in pairs so training time isn't as long as it is for things like rnns or LST M's where you have to do time point one time point two time point three in that order the other targets efficient inference with slightly reduced accuracy okay so this could be something that would be good to maybe put into a production pipeline where you want to make you know calls very quickly so you want to do your embeddings very fast interesting but they don't talk about training time I'm assuming training time is gonna be pretty hefty for both of these years says supposing the model is used in comparing similar questions in Cora for example Oh is forward example yeah you measure similarity by comparing the Euclidean distance or other distance between the two embeddings I feel like people usually use cosine distance but yes that's the general idea and I think they also talked about the word embeddings association test which also looks at distance but in a fancier way I sort of know that the test exists but I'm not super up on the math all right transformers the transformer tsubame says cosine similarity is better and andy says you don't need to do much training that's true if you're working with English because they have provided the English models but if you're using this as a blueprint for your work and let's say Portuguese you do gonna have to train your own models because they did not give you Portuguese language models or like Swahili or I don't know Bharati or whatever language it is that your you're working on that's not English because there's a lot of languages that are in English sorry I'm an English user clearly I benefit from there being a large amount of training data and models built for English but I'm also a linguist and it makes me a little bit salty there's there's a lot of people out there they use other languages that could also really benefit from having access to and I'll Pete tools okay transformers the transformer based sentence encoding model constructs sentence embeddings you the encoding subgraph of the transformer architecture this subgraph uses attention to compute context-aware representations of words in a sentence the take into account both the ordering and identity of all words hello hello so instead of a bag of words model where you ignore order transformers explicitly in code order someone is mowing hopefully you guys can hear that the context aware word representations are converted to a fixed length sentence encoding vector by computing okay here's the 
Here's the question somebody had: "the element-wise sum of the representations at each word position." A fixed-length sentence encoding vector from an element-wise sum — so we're adding up, element by element, the representations at each word position. How is that fixed-length — oh, I see, okay: each word representation has a fixed length, and we're just stacking them all up and adding them. Does that seem right? Let's check footnote 3: "we then divide by the square root of the length of the sentence so that the differences between short sentences are not dominated by sentence length effects." Yeah, they're just adding. They're adding the context-aware embeddings for each word — sort of like stacking them up and squishing them down, where the stacking up is the adding and the squishing down is dividing by the square root of the sentence length, so that long sentences and short sentences end up with comparable weights. That's what I'm getting from this. "The encoder takes as input a lowercased PTB" — Penn Treebank, from the University of Pennsylvania, a really famous dataset, and they have a tokenizer — "PTB tokenized string and outputs a 512 dimensional vector as the sentence embedding." Someone asks, "Concatenating or adding?" They say sum, so presumably adding. Yeah, when I said stacking, that would have implied concatenation — I guess it's more like you stack them, add them down into one vector, and then scale by the length of the sentence; the sandwich metaphor isn't working great. Kumar says, "I thought the same, looking at all the possible word representations and new terms coming up in this paper." That's a good point — I don't know how well this deals with out-of-vocabulary stuff.
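So a minimal sketch of that pooling step, as I'm reading it from the text and footnote 3 — the per-word context-aware representations here are just a random stand-in, not the actual transformer output:

import numpy as np

def pool_sentence_embedding(word_representations):
    # word_representations: shape (sentence_length, 512), one context-aware
    # representation per word position coming out of the transformer encoder.
    n_words = word_representations.shape[0]
    # Element-wise sum over word positions, then divide by sqrt(sentence length)
    # so short and long sentences end up on a comparable scale (footnote 3).
    return word_representations.sum(axis=0) / np.sqrt(n_words)

# Toy usage with random stand-in representations for a 7-word sentence.
fake_reps = np.random.randn(7, 512)
sentence_embedding = pool_sentence_embedding(fake_reps)
print(sentence_embedding.shape)  # (512,)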
The encoding model is designed to be as general-purpose as possible. "This is accomplished by using multi-task learning, whereby a single encoding model is used to feed multiple downstream tasks." Okay — we talked about multi-task learning two papers ago, so it's the same sort of idea. The tasks they use to train the embeddings are "a skip-thought like task for unsupervised learning from arbitrary running text, a conversational input-response task for the inclusion of parsed conversational data, and classification tasks for training on supervised data." Okay — let me hydrate. I've definitely heard of skip-thought, but I don't actually know what it is. Skip-thought vectors — can I get any images of what they look like? These are all from the "My thoughts on Skip-Thoughts" blog post, so let's head there: "Skip-thought vectors is the name given to a simple neural network model for learning fixed-length representations of sentences in any natural language without any labeled data or supervised learning. The only supervision/training signal skip-thoughts use is the ordering of sentences in a natural language corpus." They're fixed length, which means you can replace any sentence with an equivalent-sized vector, so it's very straightforward to feed into your neural network. Some example applications are sentence similarity and sentiment classification, and then there are these vector-arithmetic tasks where you can do analogical comparisons. Okay, interesting. So the encoder gets a number of sentences — and I don't know what this z_i is; okay, it's the embedding — and then it generates the next sentence and the previous sentence, and it uses the ordering of the sentences to self-supervise: content that leads to a better reconstruction of the neighboring sentences is also the essence of the sentence. I see: you take a sentence, run it through your encoder, attempt to generate the previous and the next sentence, and use the information about how well you generated them to update your weights. That makes sense to me; hopefully it makes sense to y'all. So this task is "can I use my sentence to generate the previous and next sentences in a text?" Then "a conversational input-response task for the inclusion of parsed conversational data" — this would be like a chatbot thing, and I'm guessing you're trying to generate a response as close as possible to the actual next turn. And "classification tasks for training on supervised data" — are they going to say what those are? Maybe when we get to the training section. "The skip-thought task replaces the LSTM used in the original formulation with a model based on the transformer architecture." Okay, so they're doing this whole setup, but instead of using LSTMs in the encoder they're using transformers. "As will be shown in the experimental results below, the transformer based encoder achieves the best overall transfer task performance. However, this comes at the cost of compute time and memory usage scaling dramatically with sentence length." That memory use is specifically what the Sparse Transformers paper we read last time was about: instead of having full attention everywhere, you either have sort of stripes of attention or a little sliding window of attention, which means you don't have to store and update all of the weights, and it scales, I think, more gently with sentence length — so it's much better in terms of memory use. AJ says, "So skip-thought is almost like bidirectional language modeling, with sentences instead of words, right?" Yes — well, hmm. The way in which it might not be similar to language modeling is that language models are probabilistic: a language model gives you a distribution over possible words given a context. So if you're generating a sentence and you say "Rachel really likes to pet ___", words with a very high weight might be things like cats or dogs or hedgehogs. I don't necessarily know that that's what's going on in skip-thought — again, I'm not super familiar with that work — but it does seem very similar in that you want to be able to generate the surrounding context. Presumably it is probabilistic; there's no way someone would try to build it otherwise. This really bothers my husband — he's a software engineer, and every time I talk about models being non-deterministic he gets very upset: "all computer models are deterministic." Yes, I get that at the hardware level, but not in the way we use the terminology. Christopher says, "Hedgehogs?" Yeah, I have a
hedgehog he's he's asleep right I don't wake him up mom says they give softmax but here it gives prediction of words I'm not entirely sure what you mean softmax must be probabilistic only yeah I don't know I'm losing the math I'm thinking about it and it's just sort of like slipping out of my head yes let's think about it as a language model that uses sentences okay I'm presumably will talk about this more when we get into the nitty-gritty of the architecture although we're pretty far into is paid for yeah yeah okay we're gonna have a little bit more a little bit more discussion it looks like all right deep averaging network so this I've never heard of the second encoding model makes use of a deep averaging network dan I are at all 2015 whereby and been input embeddings forwards and by grams are first averaged together and then passed through a feed-forward deep neural network to produce sentence embeddings so you start with embeddings that are pre-trained presumably I guess or you're training them yourselves then you average together the word level embeddings and the bigram level embedding so a sliding window of looking at two words at a time ah and then you take that and you put it through a deep neural network to produce sentence embeddings so this almost feels like ensemble in a little bit to me um I would say stacking or blending but I treated about that last week and what I've learned is that I have no idea what the difference between stacking and blending is and maybe no one else does either or other lots of people have very clear ideas about the difference but those people's ideas are very different from each other so but sort of similar in that you have you have a model you have a train model you're doing some squashing of it together and then you are taking that as input to a new network okay similar to the transformer encoder the Dan encoder takes as input lower case PT be tokenized string and outputs a 512 dimensional sentence embeddings okay so if you are just taking these architectures and putting stuff in taking stuff out you can use the exact same pre-processing and then post-processing setup which is nice that is convenient ah Kumar says stacking does not include the original training data but blending does I think my understanding was actually the opposite I don't know I was trying to come up with a memnon ik because I kept getting confused but that is one way that those terms are you maybe that was the way I don't know it's confusing similar similar to the transformer encoder oh there we go the Dan encoder is trained similarly to the transformer based encoder we make use of multitask learning whereby a single Dan encoder is used to supply sentence embeddings from multiple downstream tasks the primary advantage of the Dan encoder is the compute time is linear in the length of the input sequence oh that's pretty good similar to IR at all our results demonstrate that Dan's achieve strong baseline performance on text classification tasks and I believe I believe transformers some flavor of exponential time so linear time is a dramatic improvement ah alright and they said it worked good I'm interested this this ir paper i have not heard of previously that doesn't mean it's not a good paper it's just i didn't hear about it h oh here we go oh how domain did it how is great or he was one of the authors in a CL okay so it was published in a super reputable venue and it was called deep unordered composition rival syntactic methods for text classification okay so this is a way of 
based on this title it seems like they're trying to get or do create a method that lets you have avoid having to do parsing which is pretty computationally intensive even particularly efficient parsing methods tend to be pretty slow just because it's a lot of sort of like graph transversal and searching cool dance deep averaging networks is that what it stood for I'm pretty sure was deep averaging networks deep averaging Network yeah AJ says I thought quadrat I thought transformers were quadratic that makes sense because it's like a or maybe sometimes there's just square root of two - I don't remember man they're they're pretty memory and time intensive quadratics that's right I don't know okay linear time encoder training data unsupervised training data for the sentence encoding models are drawn from a variety of web sources the sources are Wikipedia web news web question answer pages so they may actually have included some of the Quora data and discussion forums I'm guessing that they mean reddit here based on my knowledge of NLP research Andy says if it's networks with nonlinear activations how can it be finding averages I believe that they just take the averages of the embeddings like just like before it goes into the model they're just like averaging them together and then in they go is my my understanding of the way that this works Dora says self attention stages go through each word for every word quadratic makes sense that was my thinking but I will admit I just sort of I only really pay attention to like memory and time when it's incredibly big and then I'm like okay I guess I'll think about it it's just not super I have a very utilitarian view about algorithms I'm not super motivated by finding the most elegant possible solution which is why I'm a data scientist and not a software engineer we augment the unsupervised learning with training on supervised data from the Stanford natural language inference corpus which talked about quite a bit Boman at all similar to the findings of Khan you at all we observe that training on SNL I improves transfer our performance and I believe this is I believe this is the one where you have to you take two sentences and you say whether they won in oh my god what are the logical relationships contradicts no relationship implication I feel like it's not implication though Oh what's the term - hi my semantics textbook here it would be faster to look it up I do my semantics textbook oh it's a little bit green invisible textbook I'm assuming it's gonna be a chapter title so I'm just flipping through quantifier notations modality and beauty logical connectives negates woooo negates and it's either implies or in first I don't know anyway so that's the corpus and they found that it improves performance and I always wonder I always wondered when people add specific datasets or like this specific data set helped to what degree that specific dataset helped and to what degree any data set would have helped so just something I just something I think about transfer tasks this section presents an overview of the data used for the transfer learning experiments and the word embedding Association test data used to characterize model bias I really do like that they did that for these data sets we use the preparation provided by cono at all mr movie review snippets sentiment on five-star scale CR sentiment of sentences mined from customer reviews and these are so far these are pretty I think most of these are pretty standard and I'll P datasets and again all right 
English entailment thank you Andy yes that was the other one so logical entailment is if sentence a is true then sentence B must be true so if it is true that I have five hedgehogs I don't it is also true that I have three hedgehogs it's just pragmatically bad for me to tell you I have three hedgehogs if I have five hedgehogs I have one Hedgehog there's plenty of Hedgehog sabja scent subjectivity of sentences from movie reviews and plot sentences so I think this is a different which we like is this fact-based like brie Larson played Captain Marvel or is this an opinion like Captain Marvel is the best superhero and I love her does everything mpq a phrase level opinion polarity from news data TREC fine-grain question classification source from TREC SST binary phrase level sentiment classification an STS sentient sentiment textual similarity between sentence pairs scored by Pearson correlation with human judgments and I think this was a I think it would so like a five or a three point scale and then they use inter-rater reliability to see how close they were and then wheat we at word embedding association test word pairs from cycle a cycle ology psychology literature on implicit association test s-- that are used to characterize model bias and this comes from a lot of the implicit association literature in psychology I know that Brian Novak's lab at UVA has done a lot of work on this and also the Open Science Foundation has a lot of their data sets and modeling code so if this is something you're interested in from a behavioral standpoint that might be a good place to look Christopher says the t-rex thing sounds similar to your project fine green it does yeah we got about five minutes I'm gonna go on a little little detour here from University of Illinois at urbana-champaign in the middle of nowhere great University just hard to get to freeform factual questions one needs to understand the question at a level that allows determining some of the constraints the question imposes on a possible answer the constraints may include a semantic classification of the sought after answer and may even suggest using different strategies when looking for and verifying a candidate answer so it's a hierarchical classifier that does sound good guided by layered semantic hierarchy of answer types and eventually cacif questions and defined grading classes we show accurate results on a large collection of freeform questions can I get an example of what your question hierarchy is so they're looking for abbreviation entity description human location and numeric value and presumably this is the sort of thing that you would get as the answer so abbreviation or explanation for entity animal's body currency religion vehicle that sort of thing for descriptions definition manner reason for humans groups individual title description locations city mountain etc three four mountain numeric code date distance that sort of thing okay I don't think this hierarchy would be especially helpful for me but it is an interesting way of thinking about questions and more specifically the type of answers a question is looking for so might be a cool thing for me to check out thanks for pointing that out I think we can get through the next little bit so they have the evaluation sets here so the training is on t rec STS the benchmark and SST dev which is what people use to evaluate during training I believe is this and then test is the final set that it's going to be evaluated against transfer learning models actually I'm going to well we 
can try — we may need to revisit this next Wednesday. "For sentence classification transfer tasks, the output of the transformer and DAN sentence encoders are provided to a task specific DNN" — deep neural network. "For the pairwise semantic similarity task, we directly assess the similarity of the sentence embeddings produced by our two encoders. As shown in Equation 1, we first compute the cosine similarity of the two sentence embeddings and then use arccos to convert the cosine similarity into an angular distance." Okay, so this part is the cosine similarity, I believe, and then arccos — this is very trigonometry, and it's been a minute since I took trig. Okay, so it's the arccosine, the inverse cosine function of x for x between −1 and 1: when cos(y) = x, then arccos(x) = cos⁻¹(x) = y — which is literally just saying it's the inverse cosine. Okay, so they take two vectors, compute the cosine similarity, and then take the angular distance between the two. I think that makes sense to me, shaking off the very dusty trig in my head. Kumar says the inverse cosine turns it into an angle in radians — okay, so our similarity measure is going to be in radians; thank you, that's helpful. All right, it is 10, so let's call it a day here and come back together next Wednesday to pick up at Section 5.1, Baselines. And baselines — somebody asked: one baseline uses word-level embeddings, and the other uses no transfer learning, so presumably custom embeddings for that specific task. Andy says, "Why would you use angles rather than correlations?" Hi, Andy. I'd probably just use cosines if it were me, but I'm sure they had a good reason, and maybe they'll talk about it — they do have a footnote here. Oh, the footnote answers your question: "we find that using a similarity based on angular distance performs better on average than raw cosine similarity." So there's your answer: welcome to machine learning research — we tried it and it works better, for unknown reasons. I guess do both and use the one that works better; it's not expensive to compute, so it shouldn't take too long. Hector says the inner product is similar to cosine, so inverting it seems natural — maybe, yeah. I don't think this should make or break the model, though; it's just the way they're looking at semantic similarity specifically.
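So as a sketch, Equation 1 as I'm reading it — cosine similarity, then arccos to get an angular distance; the exact rescaling to a 0-to-1 score (1 minus the distance over pi) is my assumption about how the distance is turned into a similarity, not something quoted from the paper:

import numpy as np

def angular_similarity(u, v):
    # Cosine similarity of the two sentence embeddings.
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against floating-point values just outside [-1, 1].
    cos_sim = np.clip(cos_sim, -1.0, 1.0)
    # arccos converts the cosine similarity into an angular distance (radians).
    angular_distance = np.arccos(cos_sim)
    # Rescale so identical vectors score 1 and opposite vectors score 0
    # (assumed form; the paper just says it converts cosine to angular distance).
    return 1.0 - angular_distance / np.pi

u = np.array([0.1, 0.3, 0.5])
v = np.array([0.2, 0.2, 0.6])
print(angular_similarity(u, v))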
Although — actually, when they're doing the skip-thought tasks, is it the number of errors they're using for backprop, or is it this semantic similarity? Because if it's the semantic similarity and you change that function, that's a pretty big change to the project. Anyway, presumably they'll talk about it. All right, thanks so much for joining, guys. We've got an idea of what they're doing — I don't think there's actually going to be a lot of additional discussion of the model architecture; it seems to be pretty closely based on previous architectures, and they're not making any big algorithmic changes as far as I can tell. Okay, so what's new, what's different here? Models for encoding sentences into embeddings that specifically target transfer learning. I think the things that are different are that they're using transformers in this skip-thought type of architecture — that seems to be novel — and they're also using DAN encoders. The other difference is that they're doing transfer learning with these specific transfer tasks, and then they're doing the WEAT evaluation as well. Okay, interesting. Hopefully by the time we get to the end of this paper I'll be very clear on what its specific contribution is. It's really building very closely on previous work, which is fine — that's scholarship, that's completely normal — but it seems like either they're doing a lot of little things or I'm just having trouble picking out the one big thing they're doing differently. Anyway, we'll get back into it next week. On Friday we're going to keep going on our unsupervised text classification problem: we'll do some Brown clustering and see how that works, and then maybe try some different clustering algorithms and see if they're much better. And since it does look pretty straightforward to try these embeddings, we might also play around with them a little and see how they do — it looks fairly easy, once you have the embeddings, to do the sentence similarity measures, so that could be one potential way of looking at forum posts. We shall see. AJ says, "I prefer when papers specifically list their contributions." Yes, me too; when I'm reviewing papers, that's often one of the things I'll bring up if they don't. All right, I hope you guys have a great day. I've got a lot of stuff to do and a lot of fun things coming up, and I will see you on Friday or next Wednesday. Thanks for watching — bye!
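Since trying the embeddings for sentence similarity really is straightforward once the model is loaded, here is a short end-to-end sketch tying the pieces above together — a hypothetical example, not from the stream, using the Figure 1 sentences and again assuming the TF1-era hub.Module API:

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

sentences = ["How old are you?", "What is your age?", "Will it snow tomorrow?"]

# Embed the sentences with the Universal Sentence Encoder from TF Hub.
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")
embeddings = embed(sentences)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = session.run(embeddings)

# Pairwise angular similarity, as in the Equation 1 sketch above.
def angular_similarity(u, v):
    cos_sim = np.clip(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)), -1.0, 1.0)
    return 1.0 - np.arccos(cos_sim) / np.pi

for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(sentences[i], "|", sentences[j], "->",
              round(angular_similarity(vectors[i], vectors[j]), 3))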
Info
Channel: Kaggle
Views: 4,206
Rating: 4.8383837 out of 5
Keywords: data science, deep learning, nlp, neural networks, nlu, natural language, python, programming, coding, machine learning, ai, artificial intelligence, kaggle, research, technology, reading group, tensorflow, google, embeddings, encoder, feature engineering
Id: C8v-Q5ASfUE
Length: 63min 45sec (3825 seconds)
Published: Wed Jun 12 2019