Question Answering Beyond SQuAD: Larger Datasets and New Domains, with Branden Chan, deepset.ai

Captions
I want to start off by thanking Jess for having me talk to all of you today. I'm very excited to talk about question answering, from SQuAD to industrial search. To give a little bit of background about myself: as was mentioned, I'm a deep learning engineer at deepset, based here in Berlin. We're very interested in a lot of the deep learning technologies in NLP, especially language models and question answering and what they can do to help build industrial systems. I get to do a lot of very interesting deep learning projects with deepset here. I have a background in computational linguistics, where I did my masters at Stanford, and prior to that I was working on languages, especially classical languages, in Cambridge.

Just to set the scene: at deepset we see ourselves as bringing the latest research into industry. If anyone is familiar with what's happening in NLP research right now, I'm sure they've noticed that there's quite a surge of work looking into transfer learning, into language models, and into question answering. I already mentioned SQuAD, the Stanford Question Answering Dataset, and we're going to see how it forms the basis of a lot of bigger systems. At deepset we're very much interested in using these new research technologies to create industrial neural search. Driven by these advancements in question answering, we're working on Haystack, an open source, open domain question answering framework that helps you answer queries over your own large set of documents. The core of this is open source, and we also develop enterprise features on top of it to create production-ready systems. We've already been working in industry with a lot of different partners, who all have the same need: to really get the best out of their unstructured text. You can see a couple there, including Siemens and Airbus, who we've been working with.

I want to start by talking about a kind of search which I'm sure everyone is quite familiar with: web search. Everyone uses it very regularly, and maybe when you're using it day to day it's hard to track the advancements, but it definitely has improved; Google has obviously been working on it very heavily. One thing I want to point out is that Google started off much more in the conventional style of ranking different web pages. You might want some specific piece of information, like the address of the Eiffel Tower, but what it used to return was a page which might contain that information, which you would have to dig into to read and get the answer. Now, with modern NLP-based methods, it's possible to build systems which are much more accurate than that. If you run the same query now, if you want to find out the address of the Eiffel Tower, Google returns not just a page but really the exact string span, in this case the address, which answers your question. So we are talking now about queries in natural language and very granular, fine-tuned answers which really hone in on the information need. We at deepset are working a lot on these sorts of technologies, and we see quite a gap between what has been happening in web search and what is currently available in enterprise search situations.
For this reason we've been working very hard on this demo here, which I will switch over to right now. Can you see this? When you show your example, you need to change your sharing to the web browser; I can see your desktop right now. How about now, if I switch the tab? That's good, I see the Nvidia report. Great.

I just want to give you a sense of what's possible with this kind of technology. In this demo we have a large corpus of documents: they are annual reports and earnings calls from Nvidia. For example, we might be interested in asking questions about revenue, maybe specifically how high their revenue was in 2019. In the backend, BERT, the language model which I'll be talking about more specifically, is doing its work reading through a lot of documents, and is able to give us not just a relevant document, such as the fourth quarter report from 2019, but, highlighted in green here, the very specific piece of information that we've been requesting. Since we missed the first example during the sharing issues, let me go through it again: BERT reads through the stack of documents, and we return a list of documents which might contain the answer. They seem pretty relevant; we're talking about quarter four reports from 2019, and in green here is the exact answer to the question we just posed.

We can also ask something a little bit more open-ended, about the outlook and risks. We had a request: can you zoom in, please? Of course, how does that look? Here we have it. Again we get a set of documents returned, with sections highlighted which the model thinks are relevant to this query. For example, it highlighted the section "we expect cryptocurrency related revenue to be negligible going forward", which is actually quite relevant in this case, and we have a relevance score here to give a sense of what the model thinks of it. We also see the line "we expect competition to increase from both existing competitors and new market entrants with products that may be less costly than ours". So we see a lot of potential here. We really think this is the future of how we're going to be doing search in the enterprise: not just by keywords, not just with Ctrl-F, but with natural language and with systems which understand language, read through it, and make some sense of how sentences fit together.

With that, I want to come back to my slides and talk about the different components we need to build this kind of system, in a couple of sections. First, language models, which are really what underpins the language understanding component of these systems. Then question answering, a very popular NLP task which is integral to this. And then scaling: how do we get from extracting an answer from just a paragraph to extracting an answer from a large set of documents?
So let's get started: language models. What are they and what do they do? In the most basic sense, a language model is something which understands the distribution of words, and through that is able to guess words in certain positions. It builds a very good representation of what it sees and is able to get a sense of what a sentence means. Given a sentence like "the fox ___ across the street", a good language model should be able to assign high probabilities to words like "runs" or "walks". Why does it work like this? A language model needs to build a good representation of the context around the gap, words like "the fox" and "across the street", to make these good guesses. Practically speaking, it does this by representing words as vectors: it needs to build good vectors that allow it to make good guesses in these missing spots. Effective language models learn from massive unlabeled datasets; they really need to have seen a lot of language to understand how language functions, its rules, the meanings of words, and its compositionality. On a practical level, language models are, as I said, the foundation of many modern NLP systems, and we're going to talk in the coming slides about just how flexible they are.
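To make that fill-in-the-blank idea concrete, here is a minimal sketch using the Hugging Face fill-mask pipeline. The model name and the example sentence are illustrative choices by the editor, not ones given in the talk:

```python
from transformers import pipeline

# Ask a pretrained masked language model to guess the hidden word.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for guess in unmasker("The fox [MASK] across the street."):
    print(guess["token_str"], round(guess["score"], 3))
# A well-trained model should put high probability on words like "runs" or "walks".
```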
I want to show what a modern NLP workflow looks like when you're using deep learning models. You might start off with a language model architecture, it might be BERT, it might be XLNet, but something which can learn, and the first step is to give it a large corpus of unlabeled text; practically speaking, this is often scraped websites or Wikipedia dumps. By reading through all this text, the language model learns over time and, as I said, gets a good understanding of the distribution of words in the language, a good understanding of what words really mean. Once you have this trained language model, you might have a specific task in mind, and for that task you add a layer on top called the prediction head, which is tailored to what you're doing: document classification, named entity recognition, or question answering. Say we're doing question answering: we have a question answering dataset, and when we pass it into the language model, the language model converts the tokens, the text, into a set of vectors, represented by these green bars. These green bars are consumed by the prediction head, which then finally spits out a prediction appropriate to the task, such as answers to your questions. Through this process you train the prediction head, and it's these two components together, the prediction head and the language model, which form your fully trained question answering system.

I want to give a brief timeline of how we've gotten to where we are now with language models. Some of you might be familiar with very early vector space models such as GloVe or word2vec. These are uncontextualized vector space models which are very efficient; they were very effective when they first came out, and in certain cases they are still potentially the choice to go for, as a very lightweight but often very effective method. But they have one very clear drawback, which is that their representations are uncontextualized. To give you a sense of what this means: you might have a sentence talking about an apple. In some cases you mean apple the fruit, in which case you would have one vector for it, but in other cases you might actually be talking about Apple the company, and in an uncontextualized model like GloVe or word2vec, both cases are represented by the same single vector. You can see already that that's very suboptimal: in these two contexts "apple" has a very different meaning. That's why there was a push towards contextualized models such as ELMo and ULMFiT, which were a big improvement on the uncontextualized style.
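To make the contextualization point concrete, here is a small editorial sketch comparing BERT's vector for "apple" across sentences; the sentences and the model choice are assumptions for illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_for(sentence, word):
    # Return the contextualized vector of the given word's token.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]

fruit = embedding_for("i ate an apple for breakfast", "apple")
fruit2 = embedding_for("an apple fell from the tree", "apple")
company = embedding_for("apple released a new iphone", "apple")

cos = torch.nn.functional.cosine_similarity
print(cos(fruit, fruit2, dim=0))   # similar contexts -> higher similarity
print(cos(fruit, company, dim=0))  # different senses -> lower similarity
```

A static model like word2vec would return the identical vector in all three sentences; here the vectors differ with context.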
These contextualized models were initially built using the LSTM architecture, a recurrent neural network architecture which reads in one token at a time. LSTMs have seen a lot of use in NLP and still do, but they have a couple of quite significant drawbacks. One is that they struggle with long sequences: reading one token at a time, they eventually forget what came maybe 100 or 200 tokens before; they have a kind of memory that fades as you go. The other issue is that, with our current hardware, they're not the most efficient architecture, in that they are very hard to parallelize on a GPU: you need to take one step before you can perform the next, and that limits how quickly they can process text.

Fast forward to what we have now: the latest generation of language models are contextualized transformer models. This transformer architecture is something you'll find in a whole range of models, like BERT, like XLNet, like all the BERT variants such as RoBERTa and ALBERT, which we'll talk a little more about later on. This is the state of the art at the moment; these are the most performant models, and they are the ones which really get the most out of advances in hardware design.

So I want to talk about what kinds of models are out there. Why do I keep talking about BERT? BERT is a really important one in the context of language models, and a really big step up when it first came out, in that it effectively shifted the field to this transformer style of language modeling. The original version is trained on a corpus of about 13 gigabytes of data, made up of Wikipedia articles and a books corpus, and it's trained using two tasks: the masked language modeling task and the next sentence prediction task. As we saw before, this kind of architecture is very flexible and can be used for lots of different tasks: document classification, NER, and so on. The BERT model is the foundational language understanding; it's not task specific. You perform some kind of fine-tuning on a specific task in order to turn this BERT model into something a little more tailored to a practical use case. If you're reading papers now, and there are lots of language model papers coming out every month, you'll often still see BERT as the baseline model, just because it was so influential, such a big step up, and such an effective design.

It has spawned a lot of other models, as a lot of people wanted to make improvements upon it. One of these, for example, is SpanBERT, which pushed further by changing the masked language modeling task a little: instead of masking single tokens, it masks out whole spans, creating a much harder task, and they report some performance gains there. There's the RoBERTa model, which is quite popular right now and quite effective too; it is in essence very similar to BERT but with lots of minor tweaks: they add more data and train for longer, they remove the next sentence prediction head, and they use a larger batch size, and with that they manage to improve upon BERT. ALBERT is another variant, from a team at Google, which sought to shrink down the number of parameters in the model through a set of different enhancements. And there are a lot of teams looking to make BERT more efficient. One approach which I think is really cool, really interesting, is DistilBERT. It uses a distillation technique, also known as teacher-student learning, whereby you start with a large, fully trained model like BERT, and then you have a much smaller student network, a shrunken version of BERT, which tries to copy what the larger model does. It copies not just the output predictions that the teacher makes, but also the confidence that the larger model has over the range of available class labels, and through that we end up with models which are significantly smaller but retain a lot of the performance of the large model. That was just a couple of them; there are so many models out there now, and so many really cool new techniques coming out. Here is a very brief overview, but I think just by seeing how many models there are, you can tell it's a very exciting time to be in NLP right now, seeing what's coming next and how people are trying to push forward.

So what can we do with these language models? Well, one of these tasks is question answering, as I mentioned, and at this point I'm going to show a quick demo of what question answering is. Here I am on a demo page we created that leverages an English BERT trained for question answering. We might put in a company description paragraph, this one comes from a Wikipedia article about Airbus, and ask a question like "What are the divisions of Airbus?". As we run this, we get the appropriate answer highlighted: commercial aircraft, defence and space. We could also try something like "Where is Airbus?", and we see that Airbus's corporate headquarters is located in Leiden, Netherlands. What I really want to stress here is that there isn't any kind of rule being implemented behind the scenes; this system really reads whatever is typed into the question box. You can type in any kind of question, and it tries to make some semantic sense of it and apply that to the passage to identify an answer span.
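As an editorial illustration of the kind of fine-tuned QA model behind such a demo, here is a minimal sketch; the model name (one of deepset's public models) and the context paragraph are assumptions, not the exact ones from the demo:

```python
from transformers import pipeline

# Extractive question answering on a single passage.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = (
    "Airbus SE is a European multinational aerospace corporation. "
    "The company's divisions are Commercial Aircraft, Defence and Space, "
    "and Helicopters. Its corporate headquarters is located in Leiden, Netherlands."
)
result = qa(question="What are the divisions of Airbus?", context=context)
print(result["answer"], result["score"])  # the extracted span and its confidence
```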
Going back to the slides, to describe this task a little more specifically: question answering is actually a very broad field, and we're talking about quite a specific subset called extractive question answering. It works like we just saw: you have a question, you might be asking "What is Berlin?", you have a text passage, say a paragraph from Wikipedia, and it's the job of the system to highlight an answer, in this case "the largest city of Germany". There are quite a lot of challenges in this kind of task. To perform well, a model has to be able to understand both the text that it's reading and the question being posed, and really act upon both. It may have seen the passage before, but it might not have had it paired with that question; the system has to act on the fly and understand the information need. The other thing is that there is a very large set of potential predictions: any span of text in the passage could potentially be an answer to the question, and the model has to pick out just the right one. And in practical terms, the text that the model reads can be quite long. Here we have a limited example, just a paragraph from Wikipedia, but to really match people's information needs, the system might have to read through large documents or even whole stacks of documents.

The standard approach nowadays is to train a model that can identify the start and the end of an answer span, and we're going to see in a bit more detail what that actually means. It's no surprise that language models form the basis of these systems: we take a language model, we fine-tune it on labeled question-answer pairs, and we find that this works quite effectively; we get a system which can tell you "this is the start and end of the answer you're looking for". I mentioned SQuAD already. SQuAD is a very popular question answering dataset, almost the default in the field. It presents one Wikipedia paragraph and pairs it with various questions, and human annotators have highlighted the answers to those questions. It's a very useful dataset to start with if you want to learn how to build a QA system.

I want to show a little more of a practical example of what's going on behind the scenes. Let's say we have the same example: the question "What is Berlin?" and the text from the document, the Wikipedia paragraph, there in yellow and blue respectively. We pass this into the language model, BERT, and BERT returns contextualized word vectors, one vector for each token that comes in. These vectors are then passed on to the question answering head, and it's the job of the question answering head to return the start and end token, which means the system can extract "the largest city of Germany" as its predicted answer. How does it actually choose this beginning and end? When the prediction head is given a set of word vectors, a feed-forward layer is applied to each one of these vectors and spits out two outputs: a start logit and an end logit. When the model is confident that a position is the start of the span, the start logit will have a high value; when it's confident that it's the end of the span, the end logit will have a high value. This feed-forward component iterates over each position until we get this stacked-up set of start and end logits, and then you pick the best span out of those. As represented by the red in the diagram, the positions with very high values will be taken as the prediction.
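A minimal editorial sketch of that start/end mechanism, with random tensors standing in for the contextualized vectors a language model would actually produce:

```python
import torch
import torch.nn as nn

batch, seq_len, hidden_size = 1, 192, 768
hidden_states = torch.randn(batch, seq_len, hidden_size)  # stand-in for BERT output

qa_head = nn.Linear(hidden_size, 2)          # two logits per token: start and end
start_logits, end_logits = qa_head(hidden_states).split(1, dim=-1)
start_logits = start_logits.squeeze(-1)      # (batch, seq_len)
end_logits = end_logits.squeeze(-1)

# Score every candidate span (i, j) as start_logits[i] + end_logits[j],
# keeping only spans where the end does not come before the start.
span_scores = start_logits.unsqueeze(2) + end_logits.unsqueeze(1)
invalid = torch.ones(seq_len, seq_len).triu() == 0
span_scores = span_scores.masked_fill(invalid, float("-inf"))

best = span_scores.view(batch, -1).argmax(dim=-1)
start, end = divmod(best.item(), seq_len)
print(f"predicted answer span: tokens {start} to {end}")
```

Real implementations also limit the maximum span length and compare spans against a no-answer score, but the core idea is this addition of start and end logits.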
So why are we talking about question answering now? Now is actually a particularly good time to be working on it, because transformer-based models have significantly boosted the accuracy that is possible on these question answering tasks, and that has made it really practical to implement them in industrial situations. On the right here we see a graph of the state-of-the-art performance on the SQuAD dataset over time. It might be hard to see the numbers, but I just want to focus on one feature: the big surge in performance in the middle of this graph. That happened around November 2018, and it is attributed to the release of BERT, the first really successful transformer-based language model. Performance increased nearly overnight, and since BERT there have been lots and lots of incremental improvements, such that we're now at a point where these systems outdo the human benchmark.

There's a lot of momentum in research around this task of question answering. A lot of new datasets are coming out which are trying to push the task in different directions; they try to make it more difficult, more challenging, and they're trying to encourage systems which are more robust and able to answer more information needs. For example, there's a big push towards systems which can synthesize information from different parts of a text: systems that can read one sentence early in a document, see something much later in the document, and combine these pieces of information to present a relevant answer. As we saw before, a lot of new model architectures are coming out as well, showing improvements not just on question answering but across a whole range of tasks; the benefits of these better architectures are definitely being passed on to question answering. There's also a lot of work on getting question answering and language models to operate not just on English but on a whole set of languages. XLM-RoBERTa is a very nice case of this: it's a single model that can handle a hundred languages at once, and it shows very competitive performance even compared to a lot of single-language models.

We think it's also a very interesting space to be in right now because we see this need in industry for state-of-the-art search systems, and we think this interest is really going to drive development of the systems that people need in workplace environments. We see a large and growing number of information workers who have complex queries and who would really benefit from a tool more powerful than simple keyword search. We also see, as I mentioned, a big gap between what's possible in web search and what there is in enterprise search, and we at deepset have really seen that this technology can help narrow that gap. We're driven to see what we can build to help people answer their questions from their own documents.

Here I want to give a quick overview of some of the datasets out there. I've talked a lot about SQuAD. SQuAD version 1 became SQuAD version 2 when they added a lot of no-answer negative examples: in SQuAD version 2 you can now have a question that cannot be answered by the passage at hand, and in that situation the model is meant to say exactly that: "the question you posed can't be answered by what I have".
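As an editorial aside, the Hugging Face QA pipeline exposes this SQuAD-2.0-style behaviour through a flag; a small sketch, reusing the same illustrative model as above:

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question="What is the population of Paris?",
    context="Berlin is the capital and the largest city of Germany.",
    handle_impossible_answer=True,  # allow the model to return "no answer"
)
print(result)  # an empty answer string signals that the passage has no answer
```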
There are other datasets interested in working with different sources. For example, there's visual QA, which operates on images: you have a question which might be asking about an image. There are datasets which work with tabular data. There's a set which works on reasoning and synthesis, which, as I mentioned, is really about pulling together different bits of information from different sources and making some sense of that. There are a lot of datasets which have translated SQuAD into different languages, or which use the same methodology to create non-English SQuADs. There are datasets designed to hold lots of different languages together, so they do question answering but with parallel examples in, for example, Japanese and Arabic. There is open domain question answering, which is interested in answering questions not from just one document but by looking at a whole corpus of documents. And there are new answer types popping up: Natural Questions, for instance, contains examples where the model is expected to give back a yes or a no as the answer, such as "did this event happen this year?". In that situation the model often cannot highlight a span which would answer the question; an effective system has to respond with a yes or a no.

One thing I want to point out is that SQuAD is a very effective dataset, and it has produced some really cool models that can do incredible things, but it has certain limitations. One that was picked up on relatively early stems from the fact that the people who wrote the questions for SQuAD did so by reading one of these Wikipedia documents and coming up with a question. The issue is that the questions these writers come up with often include a lot of the same keywords that exist in the document, and with this kind of lexical similarity between question and document we actually get a much easier task. It doesn't leverage the full power of language models, which are very capable of understanding synonymy and flexible enough to deal with different wordings and phrasings. With these little eye symbols I've highlighted the datasets which have made a strong point of generating questions differently, such that the question writer does not see the document that contains the answer; these tasks are much trickier and really put models to the test.

QA has a lot of potential, and we see that potential when we can scale it to an open domain setting. This reflects the information needs of workers much better: having a question like "What is Berlin?" but trying to find the answer not just in the Berlin Wikipedia page, but in a large collection of documents. There are challenges to this. I'm going to switch terminology a little here: what I've been calling question answering systems are, in the context of open domain QA, called readers; that's the component that reads each sentence, each line, very closely. As it stands now, transformer models are limited in the input length they can process at a time. This is a challenge that has a lot of researchers very interested, and a lot of people are working on it right now, but currently it's a real limitation: it's hard to ingest that much data and process it completely. There's also a definite need for speed in any system that's going to be answering questions, and this gets harder when we're talking about larger collections of documents.
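A common workaround for the input-length limit, not specific to this talk, is to slice a long document into overlapping windows and run the reader on each chunk. A plain-Python sketch, with made-up window and stride sizes:

```python
def sliding_windows(tokens, window=384, stride=128):
    """Split a long token sequence into overlapping chunks so that a
    fixed-input-length reader can see every part of the document.
    `stride` tokens of overlap keep answers from being cut at a boundary."""
    assert window > stride
    chunks = []
    for start in range(0, len(tokens), window - stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks

doc = ["tok"] * 1000                    # stand-in for a tokenized document
print([len(c) for c in sliding_windows(doc)])
```

The reader is then run over every chunk and the best-scoring span across chunks is kept.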
One other thing I want to point out is that these reader models, these question answering models, when they read through lots of different documents, might find not just one candidate answer but a whole set of candidate answers scattered all around the place. In those situations we need some kind of logic that tells the model which one to pick: how do you weigh up answers scattered across different documents? And I want to stress that the answer is not always a span: as we said before, we have no-answers, we have yes and no answers; how do you weigh these against each other?

One popular approach is known as the retriever-reader pipeline, and it looks something like this. On the left we start off with your big store of all the documents. You pass this to a retriever, a much more lightweight system that filters down to just your top k documents. These are passed on to your reader, the deep learning based model we've been talking about, and it's the reader which has the job of extracting the exact answer. In some of our experiments we've been working with readers and retrievers in a setting with about seven and a half thousand documents from the Natural Questions development set, using a RoBERTa base model and a Tesla V100. If we tried to run a query with just the reader, we were looking at over three hours for a single query; with the two-stage approach, with the retriever in there as well, we can reduce that to about one to two seconds. There's a certain trade-off here, in that to get this speed-up we can't pass all our documents to the reader, but over the next couple of slides we'll see a lot of ways improvements are coming along.

I want to talk about a couple of different approaches to readers and retrievers, and give a sense of what level of performance we're at right now. If we had a perfect retriever, something which always returned the right document, we would expect the retriever's top-20 recall to be perfect, to be 100. Using the best reader models right now, and this number is taken from the Natural Questions leaderboard, we could expect something like 51% top-1 exact match for the extracted span: given the right documents, we'd expect this kind of model to extract exactly the answer to the question about half the time. On the retriever side, a variant of TF-IDF such as BM25 is a very established baseline, which I think a lot of you will know and a lot of companies use, for example in Elasticsearch.
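For orientation, here is a tiny editorial sketch of TF-IDF retrieval over a toy corpus; production systems use an inverted index (for example Elasticsearch with BM25) rather than dense matrix similarity, so treat this purely as an illustration of the idea:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Fourth quarter revenue was 2.21 billion dollars, down 24 percent.",
    "We expect cryptocurrency related revenue to be negligible going forward.",
    "The company announced a new graphics architecture at the conference.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

query_vector = vectorizer.transform(["how high was revenue in the fourth quarter"])
scores = cosine_similarity(query_vector, doc_vectors)[0]

top_k = scores.argsort()[::-1][:2]       # indices of the 2 best documents
print([documents[i] for i in top_k])     # these would be passed to the reader
```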
There are a couple of other techniques here, such as latent retrieval, and also REALM, which just came out this year. The one I want to talk a little more about is Dense Passage Retrieval, which uses a dual encoder architecture. I mention this because I spoke earlier as though readers are always deep learning models while retrievers come from some other architecture altogether, but Dense Passage Retrieval is something that might shake that up: it's showing a lot of promise and is actually using deep learning for the retrieval part as well. How it works is that we have two BERT-based encoders, one for the question and one for the passage, which map the passage and the question into a shared embedding space, such that when you take the dot product of the two, it is high when the passage answers the question, or at least when the passage contains the answer.

One of the cool things about this paper is how they train it. The idea is that for each question you want one positive passage and a few negative passages which just don't contain the answer. They do something very smart called in-batch negatives, whereby within a batch, the positive passage for one question is used as a negative passage for another question, and by doing this they gain a very efficient way of training. They've managed to get some very nice results out of it. One other step they took was combining multiple QA datasets to augment their data, and they showed quite convincingly that this gives them a lot of improvement. If you want to use this at inference time, the idea is that you have a big corpus of documents, you precompute embeddings for these documents, and you index them in a system such as FAISS, from Facebook. Then, when you have a query, you embed it and retrieve the document embeddings which are nearest to it, the k nearest neighbors.
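A compressed editorial sketch of both halves of that idea, with random tensors standing in for the outputs of the two encoders; the shapes and batch size are assumptions:

```python
import torch
import torch.nn.functional as F

# --- Training: in-batch negatives ---
batch, dim = 8, 768
q = torch.randn(batch, dim, requires_grad=True)  # question embeddings
p = torch.randn(batch, dim, requires_grad=True)  # row i: positive passage for question i

scores = q @ p.T                        # similarity of every question with every passage
labels = torch.arange(batch)            # the positive for question i sits at column i
loss = F.cross_entropy(scores, labels)  # the other batch-1 passages act as free negatives
loss.backward()                         # in a real setup this updates both encoders

# --- Inference: precompute passage embeddings and index them, e.g. with FAISS ---
import faiss
index = faiss.IndexFlatIP(dim)          # exact inner-product index
index.add(p.detach().numpy().astype("float32"))
query = q[:1].detach().numpy().astype("float32")
_, nearest = index.search(query, 3)
print(nearest)                          # ids of the 3 passages nearest to the query
```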
As mentioned, we at deepset have our own open domain question answering framework, Haystack. It's open source, up on GitHub, and you can try it out. We have a little snippet of the code here to show how the concepts I've been talking about are implemented. To start, you have a document store, which you can see highlighted here; the DocumentStore is what holds all your documents. We have the retriever component, which is the first filtering step, where you pick just a subset of the most relevant documents for your query. You have a reader, which is the deep learning based question answering system that's really going to extract the bit of text that answers your query. We put these together through an object known as the Finder, and we call it to get predictions. I hope you'll give it a try; it's actually quite fun to work with, and we have a nice little demo for Game of Thrones fans, where we've preloaded a set of Game of Thrones articles, so you can ask things like "Who is Arya Stark's father?" and so on. It actually does surprisingly well.
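The snippet described here looked roughly like the following. This sketch follows the Haystack API as it was around mid-2020; module paths, class names (including Finder), and defaults have changed in later releases, so treat it as illustrative rather than current:

```python
# Retriever-reader pipeline, Haystack circa mid-2020.
from haystack import Finder
from haystack.database.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.reader.farm import FARMReader

document_store = ElasticsearchDocumentStore(host="localhost", index="document")
retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

finder = Finder(reader, retriever)
prediction = finder.get_answers(
    question="Who is the father of Arya Stark?",
    top_k_retriever=10,   # documents passed from the retriever to the reader
    top_k_reader=5,       # answer candidates returned
)
```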
To wrap up what I've been talking about: we see right now that transformer models have enabled really big performance gains in extractive QA, and we see that this can really drive the development of industrial search applications, even in an open domain setting. We see that there's potential to scale extractive QA up to the open domain: we have these components known as retrievers which enable open domain QA by filtering the search space, so that the reader doesn't have to read through your whole document store, just a subset. And we want you to play around with this kind of technology; we've found it really fun to work with, and you can try it through Haystack. To give a little roadmap of what we see coming: there are going to be a lot of improvements; there will be new data formats, with tables and figures and images; there will be new retrieval technologies, like dense passage retrieval; and people will be working on more reasoning and synthesis. I think that in the future this will all still be in the family of extractive QA, and we'll see these systems get better and better at answering the information needs of people when they go through their own large stacks of documents. So thank you very much for listening; I hope that was helpful and interesting. If you ever want to get in touch, here are some links and LinkedIn handles.

Thanks very much, Branden. We have quite a set of questions. I'll make the remark that further down the horizon from the slide you showed, I'm waiting for argumentation: in other words, not just answering questions, but putting forward arguments, which are essentially a series of answers that move people to a certain position. If you want to take a look at the Q&A, we have what looks like twelve questions, and two of them you've answered in the course of your talk. I can read them to you or you can read them out loud yourself, whichever you prefer. One that you largely answered already, from Stephen McInerney: can you comment on non-English SQuAD-type datasets? For a target language other than English, what are the relative pros and cons of taking an answer from an English model and translating it, versus a direct query on a model in that language?

Yes, I do have more to say. That's a really interesting area we've been working on at deepset, and we have a blog article that looks at lots of different attempts; I'd love to share a link to it through this chat thread. What I want to focus on is that there are two big approaches. There's the machine-translated style: start with the English SQuAD, translate it, and try to maintain the alignment of the answer spans. And there's the method which tries to replicate the methodology of SQuAD: get people, on Mechanical Turk or human translators, to create a parallel SQuAD. What's been quite clear throughout this literature is that machine translation by itself is not enough to create an effective SQuAD-equivalent dataset in another language. By contrast, these human methods have proven quite successful; they've been able to create datasets, for example FQuAD in French, which is not as big as SQuAD but can still train models that recover a lot of the performance you see on SQuAD. One last thing I want to add is that there are hybrid approaches as well, where teams have had human translators or annotators create, say, an Arabic version first, and then added machine-translated examples on top of that Arabic SQuAD; in some of these cases we also see a performance boost from adding the machine translation. But the bottom line is that nothing beats human annotation.

All right. I have two questions from Harish Romani. One is: how do we evaluate the models, and with which metric? The second: if he has his own dataset, what's the first thing he needs to do so that he can do question answering on it? Does he start with named entity recognition?

Sure. To evaluate how well a model is doing at question answering, two metrics are generally used. The first is the F1 span overlap. Essentially, let's say the answer we're looking for is "Berlin is the largest city in Germany", and your model predicts something that stops short: there's some overlap, but it doesn't get it exactly. The F1 metric calculates a kind of precision and recall to say, okay, the prediction got some of the answer but not all of it; you won't get a full one out of one for this, but you'll get partial credit. That's the more granular metric used to measure how good these models are at extractive QA specifically. The other metric is exact match: in this case your model only scores one out of one on a sample if it returns exactly the same string.
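A minimal editorial sketch of both metrics. This is simplified relative to the official SQuAD evaluation script, which also strips punctuation and articles before comparing:

```python
from collections import Counter

def exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

def f1_score(prediction, truth):
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(true_tokens)  # shared tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the largest city", "the largest city of Germany"))  # 0
print(f1_score("the largest city", "the largest city of Germany"))     # partial credit
```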
How would you get started? NER is actually not necessary to build a good QA system. A lot of the time the answers we're talking about are not named entities at all; sometimes the answer you need is a whole sentence or even a whole paragraph. NER might give you good markers for the relevant topics in an area, but for QA you don't really need it as a starting point. I would point you to SQuAD: as I said, SQuAD is really popular, we've been able to get very good performance on it, and it's the archetypal extractive QA dataset. But once again I'd give the caveat that it's essentially solved and a slightly limited task, and we're seeing this next generation of datasets coming out which are much more challenging, much harder, and which are really pushing systems to the edge.

Okay, I have two people with their hands raised, Derek and Vineet; I'm going to call on them after the next answer. I also have two questions here from Paul Trez Socko, which I think you mostly answered a moment ago when you talked about metrics: what is relevance based on, and how does search performance for QA compare with traditional TF-IDF search? Anything further to say about metrics and relevance?

Sure, just to give a sense of where that relevance number you saw in the demos comes from, and how the model knows how confident it is in a given prediction: there are a couple of things that go into it. One is those logits we were talking about before, the dots which represent each position; when one is red, it means a high value, meaning the model is most confident there, and based on that we can calculate how confident it is in that span. There's also the document level: depending on your architecture and what kind of retriever you use, the retriever might also return a confidence that a certain document is relevant to your query.

Could you remind me of the second part? Search performance versus traditional TF-IDF.

Yes. TF-IDF has been very effective for how simple it is, and I see it still being used in some cases; there are systems which still use a fusion of deep learning methods mixed with TF-IDF. What I would say is that TF-IDF has one very clear drawback: it relies very heavily on lexical overlap. It really relies on the query having the same terms as the document. If there isn't that kind of overlap, TF-IDF will really struggle: if you use a synonym, a word that means something similar but not the same, TF-IDF can't pick up on it. By contrast, these deep learning models are much more robust to synonymy; they're really projecting words and their meanings into an embedding space, so that words with similar meanings are represented in very similar ways. On the other hand, deep learning models are going to struggle when dealing with words they've never seen before, whereas TF-IDF can, in a very shallow way, deal with this: if it encounters a proper name, or the name of some institute, or a word that hasn't been seen before, TF-IDF doesn't care what it means; it's only interested in whether it exists in both the question and the documents you're searching through.

All right. Vineet Kumar, I'm promoting you to panelist so that you can ask your question directly, if you're still there. While we're waiting for you to unmute, I'll just remark that TF-IDF has been around for several decades; I've seen presentations in recent weeks that refer to keyword-in-context (KWIC) indexing, which has been around since the late 1950s, and we're still finding a lot of relevance in it. That doesn't seem to be working, so let's go back to the questions entered here. We have Melanie Beck: there's a lot of excitement over DistilBERT's smaller footprint and faster inference time; however, it does lose a considerable amount of performance on the SQuAD datasets, particularly on SQuAD 2.0. What's your take on this model, that is, on DistilBERT?

I see DistilBERT as part of a movement in research that is very important to the field. These models we're talking about keep getting bigger, and their size is often one of the challenges of getting them working in production settings. DistilBERT is one really cool approach to this, but it's not the only one: there are approaches such as pruning away weights which aren't learning; there are new architectures like ALBERT which try to reduce the number of parameters; and, this is where I think the most promising developments are, there are new transformer architectures which work on much longer spans and do so much more effectively and efficiently. So yes, it is a concern right now that DistilBERT isn't quite enough to get us from deploying on GPUs to deploying on CPUs, but there's a lot of work going on out there, and we're going to see some very different models come out that make these systems fast enough to run, hopefully, even just on a CPU, on your local laptop.

Okay, I think we'll keep going for maybe another ten minutes, based on the questions we have. You have a question from Frankie: I want to ask for advice on how to deal with dialogue-transcript QA, since the corpus has a longer text length, around 5,000 tokens, and the text structures are different from a traditional document.

That's a very interesting one. What I'd stress is that in most of the datasets I mentioned today, the documents are something along the lines of Wikipedia articles, so when you talk about context there, you're talking about what came in the previous paragraph or the previous sentence. In dialogue you have a very different interaction of text, with two separate agents. To be completely honest, this isn't an area I'm very familiar with; I haven't worked a lot with dialogue text. There is definitely a slice of research working in this sort of domain: if I'm not mistaken, and I can't quite remember the names of the people, a team at Stanford was interested in an iterative QA system which, given an information need, given a query, would return questions that try to refine the search space, a back-and-forth between the system and the person who has the information query. So it's a different style, but there is definitely work in this field; I just have to say I'm not super familiar with it.
Certainly a lot of dialogue does involve trying to refine, and often rephrase, the questions being asked, and that was the basis of a search engine from fifteen or more years ago: Ask Jeeves. The initial Ask Jeeves was trying to take an iterative approach to get better relevance on the question. Just a very quick one: are you going to provide the slides for us? I'll post them in the meetup post. Okay, definitely.

We have a question from Avinash: can you please share a pointer to QA models that can answer a question using separate parts of the paragraph, as opposed to a single start-and-end answer span?

Sure. One that I would point to first is Natural Questions. Natural Questions has annotations which are not just a single span: you might ask something like "Who was in Led Zeppelin?", and maybe those entities don't come up neatly in a single string; maybe they're scattered across a couple of different paragraphs. In Natural Questions you do have these multi-span answers. There is another dataset, I believe it's HotpotQA, or maybe TriviaQA, one of the datasets in this cluster, which has not just an answer but also supporting facts which lead to that answer. Systems built for that really have to be able to handle this kind of synthesis, and I can see a lot of potential in transitioning towards systems which answer with multiple spans, or even provide some kind of reasoning for why they picked a certain span.

Talking about the facts that lead to an answer, that's the essence of argumentation, whether it's argument mining or argument production. We have a question about training BERT language models on a specific data corpus, such as medical jargon.

Sure, there's definitely a lot of interest in creating BERTs for specific domains; some of the participants have put forward BioBERT, SciBERT, and the like, and there's a lot of demand for those kinds of models. What I would say is that there's a bit of a trade-off here. One of the challenges in training a model from scratch is that you need a large amount of text, at a minimum, I would say, in the range of 10 gigabytes of raw text; that's really the only way to get a well-trained language model that is going to perform well. There is a method known as domain adaptation, where you take an already pretrained model and fine-tune it on this new set of text, towards being familiar with medical data, or scientific data, or legal data; this fine-tuning step is really about getting your model accustomed to new jargon, a new style, a new domain.
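A compressed editorial sketch of that domain adaptation step, continuing masked-language-model pretraining on in-domain text. The file name is hypothetical, and the classes used (including LineByLineTextDataset, since deprecated) reflect the Hugging Face library roughly as it was around the time of the talk:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Raw in-domain text, one passage per line (hypothetical file).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="medical_corpus.txt",
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-medical-adapted",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()  # the adapted model can then be fine-tuned on a QA dataset
```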
There is one trade-off with this, which is that the vocabulary of the model is already somewhat fixed. There are ways you can add some tokens in, but the motivation behind some of these other models, I believe BioBERT is an example, is to train from the very beginning just on medical text, so that the model is very focused on that kind of text, particularly good at it, and has a really well-adapted vocabulary for it. Ultimately I think it comes down to how much data you have and whether you're going to be able to train an effective model: 10 gigabytes is really a minimum, and we're talking in the hundreds of gigabytes for some of the latest models.

All right, and what about integrating multiple models, ensembles, and scoring mechanisms? Different ensemble methods have been around in many different contexts for a while, so since this is a talk about QA, let's keep it QA-specific.

Sure, definitely. In almost every task I've seen in NLP, the top results on leaderboards almost always come from ensemble methods, and you can quite reliably get gains in performance through them. I'm all for it, and if you're really trying to push the state of the art, this is a very common method. Obviously it's going to be a lot more intensive in terms of compute, but if you can afford to do it, by all means go for it.

We had two questions from Nikki Aurora about question answering for student answers in university examinations: grading assignments, final exams, that sort of thing. I'll interject that this was articulated as one of the so-called grand challenges for AI by some people I follow, ten or more years ago. I actually know people at the Educational Testing Service here in the United States, who administer the SAT exam and other exams, and they are working very hard on this, because they would like to be able to automate the scoring of the essay-type responses given as part of their tests. So the work is definitely going on.

I definitely agree, and to characterize what these models are good and not good at: they definitely offer what I would call a much finer reading of text than we've ever had before. Word vectors were good at telling you roughly what words mean and grouping them together; now we have models which actually read a sentence, compose it, and can say, okay, this verb does this, and the sentence has this sort of gist to it. But there are still some very big limitations to what they're capable of doing.
I don't think we're at a point yet where we can have a very good representation of the reasoning, the style, the eloquence, or, how should I put this, the world knowledge and conceptual understanding of a student. I think this kind of technology can still be used for a lot of different things, and there's a lot of potential in style analysis, but in an educational sphere I would rather see this technology used as a tool for students, to help them get to what they need in order to study effectively, to learn effectively, and to do well in their courses.

Let's do two more. I'm going to group together three responses to the discussion a few moments ago about TF-IDF and so on. Yasser Martinez: have you seen examples of non-TF-IDF systems, for instance pure embedding retrieval, in production? Steven McInerney: surely the middle ground between TF-IDF and deep learning is connecting to a knowledge base; that's a different wrinkle. And a follow-up on TF-IDF: what's a good way to incorporate OOV, that is, out-of-vocabulary words, into vector and language model based methods?

On non-TF-IDF systems: in the little summary I showed before, what I was trying to convey is that we're very excited about dense passage retrieval, about these embedding methods. They are really the most robust systems we have right now, and the issue with getting them into the retriever has been: how do you train one that's good enough and fast enough to be used at scale? Dense Passage Retrieval is one of the first papers that really showed this is a possibility; what they really showed was a very good way to train it effectively, with the performance to go with it. The idea, as I said, is that you have an embedding for every document in your corpus, and this is really the most fine-grained retrieval we can have right now. Knowledge bases, I have to admit, are not my strong suit, and I know there are a lot of knowledge base approaches out there, but we really see that there's going to be a lot more development in these dense retrieval methods, and we find that really promising.

I also want to make a quick comment on out-of-vocabulary words. These deep learning models, the transformer-style ones, mostly use some variant of a WordPiece-style tokenizer, which splits a word up into component parts. If a word has not been seen often enough by the tokenizer, it gets broken into lots of small chunks: maybe "respectfully" would be broken up into "respect" and "fully", something like that. But if the tokenizer has seen the word enough times, it keeps that word as a single token. The hope is that these models have seen the components of out-of-vocabulary words often enough that they can make a good guess at what the whole word means. I've been a little roundabout about this, but in some sense no word is really treated as out-of-vocabulary by a deep learning model; it's just forced to make a good guess based on the word's parts. With these newer methods, out-of-vocabulary words are not as big, not as crushing an issue as they were in earlier systems.
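A tiny editorial illustration of that subword splitting; the example words are assumptions, and whether a given word stays whole depends entirely on the model's vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A word may stay whole or be split into known pieces, depending on the vocabulary.
print(tokenizer.tokenize("respectfully"))
print(tokenizer.tokenize("electroencephalography"))
# Rare words come back as subword pieces (marked with '##' in WordPiece),
# so no input word is ever truly out of vocabulary.
```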
Okay, so we'll take one final question, and that relates to your job, which I thought could be an interesting way to close. Sushant Singh: does your job involve developing products based on research papers, in accordance with industry needs? And how is that different from people working on pure NLP research? So: about what you do.

Sure. We're really doing a mix of these things. I'm actually currently working on a paper: we've been training a lot of German language models, and getting models which really push forward performance on named entity recognition and document classification. We've been working a lot with the ELECTRA model, which came out just this year and is a very cool twist that makes training much more efficient. So there is a component of my work which is very research related; I'm always trying to read whatever new work is coming out, and it's very much in the DNA of deepset to be following the developments in QA, in language modeling, in open domain QA, in transformer architectures, and so on. But at the same time, we're building this open source software, we're working on projects with people in industry, and we always have an eye to making software that people will actually use. That's really something that is essential to us.
Info
Channel: NLPxing
Views: 3,933
Id: E80qHThomok
Length: 70min 26sec (4226 seconds)
Published: Tue May 26 2020