How to Build Q&A Models in Python (Transformers)

Captions
Hi, and welcome to this video on question answering with BERT. First we're going to have a look at the Transformers library and how we can find a Q&A model, and then we're going to look at the Q&A pipeline. We're going to look at actually loading a model in Python using the Transformers library, we're going to look at tokenization, how we load the tokenizer and what exactly the tokenizer is doing, and then we're going to take a look at the pipeline class, which is essentially a wrapper made available by the Hugging Face Transformers library that makes our job of building a Q&A pipeline incredibly easy. All of this is quite straightforward, so let's get straight into it.

When we're doing question answering, we're essentially asking the model a question and passing it a context, which the model uses to answer that question. Down here we have three questions, such as "What organization is the IPCC a part of?". The model will read through the context and use its language modeling to figure out which organization the IPCC is part of, something that is not inherently clear from the text. We can see the IPCC here "is a scientific intergovernmental body under the auspices of the United Nations", so clearly the IPCC is a part of the United Nations, but the passage never states that definitively. Once we've built this model, though, it will quite easily answer each of these questions without any issues.

The first thing we want to do is go over to the Hugging Face website, and on the Hugging Face website we go to the models page. On this models page, the task we want is question answering. There are all these tasks listed because transformers can do a lot of different things: text summarization, text classification, generation, and so on. We want question answering, so we click on it, and this filters all of the available models down to just question answering. This is the power of the Hugging Face Transformers library: it already has all these pre-trained models that we can just download and start using. When you want to apply these to specific use cases, you will probably want to fine-tune one, which means training it a little further than it has already been trained, but for getting used to how all of this works, all you need to do is download a model and start asking questions. There are obviously a lot of models here, 262 for question answering, with new ones being added all the time. A few that I would recommend are the deepset models; there are eight of them for question answering. The one we will be using is bert-base-cased-squad2, and another I would definitely recommend trying out is electra-base-squad2, but we will stick with BERT base. It's called deepset here because it comes from the deepset AI company, and the model is pulled directly from their model repository, so deepset is the organization and bert-base-cased-squad2 is the repository name. The model itself is obviously BERT, from Google AI, and "base" is the base version of BERT.
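If you want to follow along in Python, the running example can be set up as two variables: the context paragraph and a list of questions. This is a minimal sketch; the context string below is reconstructed from the passages quoted in this video, so treat the exact wording (and therefore any character offsets later on) as an approximation:

# Running example, reconstructed from the passages quoted in the video.
context = """The Intergovernmental Panel on Climate Change (IPCC) is a
scientific intergovernmental body under the auspices of the United Nations.
It was first established in 1988 by two United Nations organizations, the
World Meteorological Organization (WMO) and the United Nations Environment
Programme (UNEP). The ultimate objective of the UNFCCC is to stabilize
greenhouse gas concentrations in the atmosphere at a level that would
prevent dangerous anthropogenic interference with the climate system."""

questions = [
    "What organization is the IPCC a part of?",
    "What UN organizations established the IPCC?",
    "What does the UN want to stabilize?",
]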
You can also see bert-large here, which is just the larger model; we're using the base model. "Cased" refers to the fact that this model differentiates between uppercase and lowercase characters; the alternative would be "uncased", where there's no differentiation between uppercase and lowercase. And "squad2" refers to the question-answering dataset this model has been trained on, the SQuAD 2.0 dataset from Stanford University.

So we're going to take this model, deepset/bert-base-cased-squad2, and load it. All we need to do is import BertForQuestionAnswering from transformers, the Hugging Face Transformers library. This is a specific class, and with it we can initialize a few different models, not just this specific one; it will also load bert-base-cased, the larger cased model, and so on, and there are equivalent classes for RoBERTa and DistilBERT models too. What this class does is load the model with its question-answering layer added on, because this model has been trained with a final layer specifically for question answering, and we need BertForQuestionAnswering to load that. If you're not targeting a specific use case and just want the model itself, you can use the AutoModel class instead, but we want question answering, so we load this one. Another thing to note is that we are using the PyTorch implementation here. Transformers works with both TensorFlow and PyTorch as alternative frameworks behind the scenes, and in this case we're using PyTorch; if you want to switch over to TensorFlow, all you do is add "TF" in front of the class name. So that is our model class, and to actually load the model we use the from_pretrained method, which is where the model name comes into play: we pass in "deepset/bert-base-cased-squad2". With that, we've loaded the model.

Of course, there are a few steps before we can actually get data into the model; we need to process it first. We have this context here, and it's just a string, but BERT doesn't understand strings. BERT understands an array of integers, where each integer represents a token ID. Those token IDs are specific to BERT, and each one is unique, representing a particular word, piece of punctuation, and so on. So we need to convert our string into that BERT-ready format, and to do that we use a tokenizer. Again from transformers, we import the AutoTokenizer class. We could use the BertTokenizer class here, but we don't need anything model-specific; the tokenizer is quite generic and simply loads all of the mappings from strings or words into tokens, so there's no real issue. To initialize it, the syntax is practically the same as before: we use the from_pretrained method with the same model name. With this we can tokenize our data: we call tokenizer.encode and pass in one of our questions, in this case the first one, questions[0].
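Written out, the loading steps just described look something like this minimal sketch (it assumes the transformers library and a PyTorch backend are installed):

from transformers import BertForQuestionAnswering, AutoTokenizer

model_name = "deepset/bert-base-cased-squad2"

# Load the PyTorch model together with its question-answering head.
model = BertForQuestionAnswering.from_pretrained(model_name)

# Load the matching tokenizer, which maps strings to BERT token IDs.
tokenizer = AutoTokenizer.from_pretrained(model_name)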
Two arguments that we will need to add here are truncation, which we set to True, and padding, which we also set to True. When we set these models up, BERT in particular will expect 512 tokens with every input. Looking at our question, each word is most likely one token, and the question mark at the end will also be a token, so we have around 10 tokens in there. Because we have padding enabled, the tokenizer will add a set of padding tokens onto the end to bring the total number of tokens up to 512. Alternatively, if we had 600 tokens, we would rely on truncation to cut the final 88 tokens and bring the total down to 512. That's why we need those two arguments.

Let's see what we get from this. You can see we have our tokenized input, which BERT will be able to read and understand. This 1327 is the equivalent of "What", this 2369 is the equivalent of "organization", and so on. What you might not expect is this 101 at the start. For BERT, 101 refers to a special token, [CLS], which signifies the start of any sequence. If we tokenize the [CLS] string itself, we see the same thing: we get the 101 that marks the start of the sequence, then 101 again because the tokenizer is reading the [CLS] we passed in and converting it into 101, and then a final special token as well, 102. We can see that here too: this is another special token, [SEP], which signifies the end of a sequence, or a separation point, and if we write it out you can see that the separator is 102.

What do I mean by a separation point? When we feed a context and a question into our BERT model, BERT will expect them in a format something like this: first the start-of-sequence token, then our context tokens (just a list of integers, the token IDs), then a separator token, followed by our question tokens, which are followed by another separator token, and after that a set of padding tokens, [PAD], taking us up to the 512-token total. That's what the data going into BERT looks like: start of sequence, context, separator, question, separator, padding. It's always going to look like that going into a Q&A model.
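As a sketch, the encoding step looks something like the following. One caveat: for a single sequence on its own, padding=True will not pad anything, so padding="max_length" with max_length=512 is used here to get the fixed-length behavior described above; the IDs in the comments (101, 102, 0) are BERT's actual special-token IDs:

# Encode the first question as BERT token IDs, truncating long inputs
# and padding short ones out to a fixed 512 tokens.
tokens = tokenizer.encode(questions[0], truncation=True,
                          padding="max_length", max_length=512)
print(tokens[:10])  # begins with 101, the [CLS] start-of-sequence token

# The special tokens discussed above:
print(tokenizer.cls_token, tokenizer.cls_token_id)  # [CLS] 101
print(tokenizer.sep_token, tokenizer.sep_token_id)  # [SEP] 102
print(tokenizer.pad_token, tokenizer.pad_token_id)  # [PAD] 0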
Now what we want to do is set our tokenizer and our model up in a pipeline, a Q&A pipeline. Again we get this from the Transformers library: from transformers import pipeline. Then we initialize a pipeline object by calling pipeline, and the first thing we need to pass in is the task type. As we saw, there are all of these different tasks (summarization, text generation, and so on), and the pipeline object needs to know which of those pipelines or functions we intend to use. To tell it we want question answering, we pass "question-answering", and that sets the pipeline's wrapper up to handle question-answering formats. For our input we will pass a context and a question, and the pipeline will convert them into the structure we need for question answering ([CLS], context, separator, question, separator, padding), feed that into our tokenizer, and pass the tokenizer's output, our token IDs, into BERT. BERT returns a span start and a span end, essentially two numbers signifying the start and end positions of our answer within the context, and the pipeline takes those two numbers and applies them to our context to pull out the text that is our answer. So it's essentially just a little wrapper that adds a few conveniences so we don't have to worry about all of those conversions ourselves. Then we pass in our model and our tokenizer as well, and it's as simple as that: that's our pipeline set up.

To use it now, all we do is call nlp and pass in a dictionary, which, like I said before, needs to contain our question and our context. For the question we pass the first of our questions again, questions[0], and for the context we pass the context variable from up above. This outputs a dictionary containing the score of the answer, which is the model's confidence that this actually is the answer, the start and end indices I mentioned before, and the text those indices map to, which is "United Nations". Our question was "What organization is the IPCC a part of?", and we got "United Nations", which is correct.

Let me show you what I mean with the start and end. If we index the context at 118, we get the first letter of our answer, because we're pulling out that specific character. If we then slice all the way up to our end index, 132, we get the full answer, because we're pulling out all the characters from the "U" at position 118 up to position 132, which is actually this comma here; with Python's slicing we get everything up to the character before, and that gives us "United Nations", our answer.

So let's ask another question: "What UN organizations established the IPCC?". We get "WMO, and the United Nations Environment Programme (UNEP)". If we look in the context, we see it "was first established in 1988 by two United Nations organizations, the World Meteorological Organization (WMO) and the United Nations Environment Programme (UNEP)". So there are two organizations here, and the model is only actually pulling out the full name of one of them; I think the reason is that what BERT is reading there is "WMO, and United Nations Environment Programme", so it does pull out both organizations in the end, just not the full name of the first one. It's still a pretty good result.

Let's go down to the final question: "What does the UN want to stabilize?". Here we get the answer "greenhouse gas concentrations in the atmosphere". If we go down in the context, we can see "the ultimate objective of the UNFCCC is to stabilize greenhouse gas concentrations in the atmosphere at a level that would prevent dangerous anthropogenic interference with the climate system". So again we are getting the right answer. Our model has gone through each of those questions and successfully answered them, and all we've done is write a few lines of code, without fine-tuning anything at all.
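Putting that together, here is a minimal sketch of the pipeline step, reusing the model, tokenizer, context, and questions variables from the sketches above (the exact score, start, and end values depend on the precise context string):

from transformers import pipeline

# Wrap the model and tokenizer in a question-answering pipeline.
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)

for question in questions:
    answer = nlp({"question": question, "context": context})
    print(answer)
    # In the video, the first answer looked like:
    # {'score': ..., 'start': 118, 'end': 132, 'answer': 'United Nations'}

    # 'start' and 'end' are character offsets into the context string:
    print(context[answer["start"]:answer["end"]])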
When you do go and apply these models to your own problems, sometimes you won't need any fine-tuning and the model as-is will be more than enough, but a lot of the time you will need to fine-tune it, and in that case there are a few extra steps. For this introduction, though, that's everything I wanted to cover. I have covered fine-tuning in another video, and I will put a link to that in the description. That's everything for this video, so thank you very much for watching. I hope you enjoyed it, and I will see you again next time. Thanks, bye!
Info
Channel: James Briggs
Views: 1,843
Rating: 5 out of 5
Keywords: python, tensorflow, huggingface, transformers, nlp, natural language processing, pytorch, bert, electra, q&a, qa, question-answering, question answering, question and answer, machine learning, data science, programming, tutorial, learn, software development, python3
Id: scJsty_DR3o
Length: 19min 49sec (1189 seconds)
Published: Fri Feb 19 2021