Custom NER with spaCy v3 Tutorial | Free NER Data Annotation | Named Entity Recognition Tutorial

Captions
Dear fellow coders, welcome to 1littlecoder. In this applied NLP Python tutorial I'm going to teach you how to create your own custom named entity recognizer using spaCy v3. We are going to explore an open-source tool called NER Annotator, built and open-sourced by tecoholic, a developer whose name is Arunmozhi, so thanks to Arunmozhi for making this amazing tool and releasing it as open source. If you have heard of Doccano or Prodigy, this is an open-source alternative to those tools: it lets you do data annotation, specifically labelling text data for NER. After the annotation is done, we will see how to train a custom NER model using spaCy v3. A lot of tutorials on the internet focus on spaCy 2; with spaCy 3 many things have changed, including how you train and how you feed in the annotated data, so I'm going to cover all of that. If you are interested in training your own custom NER model, this video is for you.

There are two ways to use this tool for NER text annotation. The first is to set it up on your local machine, since the code is open source; I have a separate video that walks through that setup step by step, and I'll link it in the YouTube description. But if you don't want to go through that pain and would rather just open a web URL and start annotating, then thanks to tecoholic the tool is also hosted on the internet as a website. Go to the site, which I'll link in the description (you can also reach it from the project's repository), and you land on a page that has everything you need for annotation.

The first thing we need is a text file to annotate. I went to a news website, took an article about cryptocurrencies and Bitcoin, pasted the text into Visual Studio Code and saved it as crypto.txt. That is the file we are going to annotate: click to select a file, choose crypto.txt, and the tool loads it and shows the annotation setup.

Let me quickly break down what you see on the screen. The first option defines what separates one text block from another: a new line, an empty line, or a custom string. With "new line", every line becomes a separate item to label; with "empty line", everything between blank lines is treated as one text; or you can supply a custom separator. With "new line" I get 15 items, but with "empty line" I get 8, so the count changes depending on the separator you pick.
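As a rough illustration of what the separator option does (this is just a sketch of the behaviour, not the annotator's own code, and it assumes a local crypto.txt like the one pasted from the news article):

```python
# Sketch: how the separator choice changes the number of items to annotate.
with open("crypto.txt", encoding="utf-8") as f:
    text = f.read()

# "New line" separator: every non-empty physical line is its own annotation item.
per_line = [line for line in text.splitlines() if line.strip()]

# "Empty line" separator: blank lines delimit blocks, so paragraphs stay together.
per_block = [block.strip() for block in text.split("\n\n") if block.strip()]

print(len(per_line), len(per_block))  # e.g. 15 vs 8 for the demo file
```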
The second thing is to define the tags you want to apply. For example, I'll add a tag called crypto for cryptocurrencies, so I can mark Bitcoin as crypto. I'll also add value for dollar amounts (value feels better than money here) and percentage for percentage figures. That is how the tagging starts: crypto, value, percentage. The workflow is to click the tag you want and then select the span of text it applies to. So I mark Bitcoin and Ethereum as crypto, the dollar figures as value, the percentages as percentage, and save the first item.

On the next item, Dogecoin is a crypto. We also have an organization and a person here, so I add two more tags, org and person, tag Tesla as org and Elon Musk as person, and again tag the percentages and values. Next, Shiba Inu is a coin, so I tag it (making sure to select the whole word), along with its percentage and value, and keep going item by item. I'm not going to annotate absolutely everything for this demo: I tag Coinbase and CoinMarketCap as organizations, save, and skip the remaining sentences until the website says all sentences are completed.

At that point I click "Export Annotations" and save the training data as a JSON file, training_data.json. To recap: first we specified the text separator, then we defined the tags (the entities), then we did the tagging itself, and finally we exported the annotations. The first part of what we set out to do is done: we used NER Annotator, an open-source text annotation tool built for named entity recognition, and successfully annotated the input text that we will use to build a model. Step one: upload the text data, create the tags, annotate, export the JSON. Now we are going to see how to train the custom NER model.
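Before moving on, for reference: the exported training_data.json roughly follows the structure below, with the tag names under "classes" and one [text, {"entities": [...]}] pair per annotated block under "annotations", each entity given as character start/end offsets plus a label. The field names match how the file is used later in this tutorial; the example text and offsets are made up for illustration.

```python
# Approximate contents of training_data.json exported by NER Annotator.
example_export = {
    "classes": ["crypto", "value", "percentage", "org", "person"],
    "annotations": [
        [
            "Bitcoin rose past $60,000 on Tuesday, up 5% for the week.",
            {
                "entities": [
                    [0, 7, "crypto"],        # "Bitcoin"
                    [18, 25, "value"],       # "$60,000"
                    [41, 43, "percentage"],  # "5%"
                ],
            },
        ],
        # ...one [text, {"entities": [...]}] pair per annotated block
    ],
}
```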
For custom NER model training we are using spaCy v3. If you want spaCy 2 you can use the older code or refer to my previous tutorial, which I'll link in the YouTube description, but the focus here is spaCy v3, where a lot of things have changed.

To start with, let me clear my environment (Runtime → Factory reset runtime) so that any earlier data is gone, and reconnect. I have a Google Colab notebook, which I'll also link in the description so you don't have to struggle to get started. First I upload my JSON file to the Colab session; this is the annotated text we just exported from NER Annotator, and uploading it makes it easy to access from the code.

The next step is to install spaCy. By default Google Colab ships with a spaCy 2.x version, not 3, so you have to upgrade it. I'm installing it quietly, so you won't see much output, but you can see it finishing. After it installs, we check which spaCy version the Colab session is running: it reports 3.2.1, so our requirement for spaCy v3 is met. On your own machine you may already have spaCy 3 (which is probably why you are watching this video), but I wanted to show the check anyway. If your JSON file lives somewhere else rather than on your local machine, you could wget it instead, but we have already uploaded the file, so that isn't needed.

Next we need to convert the JSON file into a .spacy file. Why? Because starting with spaCy v3 the old NER training workflow no longer applies; training expects a .spacy file, which is a serialized DocBin object. So we are going to convert the annotated JSON we just exported into the format spaCy wants. The documentation lists several advantages of DocBin, such as speed and efficient serialization, which you can read about there; here I'll just show you how to do the conversion. We import spacy, DocBin from spacy.tokens, and tqdm (only to show progress); this is not code I wrote, it comes straight from the spaCy documentation. Then we open the JSON file, using its exact filename, and load it into a train_data variable. I initially forgot to import json, so after adding the import and giving the correct filename it works. Just to verify the load, I print train_data and look at its structure (a sketch of these cells follows this section).
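In Colab cells, the upgrade, version check, and JSON loading described above look roughly like this (a minimal sketch; training_data.json is assumed to be the file uploaded to the session):

```python
# Upgrade spaCy to v3 in the Colab session (Colab ships with an older 2.x build).
!pip install -U -q spacy

import json
import spacy

print(spacy.__version__)  # should report a 3.x version, e.g. 3.2.1

# Load the annotations exported from NER Annotator.
with open("training_data.json", encoding="utf-8") as f:
    train_data = json.load(f)

print(train_data.keys())  # expect dict_keys(['classes', 'annotations'])
```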
You can see it has all the information we want: classes and annotations, and the annotations are what we care about. I delete that inspection cell and move to the next one, where we take the annotations we just looked at and build the .spacy DocBin object. You don't need to understand every line of this snippet; just copy it from the documentation and use it (it is shown after this section). I run it, and the output file should be named training_data.spacy; I delete the stray object I created with the wrong name, just to keep the naming convention consistent. Now we have the required .spacy file.

The next couple of steps use the CLI, because spaCy v3 training happens through the CLI, and to start that process you need a config file. There are two ways to get one. The first is spaCy's online quickstart widget: you select the language (English), the component you want (NER in our case), and whether you want to train on CPU or GPU; with GPU it builds a transformer pipeline, with CPU it uses a tok2vec pipeline, and you can also choose whether to optimize for efficiency or for accuracy. You download the base config it generates and then run spacy init fill-config, which takes the base config and produces the final config file. The second option, which is what we do here, is to use the CLI directly: spacy init config with the config file name, the language (English) and the pipeline (ner), so you can specify on the command line whatever options you would have chosen in the widget.

I run this without the GPU option; in fact I haven't selected a GPU runtime in the Colab notebook either, so everything runs on CPU. Once you run it, it takes a moment and then notes that if you want the more effective transformer-based config (GPU only) you also need to install the spacy-transformers package; since we are not using a GPU, that's not a problem. It confirms that the generated config does not use transformers and echoes the settings: language English, pipeline ner, optimized for efficiency, hardware CPU, transformer none, all values filled. The config file is saved, and you can open the file browser and see that it is available for you
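The conversion snippet referenced above is essentially the pattern from the spaCy documentation; here is a sketch, assuming the JSON shape shown earlier (annotations as [text, {"entities": [[start, end, label], ...]}] pairs):

```python
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")   # blank English pipeline, used only for tokenization
db = DocBin()             # container that serializes to a .spacy file

for text, annot in tqdm(train_data["annotations"]):
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        # char_span returns None if the offsets don't align with token boundaries
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity:", text[start:end])
        else:
            ents.append(span)
    doc.ents = ents
    db.add(doc)

db.to_disk("./training_data.spacy")
```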
to use. The next thing is to run python -m spacy train with that config (both CLI commands are shown after this section). You already have the training data (the .spacy DocBin file) and the config, so now you start the training process: spacy train, the config file, and the output path where the trained pipelines should be saved. I'm telling it to dump everything into the current root folder, but you could create an output directory in Colab and point to that instead; it's up to you. Then comes the training path, which is our training_data.spacy file, and the validation path. Because this is just a demo we are reusing the training file as the validation file; in the real world you would not do that. You would have run the same NER Annotator process on a separate file and used that as your validation (or test) set. I'm skipping that here to keep the video short, but in a real project make sure you supply proper validation data.

Now I start the run. It can take a while, so you may notice cuts in the video; I'm not cheating you, I just don't want to make you sit and wait. You can see it set up the nlp object, create the pipeline and the vocabulary, initialize the nlp object and the pipeline components, and then start reporting the important numbers, such as the overall loss and the NER loss. You will see convergence happen very quickly here, for two reasons: the text data is very small, and the validation data is identical to the training data. It still runs through the normal training loop, so we just wait for it to finish.

Once it finishes, the next objective is to take some new text and see how well the trained model does. For that we need a test article, so I go back to the news and search for another story about Bitcoin prices; there's one about Bitcoin and Ethereum falling, so I copy that text and keep it ready. While the training is still running, I paste it into the notebook as a string, using triple quotes because it spans multiple lines. Now we just have to wait for the training process to finish, and as you can see, at this point it is still going.
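For reference, the two CLI steps, the config generation from the previous section and the training command described above, look roughly like this in Colab cells (the paths and output directory are just the ones used in this walkthrough, and reusing the training file as --paths.dev is only acceptable for a demo):

```python
# Generate a CPU NER config directly from the CLI (alternatively, download a base
# config from the quickstart widget and run `python -m spacy init fill-config`).
!python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

# Train; checkpoints and the final pipelines are written to the current directory.
!python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy
```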
The scores are not changing much, for the same reasons I mentioned: the training data is very small and the validation data is the same as the training data. Eventually the pipeline is saved to the output directory (and yes, this is genuinely how long it took, no editing tricks). In the output directory you get two pipelines: model-last and model-best.

Next we load the best model. I copy the path to model-best and paste it into spacy.load. Normally you would assign the result to a variable called nlp, but since we already have an nlp object from earlier in the notebook, I call this one nlp_ner. After the model is loaded, we create the Doc object the way you always do with spaCy, and now it's time for the magic: to see whether everything we did in this tutorial actually works. I hope you are as excited as I am. spacy.displacy.render will render the text with the predicted entity tags, so I run it (the snippet after this section shows these steps).

It has done a decent job on the new article: one figure is tagged as value (it missed a percentage, though), Bitcoin is tagged as crypto, another figure as percentage, Ethereum as crypto; there is a bit of a mess where a person's name is tagged as percentage, Elon Musk is correctly tagged as person, and another name is also tagged as percentage. Like I said, our data is very small, but it has done a reasonable job given how little time I spent tagging. It comes back to garbage in, garbage out: if your training data is good and you train properly, your output will be good. So if you want to do this for work or for a serious hobby project rather than a YouTube demo, spend real time on the tagging: tag more than we did (remember we skipped a lot), make sure you tag every entity you see, use more varied text (we used one article from one website, but you could pull articles from several sites for diverse training data), and have a proper validation set so the metrics mean something instead of converging in one epoch. Nevertheless, it has done the job: we now have a trained pipeline that can detect crypto (or web3, whatever you want to call it) entities, and it is a fully custom NER model; notice that we never loaded a pretrained English model, we started from a blank one. We have successfully done custom NER for cryptocurrencies with spaCy v3 using the tool that tecoholic (Arunmozhi) has put out, entirely on the web with nothing on the local machine, which means it should be easy to try even from an iPad, an Android tablet, or whatever device you have.
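Loading the best checkpoint and rendering the predictions, as described above, looks roughly like this (the article text is a placeholder; model-best is the directory spaCy wrote under the chosen output path):

```python
import spacy
from spacy import displacy

# Load the best pipeline produced by `spacy train`.
nlp_ner = spacy.load("model-best")

# Paste the new Bitcoin-prices article between the triple quotes.
test_text = """Bitcoin and Ethereum fell on Tuesday ... (article text here)"""

doc = nlp_ner(test_text)

# Highlight the predicted entities inline in the notebook.
displacy.render(doc, style="ent", jupyter=True)
```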
Once you have done the annotation that way, download the JSON, make it available on your local machine or in Google Colab or wherever you are working, and then follow the custom NER model training process step by step; at the end we also ran named entity recognition on new text. So this video has three parts: the first is text/data annotation, the second is custom NER model training, and the third is the actual inference, where we do named entity recognition using spaCy v3. I hope this video was helpful in learning how to do custom NER with spaCy v3, with data annotation done using an open-source alternative to tools like Doccano and Prodigy.

I would also like to call out that spaCy is made by an amazing company: they make money when somebody buys Prodigy, and most of their tools are available for free. So if you want to support the company, and especially if you are going to use this for commercial purposes, I strongly recommend Prodigy; please make sure your company buys Prodigy if you are using it commercially. But if, like me, you are doing this as a hobby, definitely use the tool tecoholic has built, and please give tecoholic a shout-out on social media for this amazing tool. I will see you in the next video with another Python tutorial; until then, stay safe and happy coding.
Info
Channel: 1littlecoder
Views: 29,645
Id: p_7hJvl7P2A
Length: 23min 51sec (1431 seconds)
Published: Thu Dec 30 2021