ML.NET 2.0 Text Classification in C#

Captions
Hey folks. So last month ML.NET 2.0 was released. Now, if you're not familiar with ML.NET, that is the open source machine learning library for the .NET platform, so I can use my C#, F#, or VB code to train machine learning models just like I could in another language. I can also use the same languages to take a pre-trained model, or one I've trained myself, and use it in an ASP.NET or other type of application, such as UWP, WinForms, stuff like that, to generate predictions based on new data. So I don't have to use Python or JavaScript or some other programming language if I don't want to in order to add machine learning capabilities to my applications.

In this video I want to walk you through one of my favorite new and improved features in ML.NET 2.0, and that is text classification. Text classification is where we can take a piece of text and classify it. So let's say we might want to classify something as a positive thing or a negative thing; text classification will do that. I polled my friends on the internet, who are about as weird as me, and they suggested I write example code using turtles, so that's what we're going to do. Bear with me here, this one gets a little weird.

So here I am in Visual Studio 2022 Community, and I have a Program.cs file with a few using statements: Microsoft.ML, ML.Data, TorchSharp, and NasBert. TorchSharp is a .NET library that wraps the native engine behind PyTorch, the Python machine learning library. As for NAS-BERT, well, BERT is a standard transformer model, and the NAS part makes it a more compressed version of it. So this is a very portable transformer that we can use to work with text. It's OK if you don't know what a transformer is, that's fine. Just know that this is kind of a big deal: it's a big pre-trained model that we're able to take advantage of.

In order to use this stuff I have to have a few NuGet references. So if I right-click my project and say Manage NuGet Packages, I go in and look at my installed packages, and I see I've
got Microsoft.ML, Microsoft.ML.TorchSharp, and TorchSharp-cuda-windows. Microsoft.ML is the core ML.NET library, TorchSharp lets me do some things with the text classification, and this CUDA Windows thing lets me take advantage of GPUs on Windows. There's a Linux version of this as well, so you can use that while you're training to get a little bit more efficient training. Now, it does take a little bit of time to install this thing. It took about five to 20 minutes, as I recall, so it does take a while to get that installed, because you're actually downloading and installing NAS-BERT, and that's a pretty hefty boy.

OK, so let's look through the code for this. We have our using statements, and a lot of this code is going to be very standard ML.NET code. The first thing we're going to do is create an MLContext. MLContext is the standard object that everything flows through in ML.NET, and we need it in order to do anything at all with machine learning. Now, we are telling it which GPU to use, so kudos to you if you've got multiple ones, but zero should probably be fine for most people. And if you don't have a compatible GPU, well, we can always just fall back to the CPU as well.

So that gives us our context, and from that context we can now go ahead and start loading some data. Here we're loading in a data view. This is an IDataView, which is basically a data frame kind of object, if you're familiar with pandas or some other type of data manipulation library. But here we're loading it from a text file, specifically this turtles.tsv, TSV meaning tab-separated values, and the delimiter for a tab-separated values file is going to be a tab character. I'm saying this particular file doesn't have a header, because I don't have a row at the top of this file with labels on it. So this turtles.csv file, or sorry, TSV file, is just a collection of sentences and then a number. This number here actually corresponds
to an intent. In this application the user might say something related to turtles, because again, my friends are weird, I'm sorry, and that thing might be: do they want to eat a turtle, do they like turtles, are they talking about something unrelated that happens to tangentially deal with turtles, are they talking about Teenage Mutant Ninja Turtles, or are they just talking about general turtle care? So we get about five different buckets that my friends' statements fell under, and I represented all that in an enum.

OK, so we're loading that up into this data view, and we're representing each row with a ModelInput class. ModelInput here is a really simple class that has a sentence and a label. The annotations on the sentence and the label tell each one, hey, you're the first column, and you're the second column. We're also giving it a label, which we're going to talk a little bit more about in a minute. But this is a really simple C# class that happens to have some ML.Data annotations on it to help ML.NET understand how to load the file up. So this right here gives us the data view containing all of our data from our tab-separated values file.

So that's how we load our data, and next we want to start training on our data. In machine learning in general, we split our data into a training set and a test set. Those two halves, or actually they're usually not halves, it's usually 80/20: we give the training set to the process while it's actually training, and we keep the 20% left over to see how good the model is at handling values it hasn't seen, as a check against overfitting. So here we're going in and splitting the data view, the data we loaded, and we're saying, hey, I only want 20% of it in my test fraction. Eighty percent of it is going to go into the training fraction. And then I just have these little variables here to grab out the training data and the test data, and with that
we can now start to use it to actually train our model. We're going to see that in a little bit here. So this is dividing our data into two chunks, 80 and 20.

Next we're going to go in here and actually create the pipeline. A pipeline is a reusable way of running machine learning experiments in a consistent way, so it's a series of steps. There are a couple of steps here to map values to keys. This is really just relabeling things the way that ML.NET expects them to be for our text classification stuff. The next thing we're doing here is the text classification itself. This is what's actually going in and using NAS-BERT. It is saying, hey, you've got a bunch of sentences with labels; I'm going to learn the ins and outs of what makes a given sentence have this specific numerical label, and I'm going to kind of internalize that. Right now the only BERT architecture that's supported is RoBERTa, but we have this enum here for future expansion if more get added.

So that gives me my pipeline. It's not actually training anything yet. To actually train our model, we have to go in and say, hey, pipeline.Fit, and we fit it to the training data. That's the 80% of the data that we loaded up earlier. This is going to synchronously wait and train our model until it's kind of memorized the patterns in there, and it produces this trained model, which we call a transformer, or ITransformer, in ML.NET. Once we have this, we can save it out to disk as a zip file or another format and load it up later on. Training usually takes a long time, but once we have a trained model it's usually very quick to load it up and use it to make predictions.

But before we use it, we should see how well it performs, so we can get some metrics out of there. So what we do is we go in here and we say, hey, I wanna, I
wanna grab my test data set, this is the 20% we left over, and I want to transform it using my trained pipeline, and then I'm going to evaluate it. So I'm going to call mlContext.MulticlassClassification, multiclass here meaning that we have five different possible classes that something can be in. If it was only two it would be binary classification, but in this case it's five, so that's multiclass. Anything three or more is multiclass. If it's just two, it's binary classification. If it's one or zero, what are you even doing here? That's not a thing.

So with multiclass classification we're saying, hey, I want to see how accurate we are at predicting various things, and that's going to give us some metrics: macro and micro accuracy, log loss, and other things. But what's really cool about this, the metric that I use the most, is the confusion matrix, and we have one of those as well. So here I have some code to loop over all of my turtle intents and print out, like, zero is liking turtles, one is eating turtles, that kind of thing, just to print the mapping for the user. But then over here on line 68 I've got this GetFormattedConfusionTable, and what this is doing is using ML.NET to generate a confusion matrix so that we can see how our model performs. I'm going to just run this and we're going to take a look at that confusion matrix and see what that looks like.

So it looks like it's generated it. We have our confusion matrix. Again, I didn't have to write any code to format any of this stuff here; the only code I wrote was really to loop over all these different classes. Here we see the actual value something was. So here's a case where something actually was an intent where they're talking about eating turtles, and we see that our model predicted it was eating turtles 26 times, but two times it actually predicted they were talking about
something completely unknown. So in general we're pretty good at recognizing that intent: our recall is about 92 percent, right? That's pretty good. Now if we go over here we can see some of the strengths and weaknesses, and we see, hey, here's an eight right here. So sometimes when something is a Ninja Turtle reference, our model predicts that somebody's talking about eating the turtles. So there are some inaccuracies there, and in fact it's more likely to get it wrong than right in that case. But this is a confusion matrix. I have a whole other video and article about confusion matrices if you're interested, but it's a really handy way of seeing how good your model is and what strengths and weaknesses it has.

OK, so once you know you have a good model and you want to go forward and use it to generate predictions, you can do that pretty easily. What we do is create a prediction engine. So we say mlContext.Model.CreatePredictionEngine, and we tell it what type of row it's going to get in as an input and what type of value it should give out as an output. In this case we're calling it ModelOutput. It doesn't have to be named that, but generally the auto-generated stuff uses ModelInput and ModelOutput, from what I've seen. This really is just the same stuff: we have a label and a sentence, and we have predicted label and score. These are the two new things that we get from a model output. We get the label that we're predicting it to be, and again, the label here is going to be like we're talking about liking a turtle, or caring for the turtle, or Ninja Turtles, something like that. Here it's a float, but this is actually going to be the enum value. And the score is an array of scores for all the different intents. For us, we have five different classes something could be, so it's going to be an array of five numbers, and the one with the highest score is going to be the
one that gets picked. So that can help you see how close two intents were, whether the winner is clearly different from the other intents, that kind of thing.

So once you have the prediction engine, you can go through and use it to make some predictions. Here I've got a do-while loop where I'm saying, hey, what do you want to say about turtles? And it's going to generate a prediction using engine.Predict. I'm creating a new ModelInput based on what the user is saying, I'm calling engine.Predict and giving it that, and I get a ModelOutput out, which then lets me go in and say, hey, here's the intent that we matched. I can even get the score if I wanted to see that as well. These scores don't really range from zero to one, in my experience. They really are just relative to each other, so they're not as reliable or useful as something that might be more percentage-based. Keep that in mind while you're writing this. And this is basically just going to loop until the user chooses to quit.

OK, so I'll run this and fast-forward until we have a trained model, and we'll take a look. OK, so our model is now trained, fairly similar to the last one, it looks like, with some slightly different characteristics. So what do I want to say about turtles? Well, let's say "I like turtles." OK, it says, hey, yeah, I think you're talking about liking turtles. That's great. Let's try "heroes in a half shell." In this case it got it wrong. That should have been a Ninja Turtles reference, but it says unknown. Sorry, PETA, but I'm going to try to test the eating-turtle intent, so I'm going to say, "Tonight I dine on turtle soup." And again, unknown. The training data for this is not the best. I kind of got my friends to give me random statements, and it's not as balanced as you might expect in a real data science
experiment, but it's good enough to show us how to work with this stuff. We'll try one more: "Turtles are our friends." And again... OK, so I could use some better training data here. And if you had better training data, you're going to see better output in general, because your results are really only going to be as good as your training data.

The key thing here that's different for text classification is we're using this text classification stuff right here. We're telling it, hey, what did the user type in, and what label do I want to predict? This is really, really handy if you're building a chatbot or something like that, where the user gives you an utterance and you have one of a few different commands you might map it to. You could use something like Azure Cognitive Services with Conversational Language Understanding, or the older LUIS (or "Luis", depending on your pronunciation), Language Understanding. Those things will give you this capability without you having to write a whole lot of code at all, but you are paying per usage for that. So if you're like me and you want to write some unit tests that happen to involve classification, well, this is free, right? This is very, very free. You're just spending your time on it. You have to understand this code, but once you go through this flow a few times, it's not as scary.

So I hope that helped you understand a little bit more about text classification in ML.NET. If there's anything in ML.NET you're curious about, or machine learning in general, let me know. Leave me a comment. I'm always happy to add something to my queue of content to create. Happy coding, and have fun.
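
The recall figure quoted for the eating-turtles row of the confusion matrix can be checked by hand. The sketch below is a minimal, self-contained C# illustration that does not use ML.NET at all. The `RecallSketch` class, the `Recall` helper, and the column order of the row are assumptions made for illustration. Only the two counts come from the video (26 predictions of eating turtles, 2 of unknown); the remaining cells of the row are assumed to be zero.

```csharp
using System;
using System.Linq;

public static class RecallSketch
{
    // Recall for one class: correct predictions for that class divided by
    // every example whose actual label is that class (one confusion-matrix row).
    public static double Recall(int[] actualRow, int classIndex)
    {
        int total = actualRow.Sum();
        return total == 0 ? 0.0 : (double)actualRow[classIndex] / total;
    }

    public static void Main()
    {
        // Hypothetical row for the actual "eating turtles" intent.
        // Assumed column order: eating, unknown, then three other intents.
        int[] eatingRow = { 26, 2, 0, 0, 0 };

        double recall = Recall(eatingRow, classIndex: 0);

        // 26 correct out of 28 actual examples.
        Console.WriteLine($"Recall for eating turtles: {recall:P1}");
    }
}
```

With these counts the ratio is 26 / 28, which is in line with the roughly 92 percent mentioned in the video. The same helper applied to the Ninja Turtles row would need that row's full counts, which the video does not read out.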
Info
Channel: Matt on Data Science
Views: 13,299
Keywords: Data Science, Machine Learning, ml.net, dotnet, visual studio, csharp, text classification
Id: m0sTUP39jlM
Length: 15min 48sec (948 seconds)
Published: Sat Dec 10 2022