Introduction to AutoML [Pt 10] | Generative AI with .NET for Beginners

Video Statistics and Information

Captions
[Music] Hello and welcome back to this Machine Learning and AI for Beginners series. In this video we're going to do a quick introduction to automated machine learning, or AutoML. We're going to briefly talk about what the training workflow looks like for machine learning models, we're going to talk about what automated ML is, and then through a demo we're going to show how you can use AutoML to train a machine learning model.

So if we take a look at the machine learning training workflow, it typically looks like this. You often start out with a problem that you want to solve. This can be something like forecasting sales, or you want to build some sort of automated system that's going to triage and classify customer issues. Once you have your problem, you want to get data that's going to help you train this machine learning model. This prepare-data step usually involves things like merging data from different sources, getting rid of duplicates, and getting rid of missing values, those sorts of things.

Once you have the problem or task that you're looking to solve and the data set that's going to help you train your machine learning model, you go through the process of choosing an algorithm. Depending on the problem that you're looking to solve, whether that's categorizing values or predicting a numerical value, you're going to choose a different algorithm. In addition to that, you're going to go through the process of tuning hyperparameters. In general, hyperparameters are just settings that help guide the algorithm during the training process so that you come out with the best model for your data set. And then last but not least, you need to evaluate that model and see whether it's performing up to the standards that you have set for it.

This often tends to be a very iterative and experimental process. So as
you're going through it, maybe your data set works better with one algorithm versus another, and depending on the algorithm you choose, the hyperparameters or settings that you tune may be different. You may want to try different iterations of both algorithms and hyperparameters and, in the evaluation phase, determine which one works best for your particular data set and the problem that you're looking to solve. Again, this can be a very manual and very time-intensive process. So can we do better? In some cases you can, and automated machine learning can help you here.

So what is automated machine learning? Automated machine learning, or AutoML, automates the process of applying machine learning to your data. Given a task and given a data set, you can run AutoML to iterate over the different data transformations, machine learning algorithms, and hyperparameters in order to find the best model for your data set. If we think back to that machine learning training workflow, you again start out with defining a problem and with a data set, but as you can see, AutoML replaces those other steps of choosing your algorithm, tuning the hyperparameters, and evaluating to find the best model.

We're going to take a look in a second at what this looks like, but I'd just like to remind you that you can find any samples over at aka.
ms.net A- beginner. Let's take a look at some code.

In this scenario we have a bunch of issues from GitHub; in this particular case they're from the .NET GitHub repository. As you can see, there's the ID of the issue; the area, which is the label that's applied to these different issues; the title of the issue; and then the description, which is just a longer explanation of what the issue is about. What we want to do in this case is build a machine learning model that tries to categorize the area, or automatically apply this label, when a new issue comes into the repository.

So how might we go about doing that with AutoML? First of all, we start by installing the AutoML set of NuGet packages, which are over at Microsoft.ML.AutoML on NuGet. We go ahead and add our using statements and then we initialize our MLContext. In ML.NET, MLContext is the entry point for all applications. Then you're going to notice that we use this Auto construct here and call the InferColumns method, and we tell it: here's our data set; here's how the data is separated, it's tab delimited; here's the label column, the thing that we're trying to predict or automatically infer, which if you recall is this area; and then last but not least, just make sure that the columns are processed separately.

If we take a look at the results of the column inference process, you can see that there are options for the text loader that contain information about each of the individual columns. For example, if we go to column number one here, there's the name of the column, this area, and you can see that it contains some other information as well. And then for column information, you can see that it matches the columns to a respective type. So for example, ID is a numeric data type, and then the text
columns are title and description.

Great. Now once we have that information, we're going to use the text loader options to create an ML.NET text loader and load our data into an IDataView, and an IDataView is just the way that ML.NET represents data. We don't really want all of the columns here; we want to use just the title in order to predict our label, to make it quick and easy to determine what that label is going to be. In that case, ID and description are not columns that we're looking to keep, so we're going to drop both columns from our original data set, and what you end up with is just the area and title columns.

Next we're going to go ahead and split our data set. It's usually good practice to split your data set into training and evaluation sets, or validation sets, and the reason for this is that you don't want the model to overfit or over-index on the data that it's looking at. This is a general problem that you have to be aware of as you're training these machine learning models. You can think of it like studying for a test. Imagine that as you were studying for the test you knew what all the answers were, so you go ahead and read the questions and the answers, and on test day you come in and get the highest marks. That doesn't necessarily mean that you actually understand the subject matter; it just means that you memorized the answers. So what we're doing here by splitting this data set into train and validation is that 80% of it we're going to use to train the model, and the other 20% we're going to leave out so that we can test how well the model performs on data that it hasn't seen before.

After we split our data set, we go ahead and create our pipeline. Our pipeline is going to include this featurization that
takes the raw data that's there and converts it into something that the algorithms can actually process. Typically these algorithms need numerical values to train. As you can see, our area and our title are both strings, so what this featurizer is going to do is transform those into numbers that can then be passed into the different algorithms for training. You can also see that there's this MultiClassification section here, and what this is doing is basically saying: I want to train a model that can predict or apply labels across many different categories.

That's it, that's all we need to do. At that point we just create an experiment for automated ML to run over and we define a few settings. We say: in this experiment I want you to run the pipeline that we've defined above; this is the metric that we're going to use to evaluate our model; we want to train for 2 minutes, or 120 seconds; and here is the data that I'm going to be using to train my model. Once that's set up, we're for the most part ready to run.

We also want to configure a monitor here. What this monitor has is a set of life cycle events, so that depending on which stage of the training process the AutoML experiment is in, it's able to perform some task. You can see here that as trials are completed, that is, every time a run of the AutoML process finishes, we want to log a few things and output them to the console. These are completely up to you to implement how you see fit, depending on what your needs are. Now that we have our monitor, we add it to the experiment as the thing that's going to update us on the status of the training process.

Now that that's all configured, we basically just run the experiment, and throughout this process you're going to see that there are a few things
that take place. In the 2 minutes, or 120 seconds, that I gave it, it's going to iterate, given my data set and given the pipeline or the task that I'm looking to solve, through different algorithms and data transforms, and tune those hyperparameters in the process. You can see here, for example, that in trial one this was the pipeline that was built, made up of the transforms as well as the algorithm that it's using. You're going to notice that it tried FastForest OVA here and it also tried FastForest OVA here. You may think, well, it's just doing the same thing. No: what this actually means is that it tried this algorithm twice, but with different hyperparameters. Similarly, not only is it iterating over the hyperparameters, it's also choosing different algorithms. You can see here it was FastTree OVA, FastForest, and in this case it used SDCA maximum entropy. So again, it's automating that training process for us.

Once the training is complete, the best model is captured here, in the experiment result's model property. You can see that these are, again, the different transforms that were used, and in the best model the algorithm that it picked was this one-versus-all algorithm. If we display the metric, our macro accuracy, we can see that it's 0.69, which for our use case is good enough. It goes from 0 to 1, one being best and zero meaning not a lot of predictive capability, so 0.69 is good enough given this limited data set.

Once we have this, it's time to try out the model. One of the things I want to call out here: I keep saying model, but these models are essentially just state and information about your data set that was learned during the training process, so that given new data, the model can infer or make
predictions on that particular data point. What we're going to do here is use that validation data set, so this train validation data split and then the test set, to generate predictions. If we preview those predictions and take a look at the very first row of the validation data set, you can see what the original area was for this particular issue. The title of that issue is "add Linux Mac build scripts" and the original value is area infrastructure. If we take a look at the predicted label, the value that was inferred by the model, you can see that it also predicted area infrastructure. So in addition to using the metrics, we can spot check here as well whether the model is behaving the way that we would expect.

Once we're happy with that model, we can just go ahead and save it to this model.mlnet file. Again, this is just a serialized version of the model that you can then reload into your .NET application and use to make predictions.

So in this video there are a few things that you learned. You took a look at what the machine learning training workflow generally looks like, you learned how automated ML, or AutoML, can help you automate that process, and then we went over an example of how you can use AutoML to automate this process of training custom machine learning models, specifically in the case of categorizing or triaging GitHub issues. Thanks for watching, see you in the next video.
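
Editor's sketch: the trial loop described in the video (split the data 80/20, try several algorithm and hyperparameter combinations, report each completed trial to a monitor, keep the model with the best macro accuracy) can be illustrated roughly as follows. This is plain Python using only the standard library, not the ML.NET AutoML API and not the code shown in the demo. The issue rows, the label names, the two toy "algorithms", and every function name are assumptions made up for illustration, and a fixed list of trials stands in for the 120-second time budget.

```python
import random
from collections import Counter, defaultdict

# Hypothetical (label, title) rows standing in for the GitHub issue data set.
ISSUES = [
    ("area-infrastructure", "add linux build scripts"),
    ("area-infrastructure", "fix mac build scripts"),
    ("area-infrastructure", "update build pipeline image"),
    ("area-infrastructure", "build fails on linux"),
    ("area-infrastructure", "add ci scripts for mac"),
    ("area-networking", "http client request timeout"),
    ("area-networking", "socket connect timeout"),
    ("area-networking", "http request header parsing"),
    ("area-networking", "dns lookup fails for client"),
    ("area-networking", "socket closes during request"),
]

def split_train_validation(rows, validation_fraction=0.2, seed=0):
    """Shuffle, then hold out a fraction of rows the model never trains on."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def macro_accuracy(actual, predicted):
    """Average of the per-class accuracies, so every label counts equally."""
    correct, total = Counter(), Counter()
    for a, p in zip(actual, predicted):
        total[a] += 1
        correct[a] += int(a == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def train_majority(train):
    """Toy algorithm 1: always predict the most common training label."""
    label = Counter(lbl for lbl, _title in train).most_common(1)[0][0]
    return lambda title: label

def train_word_votes(train, min_word_len=1):
    """Toy algorithm 2: each known word votes for the labels it appeared with.
    min_word_len is the hyperparameter: shorter words are ignored."""
    votes = defaultdict(Counter)
    for label, title in train:
        for word in title.lower().split():
            if len(word) >= min_word_len:
                votes[word][label] += 1
    fallback = train_majority(train)

    def predict(title):
        tally = Counter()
        for word in title.lower().split():
            tally.update(votes.get(word, {}))
        return tally.most_common(1)[0][0] if tally else fallback(title)

    return predict

# The same algorithm can appear twice with different hyperparameters.
SEARCH_SPACE = [
    ("MajorityBaseline", train_majority, {}),
    ("WordVotes", train_word_votes, {"min_word_len": 1}),
    ("WordVotes", train_word_votes, {"min_word_len": 4}),
]

def run_experiment(rows, search_space, on_trial_completed=None):
    """Train every candidate, score it on held-out rows, return the best."""
    train, validation = split_train_validation(rows)
    actual = [label for label, _title in validation]
    best = None
    for trial_id, (name, trainer, params) in enumerate(search_space, start=1):
        model = trainer(train, **params)
        score = macro_accuracy(actual, [model(t) for _lbl, t in validation])
        if on_trial_completed:  # plays the role of the monitor callback
            on_trial_completed(trial_id, name, params, score)
        if best is None or score > best["score"]:
            best = {"trial": trial_id, "name": name, "params": params,
                    "score": score, "model": model}
    return best

if __name__ == "__main__":
    best = run_experiment(ISSUES, SEARCH_SPACE,
                          lambda *trial: print("trial completed:", trial))
    print("best:", best["name"], best["params"], best["score"])
    print("prediction:", best["model"]("add linux mac build scripts"))
```

With a data set this small the scores are not meaningful; the point is only the shape of the loop. A real AutoML run also searches over data transforms and picks its next trial adaptively rather than walking a fixed list.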
Info
Channel: dotnet
Views: 4,699
Keywords: .NET
Id: Wmybg70CW9A
Length: 12min 33sec (753 seconds)
Published: Fri Nov 10 2023