Introduction to scikit-learn

Captions
All right, so I'm going to give you a brief tour of scikit-learn today. Before I do that, just a little bit about myself: I did my PhD in computer science, actually a PhD in computer science without taking any classes, at the University of Bonn. I did a brief stint as a machine learning scientist at Amazon in Berlin, went to the NYU Center for Data Science, and I'm now a lecturer at Columbia University in the Data Science Institute.

So what is scikit-learn? scikit-learn is a Python library for machine learning, and I want to talk today about how it might be different from what you're used to, how machine learning is viewed in scikit-learn, and what the main paradigms are. scikit-learn has a lot of stuff in it. It has a bunch of algorithms: essentially everything you'd find in a standard machine learning textbook is implemented there, like classification, regression, k-means clustering, DBSCAN, t-SNE, and so on, together with tools to evaluate these algorithms and tools to tune hyperparameters with cross-validation.

It's been pretty successful. There are some logos of companies that work with it, but actually I think most companies that do machine learning now use it somewhere, so it's heavily used in industry and also in a bunch of research; the paper has tens of thousands of citations, which really makes me want to be on the paper, which I'm not. scikit-learn is a really big community effort. These are the core developers; I think there are about 40, and there are about 50 people contributing every month. I'm not the person who created it (those are some people in the top row), but I've been one of the core developers for the last seven years or so.

What we're really proud of in scikit-learn is in particular our documentation, which you should definitely check out. Some people have said you don't need to read a machine learning book, you can just read the documentation. I don't entirely agree with that, but I think we did a pretty good job. In particular, we have a whole bunch of examples, so you can go through the gallery and see how everything works.

Now I want to talk about how this actually works and how you can use it in practice. First, the basic API. scikit-learn works on NumPy arrays. NumPy arrays are homogeneous, meaning they have a single data type, so they're quite different from data frames. We assume that our input data is a float matrix X, where each row corresponds to one sample and each column corresponds to one feature, or one independent variable. That's our input, and if we do classification or regression we have a separate array of outputs, or labels. So we have two different objects, one for the data and one for the targets. Most of the algorithms in scikit-learn assume that everything is a float, so by default none of them work with categorical data or missing values; that's something you have to take care of yourself. I'll talk a little more about this later, but that's the basic paradigm, and we always call the data X and the labels y.
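To make that convention concrete, here is a minimal sketch of the expected data layout; the numbers are made up for illustration and are not from the talk:

    import numpy as np

    # X: 2D float array of shape (n_samples, n_features)
    X = np.array([[5.1, 3.5, 1.4],
                  [4.9, 3.0, 1.4],
                  [6.2, 3.4, 5.4]])
    # y: 1D array with one target per sample
    y = np.array([0, 0, 1])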
Here's how I think about the machine learning pipeline, in particular supervised machine learning, which is the most commonly used form. You start with training data and training labels and you build a model, which could be linear regression, random forests, gradient boosting, whatever. Then you have some new data for which you don't have labels and for which you want to make predictions. Often you also want to evaluate how good your model is, so you actually hold out test data with test labels, which allows you to say, okay, my model generalizes this well. For me, machine learning means that I'm mostly interested in how well my model generalizes. I know there are people, maybe with more of a statistics background, who are more interested in understanding the model. scikit-learn is really tuned towards making predictions, and making accurate predictions; none of the models in scikit-learn have any p-values associated with them. If you're looking for that, this is not the right tool; if you want to make accurate predictions, then scikit-learn will be helpful.

So this is the standard workflow in supervised machine learning, and there are basically two phases: you fit a model to your training data, and then you want to generalize to new data. This is mirrored very directly in the scikit-learn API. All the algorithms in scikit-learn are implemented as Python classes. Say I want a random forest classifier: there's a Python class called RandomForestClassifier that trains random forests for classification. These objects encapsulate both the algorithm for building the model and the algorithm for making predictions, and they also store all the model parameters. Here, for example, with the random forest classifier, if I call fit, fit will build the model and store all the trees, all the splits of the trees, and so on, in this clf object. All models in scikit-learn have a fit method, and it always looks exactly the same: it always gets the training data, which I call X_train, and if it's a supervised algorithm, like here, it also gets the desired outputs, y_train. So this fit method stores the model in clf, and that's all you have to do to build a random forest. If you want to apply it to new data, there's clf.predict, to which you pass any new data you have (you could also pass the training data), and it returns the predictions according to the model. There's also a score method, which is basically just a helper that does both the prediction and the evaluation against some known ground truth: clf.score makes predictions on X_test according to the model, compares them to y_test, and reports the accuracy. This is the most common interface in scikit-learn, and all the models for classification and regression follow exactly this interface; you mostly need to think about fit and predict, these are the core methods.
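Here is a minimal runnable sketch of that fit/predict/score interface; the toy dataset and the split are my own illustration, not from the talk:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)            # illustrative toy data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)        # builds the forest and stores it on clf
    y_pred = clf.predict(X_test)     # predictions for unseen data
    acc = clf.score(X_test, y_test)  # predict on X_test, compare to y_test -> accuracy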
There's another interface in scikit-learn that's also very important, in particular for unsupervised learning and preprocessing. Say someone gives you a training data set and you want to do PCA. You have your training data, there are no labels or ground truth; you just have your matrix X, and you build a model from it. Then, when you get some new data, or you have your test data, you want to apply this model, and it gives you a new view of your data, say a projection onto the principal components. This is a slightly different task and has a slightly different interface. The way this works in scikit-learn is, again, everything is encapsulated in an object. If I want to do PCA, I instantiate the PCA object and I call fit; again, all things in scikit-learn have a fit method, and here I just give it the training data X because it's an unsupervised method. Then, if I want to actually project onto the principal components, I use the transform method on any data, and this gives me my new view, my new representation of the data, X_new.

So these are basically the three main methods you need to understand for scikit-learn. We call our models estimators, and an estimator could be anything, a random forest or something that scales your data. All of them have a fit method, which always takes the data X and, for a supervised method, also some target output y. If you predict something that is like a labeling, so for classification, regression, and clustering, you use the predict method. And if you want a new view of the data, a new representation X_new, you use the transform method, which is used for preprocessing, dimensionality reduction, feature selection, and feature extraction. So these are the two main building blocks: things that transform your data and things that make predictions.
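A minimal sketch of that fit/transform interface using PCA; the dataset and the choice of two components are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)   # unsupervised: labels are not used

    pca = PCA(n_components=2)
    pca.fit(X)                  # learns the principal components from X
    X_new = pca.transform(X)    # projects the data onto those components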
We also have more building blocks for standard machine learning tasks, in particular for model evaluation and selection. There are tools to do a train/test split, which is very simple, so I'm not actually going to talk about it. But often, instead of doing a single train/test split, you want to use cross-validation. I think you're all familiar with cross-validation, so I can skip this: you split your data into, say, five folds, you hold out one fold and train on the others, and this gives you a more robust estimate of the generalization performance of your model. If you want to do that, there's a function (not a method) called cross_val_score in scikit-learn. cross_val_score gets an estimator, data, and labels, and you tell it how much cross-validation you want to do; by default it is three-fold stratified cross-validation, and here I say I want five-fold stratified cross-validation. This returns the scores on the held-out set for each iteration, so here I get five scores for the five splits of the data, and then I can compute the mean and standard deviation or something like that. This is one of the most commonly used tools.

Another tool that's probably used even more commonly is grid search, to adjust parameters, because all models have parameters, you always need to tune them, and it's always a bit of a pain. The workflow that I usually encourage people to use, which might not be the one you use but is the one I favor, is: you take your data, you do a training and test split, then you do cross-validation on your training data set to tune parameters, and then you do a final evaluation on your test data. This way you have an unbiased estimate of the generalization performance, obtained by running your tuned model on the test set. If you just did cross-validation to tune your parameters, then your estimate of the generalization performance would be too optimistic. This workflow, which I think is the standard supervised learning workflow, is pretty easy to implement in scikit-learn. The main class you need is GridSearchCV, which implements grid search with cross-validation.

If I want to run this whole process, I first split my data into a training and test set with the train_test_split function. Then I need to define the parameters I want to search over. Here SVC is the support vector machine, which by default uses the RBF kernel; there are two parameters associated with it, the regularization parameter C and the kernel bandwidth gamma. So now I specify a grid of values of C and gamma I want to try out: here I basically give an exponential range from 10^-3 to 10^2 for both C and gamma, and this defines the search space over which I want to adjust the parameters. To actually do the search, I instantiate a new GridSearchCV object, and it gets the model I want to tune and the parameter grid. I can also specify which metric to use, like AUC or accuracy or average precision, whatever you want, and how I want to do the cross-validation. The cool thing here is that this grid object behaves just like all the other models: this thing that does grid search for us has fit, predict, and score, exactly the same interface as everything else. Only now, if you call fit, it will run cross-validation, find the best parameters according to the cross-validation score with the metric I gave it, and then retrain the model with those best parameters on the whole training set. So after the grid search it holds the model with the best parameters, and this lets me make predictions. Again, you see this is very geared towards making predictions: it's very easy to search parameters and then make predictions with the tuned model on new data, which is pretty convenient, I think.
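A minimal sketch of cross_val_score as just described; the choice of logistic regression and the toy dataset are my own illustration, not from the talk:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)   # illustrative toy data

    # cv=5 -> five-fold stratified cross-validation for a classifier
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean(), scores.std())  # summarize the five hold-out scores

And a sketch of the grid search workflow. The exponential range for C and gamma (10^-3 to 10^2) follows the talk; the dataset and split are assumptions:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    param_grid = {'C': np.logspace(-3, 2, 6),      # 0.001 ... 100
                  'gamma': np.logspace(-3, 2, 6)}

    grid = GridSearchCV(SVC(), param_grid, cv=5)
    grid.fit(X_train, y_train)   # cross-validates every C/gamma combination,
                                 # then refits the best model on all of X_train
    print(grid.score(X_test, y_test))  # final, unbiased estimate on held-out data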
One thing I mentioned is that scikit-learn doesn't really do any preprocessing automatically, and one of the reasons for this is that you really want control over what is happening: how do you want to encode your variables, how do you want to impute data, and so on. More often than not, instead of taking your training data and your training labels and building a model directly, there are a lot of things happening in between, like extracting features from text data or images or whatever you have, rescaling your data in a way that makes sense for your data set, possibly doing automated feature selection, and so on, and then it all goes into your model. A very common mistake people make is to say, okay, then I'm going to do cross-validation on the model and tune the parameters of the model, and this tells me how good my model will be. However, if you do that, you're actually going to leak a lot of information through your preprocessing steps into the cross-validation. What you really need to do is make cross-validation the outermost loop and cross-validate your whole processing pipeline, from feature extraction, scaling, and feature selection through all the transformations you want to do; then do cross-validation over all of it, select the best model with the best parameters, and then rebuild your model.

To make this very simple, scikit-learn has a little thing called pipelines. Pipelines are a way to chain together transformations, call them T1 and T2, think imputation and PCA, or scaling and PCA, whatever you want, and a classifier. Similar to GridSearchCV, making a pipeline returns a new object that is again an estimator and has exactly the same interface as any other model. This pipe object just looks like a model again, only now it's a chain of, say, two transformations and a classifier. If I call fit on this model, it will fit the first transformation, transform the data using the first transformation, fit the second transformation, transform using the second transformation, and pass the transformed data on to the classifier. If I make a prediction on new data, it will do exactly the same transformations, but it will not refit them; it will just transform the data, making sure the test data gets exactly the same treatment as the training data. Encapsulating your whole processing pipeline like this makes it much more unlikely that you're leaking information from your test set, or that you're doing different steps on the training and test sets, because you have encapsulated everything in one coherent unit.

The cool thing about using pipelines is, well, it saves code, and it also lets you do cross-validation the way you would want to do it. For example, say I want to scale my data: the SVM in scikit-learn doesn't do scaling; I think by default the SVM wrapper in R does zero-mean, unit-variance scaling, and scikit-learn doesn't, so you have to do it yourself. Or you could say, well, this lets me choose more precisely how I want to preprocess my data. So I can make a pipeline: scale the data, then fit the SVM. If I want to do a grid search with this, I can just use this pipeline inside GridSearchCV. The only thing I need to adjust is that I need to tell GridSearchCV, which does the parameter search, which step inside the pipeline each parameter belongs to. There's this notation with the double underscore, which basically says: on the step that's called svc, tune the parameter C; on the step that's called svc, tune the parameter gamma. This allows you to do grid search and cross-validation over these complex pipelines.

And the cool thing is that you can not only do scaling and then grid search the parameters of the model, you can grid search the parameters of everything jointly. Preprocessing methods like feature selection usually also have parameters. For example, here I do feature selection, asking: should I use one, two, three, or four features? So I can search over how many features I should select together with the parameters C and gamma of the support vector machine, in one pipeline. The pipeline here is SelectKBest, which selects the k best features according to p-values, followed by the support vector machine, and now I grid search the best value for k in SelectKBest and the best values for C and gamma in the support vector machine, as shown in the sketch below. This lets me search all these things jointly, and they're all encapsulated in this single grid object.

All right, I think that's all I had today. I know that was really fast, but I only had 20 minutes, so now you have nearly all of scikit-learn. I have this book, which you already saw, and you can follow me on Twitter or something like that. [Applause]
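A minimal sketch of such a pipeline, scaling followed by an SVM; the dataset and split are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    pipe = make_pipeline(StandardScaler(), SVC())
    pipe.fit(X_train, y_train)   # fits the scaler on X_train, transforms, then fits the SVM
    pipe.score(X_test, y_test)   # applies the already-fitted scaler, then evaluates

And a sketch of the joint search over the number of features and the SVM parameters; make_pipeline names each step after its lowercased class, and the double underscore addresses a parameter of that step:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    pipe = make_pipeline(SelectKBest(), SVC())   # univariate feature selection, then SVM
    param_grid = {'selectkbest__k': [1, 2, 3, 4],
                  'svc__C': np.logspace(-3, 2, 6),
                  'svc__gamma': np.logspace(-3, 2, 6)}

    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(X_train, y_train)   # feature selection is refit inside each CV split,
                                 # so no information leaks from the validation folds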
Info
Channel: Lander Analytics
Views: 2,010
Rating: 5 out of 5
Id: juEOOQntrd0
Length: 19min 36sec (1176 seconds)
Published: Wed Aug 15 2018