Talks #7: Moez Ali: Machine learning with PyCaret

Captions
Abhishek: Okay, hello everyone, and welcome to Talks number seven. Today we have a very special guest, and many of you have used the library he has created, called PyCaret. He's Moez Ali, and he's currently based in Canada. I will leave the rest of the introduction to him, and I hope everyone enjoys the talk. There's a link to ask questions, so please ask your questions using that link. And now over to you, Moez.

Moez: Okay, just one confirmation: can you see my screen in full presentation mode? All right, perfect. Thank you everybody for joining us today. My name is Moez Ali. I'm a data analytics leader, a passionate data scientist by night, and a very active open source contributor; the most recent project I've done is PyCaret, which took almost a year in the making. By background I'm a chartered accountant, a member of CPA/CMA Canada and ACMA UK, and in the last ten years I've lived and worked on four different continents, mostly in reporting and analytics roles. These days I'm in lockdown, currently based in Toronto, Canada, working for PwC Canada. I'm very active on social media, primarily LinkedIn and Twitter, so if you would like to connect with me, say hi there.

Some important links for this presentation: our website, GitHub, LinkedIn, YouTube, and Medium. The most important link is the one highlighted in yellow: the presentation, along with all the demos and files, is uploaded to that GitHub repository, so if you'd like to follow along with the demo part, or just refer to it later, you can clone the repo. If you'd like more updates about PyCaret, follow the hashtag #pycaret on LinkedIn and Twitter; we continuously post new content and tutorials.

The agenda for today: although some of you might already know PyCaret and have used it before, I'll start by introducing it for people who don't, and then show a small demo of PyCaret 1.0. For the remaining part of the talk I will primarily focus on the new features coming in PyCaret 2.0, which is approximately two to three weeks away, so in three weeks you will have PyCaret 2.0. I'll spend most of today discussing those new features and showing you the demo, and towards the end of the session we can take Q&A. I hope it's worth your time.

So what is PyCaret? PyCaret is an open source, low-code machine learning library in Python, and its primary objective is to save you time, whether you come from a non-computer-science background and need something less code-intensive, or you come from a computer science background and just want to spend less time writing and managing code. It's ideal for rapid prototyping. And, knowing the audience of Abhishek's channel: it is not going to win Kaggle competitions for you. In fact, no AutoML solution is going to do that, because there is a limit to what can be automated. There are two parts to data science, the art and the science, and we can only automate the science. To win competitions, or to build really strong models, you need to introduce business context into those models through feature engineering and the other work you do. So keep that in mind: it's not a
magic solution. As you'll realize after seeing the demo, it goes only as far as saving you time; it's a productivity tool. Whenever I talk about PyCaret, three characteristics come to mind first, and I'm not only the developer of the library, I also use it in my day-to-day work. First, it's easy to use, and because it's so easy to use the learning curve is very limited. Every day hundreds of new tools are introduced and you can only go so far in learning everything, but with PyCaret there is almost no learning curve. Second, I consider it a productivity tool: there's no magic, but it saves a hell of a lot of time, as you'll see in the demo. Third, I call it business-ready, because the way the PyCaret API is designed comes naturally to people who are not from a coding background. PyCaret is a functionally written library, as opposed to object-oriented, so if you're transitioning from R you will feel at home using PyCaret in Python. The original inspiration actually comes from Max Kuhn's work in R, the caret package.

We did two small experiments with PyCaret. In the first graph, the x-axis shows the stages of a machine learning experiment, starting with initializing the experiment, doing EDA, preprocessing, and so on, and the y-axis shows the number of lines of code you have to write, so it's like a journey. The two lines represent two scenarios: the blue line is our base case, code written using scikit-learn, and the red line is code written using PyCaret. You can see that by the time you finish the experiment, PyCaret took 23 lines to achieve what would have taken about 170 lines in the base library, scikit-learn. The graph on the right is the same idea but in terms of time: on the x-axis you see units one through five. We picked ten seasoned data scientists and ran an experiment where we designed a task list and arranged the tasks sequentially into units one to five, so unit five is the hardest unit with the most tasks, and the y-axis is the percentage of time taken when using PyCaret. The three lines represent three flavors: supervised, unsupervised, and NLP. On the supervised line, the unit-five tasks, which included ensembling, hyperparameter tuning, and so on, took eight percent of the time they would otherwise have taken using other libraries. So it's a huge time saver in that sense.

Some facts about PyCaret: it started last summer, when I was going into my final year of grad school and realized the need, because there is a shift happening: over 50 percent of people now entering data science come from non-quantitative backgrounds, without computer science or statistics, and PyCaret is one small step to make it easy for analysts and people from those backgrounds to adopt data science. It's completely written in Python; there are no wrappers at the moment, but things look promising and I hope we will have wrappers for different
languages in the future; there's nothing as of today, though. It took approximately one year to build PyCaret 1.0, and we are still actively developing the library. There are many contributors; it's self-funded so far and supported by a team of brilliant people. Because it's an open source project we always need contributors, so if you'd like to contribute, not just code but also documentation, content creation, designing unit test cases, and so on, please reach out to me or fill in the form at pycaret.org/contribute. It's a community effort and we can improve it over time.

In today's talk we can only cover so much; we just have one hour. But there are a lot of resources on PyCaret, most of them targeted at somebody just starting out, so mostly beginner-centric. If there are parts you haven't done before, these tutorials are ideal, for example "Deploy a machine learning pipeline on Google Kubernetes Engine" or "Implement clustering in Power BI using PyCaret". A good thing about PyCaret is that we designed it with the broad analytics ecosystem in mind, not just data science: there are tools like Tableau, Power BI, and Qlik Sense being used in organizations today, and PyCaret integrates with all of them, including SQL, so if you want to ship your linear regression code into SQL Server, you can do that with PyCaret. These tutorials are hardly 10-to-15-minute reads and they are very instructional; they start from installation, so even with no background I encourage you to go through them. There are many, and I'll leave them in the slides so you can refer to them later.

This brings us to the demo part of PyCaret 1.0. If you're planning to follow along today, this is the GitHub repository you should clone, and if you want to follow along on Google Colab, go to pycaret.org/demo, where there are links to Colab notebooks you can run as a playground. To install PyCaret, it's on pip, so you can simply do pip install pycaret; it takes 10 to 15 minutes to install all the dependencies. If you are on a Mac you might have trouble installing LightGBM, Microsoft's open source boosting framework; if so, follow the link we provide with instructions on building LightGBM from source on your Mac. I'll give 30 seconds here for people who want to follow along; the link is github.com/pycaret/pycaret-demo-at.

All right, let's get into the demo for PyCaret 1.0. If you are following along, open the pycaret-classification-demo notebook; there are a couple of notebooks, but for now I'll walk you through this one. The first thing I'm doing is checking my version, because I also have a 2.0 demo in a different conda environment that I'll show you afterwards. This one is 1.0, so if you have installed 1.0 and run this cell, it should show version 1.0, just to be careful.
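As a hedged sketch of that check, assuming the version helper that was available around these releases (the exact version string will vary by environment):

    # Install PyCaret 1.0 (run once per environment):
    #   pip install pycaret
    # The 2.0 demo later in the talk assumes a separate conda environment
    # with the nightly build:  pip install pycaret-nightly
    from pycaret.utils import version

    version()   # should print 1.0.x in the 1.0 environment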
I'm using a toy dataset. PyCaret has a module called datasets, with a couple of datasets hosted on our GitHub repository, because we don't want to download every dataset to your computer when you install PyCaret; obviously, to use this function you need an internet connection. I do, so I'm reading a dataset called juice. This is the resulting data frame, but if you want to see all the available datasets, just fetch the index and it will list every dataset on our GitHub repository that you can call; in this case I'm using dataset number 14, juice.

So what is this dataset? It's a binary classification problem where each row is a transaction in a retail store. The target variable is Purchase, which is either CH or MM, two different brands of juice, and based on factors like the price of the CH juice, the price of the MM juice, discounts, loyalty factors, the store number, and so on, we have to predict whether the customer will buy CH or MM. If you study the dataset, you'll realize that to build even a simple model on it, say logistic regression, there are a couple of things you need to do first: encode the target variable into binary one or zero, drop the Id column, one-hot encode categorical columns like Store7, and, if you are using linear algorithms, remove perfectly collinear features. So you have to do a couple of preprocessing steps before you actually start fitting models.
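A minimal sketch of that data loading (the dataset name and the index listing are as shown in the demo; the juice data follows the classic OJ dataset layout):

    from pycaret.datasets import get_data

    # List every dataset hosted on PyCaret's GitHub repository
    # (requires an internet connection, as noted above).
    index = get_data('index')

    # Load the juice dataset: one row per retail transaction,
    # with target column 'Purchase' taking the values CH or MM.
    data = get_data('juice')
    print(data.shape)   # roughly 1,070 rows in the classic OJ data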
Now that we understand the use case, let's see how you work with PyCaret. PyCaret is functionally typed, as opposed to how you're used to working with objects in scikit-learn, but the objects PyCaret returns are still scikit-learn objects, so you can use all the base methods and attributes if you want; it's just that when you code in PyCaret you write it functionally rather than in OOP style. PyCaret has six modules: classification, regression, clustering, anomaly detection, NLP, and association rule mining, and the good part is that all of them work exactly the same way. When I keep saying there is no learning curve, that is why: our design is intentionally very consistent, so you don't have to copy code or refer back to your notebooks. It reads like natural English; we want people to just start writing rather than wasting time copying code or troubleshooting it.

All experiments in PyCaret, supervised or unsupervised, start by initializing the environment with a function called setup. Before I talk about that function, one note on this line, from pycaret.classification import star: I'm sure many people wouldn't be happy with a star import, because you want to call specific functions you can trace back, but I don't find anything particularly wrong with it in a notebook environment where you are just experimenting, because you'll be using a bunch of these functions. If you don't like it, don't feel obligated: you can import individual functions instead. What I usually do is import everything when experimenting in a notebook, but when promoting code to production I point to specific functions so that someone can trace them back; that's just good practice.

For now I'm going to import all the functions and initialize my environment by passing my data variable, the data frame stored in data, and the name of the target variable, Purchase; I'm not doing any preprocessing. As soon as you run this, PyCaret goes through the data frame and infers all the data types. In this case it has correctly detected the Id column, and as soon as an Id column is detected it is dropped for the purpose of training. If you have a date column, it is likewise dropped for training, not removed from the dataset, and a couple of features are extracted from it. The remaining columns are inferred as numeric or categorical; 85 to 95 percent of the time the inference will be right, but if it isn't, you can always override the types through setup. Assuming everything here is correctly identified, I press Enter, which means go ahead, and it says setup successfully completed and prints an information grid with a lot of meaningful detail. First, the target type: binary, not multiclass. Then the label encoding: CH is converted to 0 and MM to 1, in case you want to map predictions back later. Then the size of the original data, that there were no missing values, and the counts of numeric and categorical features. The train/test split is also handled in setup: by default it's 70/30, but you can change that, and you can see the dataset of 1,070 rows was split into 748 and 322, a train set and a test set. All the modeling you do in PyCaret happens on the train set, inside a cross-validation environment created on the training set only; the test set stays completely out of the system, as a final check for whether you've gone wrong in terms of overfitting or underfitting.
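A hedged sketch of that initialization; the commented switches preview the optional preprocessing parameters discussed next and were not used in this part of the demo:

    from pycaret.classification import *

    # Minimal call: pass the data frame and name the target column.
    # setup() infers column types, drops the Id column, label-encodes the
    # target (CH -> 0, MM -> 1), and makes a 70/30 train/test split.
    clf = setup(data, target='Purchase')

    # The same call with a few of the optional switches (all off by default):
    # clf = setup(data, target='Purchase',
    #             normalize=True,
    #             transformation=True,           # Yeo-Johnson style transform
    #             remove_multicollinearity=True,
    #             multicollinearity_threshold=0.95)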
The real power is in this setup function. It has many parameters, essentially switches you set to True or False, and in some cases methods you can choose, covering roughly 20 to 25 preprocessing options: dealing with high-cardinality variables, normalizing your data, applying non-linear transformations such as Yeo-Johnson, handling unknown categorical levels on the test set, doing PCA with different methods, ignoring low-variance features, combining rare levels, removing perfect collinearity, removing multicollinearity among variables above a threshold you define, creating new features with unsupervised clustering, and creating polynomial and trigonometric features. You just define all of that inside setup. At this point I'm not doing any preprocessing; I'm only passing my dataset and target variable. There are certain things that are imperative for model training, and PyCaret does them by default: if you have missing values, it imputes them, using the mean for numeric values (the method is changeable in setup) and the most frequent value for categorical variables (also changeable), and if one-hot encoding is needed for categorical features it does that too, with a few alternative encoding options available. In PyCaret 2.0, as we'll see, if you have an imbalanced dataset in binary classification, you can also fix that imbalance in setup.

Once setup has completed successfully, our recommended first step is the function compare_models. It goes into the model library, imports all the untrained classifiers, and fits them using 10-fold stratified cross-validation for classification (plain, non-stratified k-fold for regression), evaluating six commonly used classification metrics; in 2.0, after many requests, we have also added MCC, Matthews correlation coefficient, making seven. These numbers mean, for instance, that logistic regression on 10-fold cross-validation on the training set scored 0.83 accuracy, 0.89 AUC, and so on, and the grid highlights the best-performing model for each metric. In this case logistic regression is doing well across the board, except for recall, which is highest for QDA, though QDA is not good in terms of AUC. That's not the point, though: the point is that it trains all these models and makes creating baseline models easy. This is how you always start.

There are parameters in compare_models too: to change the folds from 10 to 5, use the fold parameter; by default the table is sorted by accuracy, but you can sort by recall or anything else; and you can blacklist certain models from running, which matters especially in regression, where some estimators take a very long time on high-dimensional data, so you can block them. In 2.0 we have also added a whitelist parameter, so if you want to run only a specific five models, you can pass a whitelist; you can either block certain models or run only certain ones.
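A sketch of those calls; the blacklist/whitelist parameter names match the 2.0-era API described in the talk (they were renamed in later releases), so treat them as tied to this version:

    # Baseline every classifier in the library with 10-fold stratified CV
    # (in 1.0 this displays the comparison grid rather than returning models).
    compare_models()

    # Fewer folds, sorted by recall instead of accuracy:
    compare_models(fold=5, sort='Recall')

    # Block slow models, or (2.0) run only a chosen few:
    compare_models(blacklist=['catboost'])
    compare_models(whitelist=['lr', 'lda', 'gbc'])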
Once you have your baseline grid, create_model is the most granular function in PyCaret; it's used externally and also internally by many other functions. What create_model does is fit a single model, whatever model you ask for, using 10-fold cross-validation by default, and evaluate the same metrics, but this time you see the results by fold. Let me run create model 'lr', meaning create a logistic regression model: you get a table of metrics for each of the ten folds plus the mean and standard deviation, so this 0.8995 is the same as the averaged number in the comparison grid; the difference is that one is the average and this is by fold. If for some reason you want five folds instead, that is how you move around. There are 15 or 16 estimators in the library, from scikit-learn and beyond; the idea is not just to wrap sklearn but to build something flexible enough to run whatever model you want, and we'll see that in 2.0. We are not writing any algorithms from scratch; it's not about the algorithms for us.

So I created a logistic regression; to create a decision tree it's 'dt', very intuitive. Just like create_model there is a function called tune_model. Creating or comparing models uses default hyperparameters, but tune_model has predefined search spaces in PyCaret and iterates over them with a random search strategy. In this case the decision tree with default parameters was doing around 0.74 to 0.75 AUC; when I tuned it, it bumped up to 0.86, and if I show the difference, the default decision tree has no max depth, and you can see which parameters changed. Again, in 1.0 you can't pass your own parameters; it uses the predefined search space. One big thing in 2.0 is that you can pass your own custom grid, or generate one somehow and run it in a loop, which gives you more flexibility. By default, classification tuning optimizes accuracy, but if you want to optimize AUC or recall, just pass optimize equals 'AUC'. You can see all of it is written functionally, quite different from how you would usually do it. So that was the parameter that caused the increase to 85 percent; similarly, Naive Bayes was doing 84.4 by default, and again the difference is default parameters versus tuning with a random search over the predefined grid.
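A sketch of that workflow; the metric numbers above are from the talk, and the calls reflect the 1.0 API (where tune_model takes the model's string abbreviation; passing the trained object is the 2.0 behavior shown later):

    # Fit single models with per-fold CV metric tables.
    lr = create_model('lr')          # logistic regression, 10-fold default
    dt = create_model('dt', fold=5)  # decision tree, 5-fold instead

    # Tune over PyCaret's predefined search space with random search,
    # optimizing AUC instead of the default accuracy.
    tuned_dt = tune_model('dt', optimize='AUC')   # 1.0: pass the string
    # tuned_dt = tune_model(dt, optimize='AUC')   # 2.0: pass the model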
Just like create and tune, there's a function called ensemble_model, which ensembles a model: you create a model, a decision tree in this case, and pass it in. What it does is take your trained decision tree and wrap it in a bagging classifier (or bagging regressor, for regression), so the result here is a bagging classifier. By default it takes the decision tree and fits it ten times with the bagging method, but you can change the number of estimators with the n_estimators parameter; if you want, say, 25 trees, just pass n_estimators equals 25. Similarly, there's another way of ensembling, boosting: pass method equals 'Boosting' and it takes your decision tree and wraps it in AdaBoost, a different ensembling method.

There's another function called blend_models, to which you pass a list of trained models via the estimator_list parameter, and it creates a voting classifier (or regressor, depending on what you're doing) from those base models. In this case I'm creating lr, logistic regression; lda, linear discriminant analysis; and gbc, gradient boosting classifier, with verbose set to False so they don't flood my screen, and then simply passing the list to the blender: 90 percent. That's not the point, though; the point is that these are the estimators inside the blend.

Another useful function in PyCaret is plot_model, and there are quite a few built-in plots that help you analyze performance. Here I'm passing my blender, the model I just created, and by default it shows the AUC curve on your test data, but the documentation lists the full set; let me quickly run a few: confusion matrix, threshold, precision-recall, validation curve, decision boundary. I hope this gives you an idea of the effort and time you save by simply using this. And if you don't even want to call the individual plots, there's a function evaluate_model: run it and it prints a nice little widget (it works in notebook environments only) from which you can browse all the plots instead of calling them one by one.

There is a package in Python called SHAP, which implements Shapley values; if you care about model explainability, it's a very useful package. It only works on tree-based or kernel-based models, but let me show an example with XGBoost, a tree-based model: I create the XGBoost model and pass it to interpret_model. By default it gives you the SHAP summary plot, and if you don't know SHAP values you might not understand it immediately; there's a bit of a learning curve even to read this plot, because it has the x-axis, the variables, and colors representing feature values. If it's unfamiliar, I encourage you to read about SHAP values; it's a very useful view of feature importance, and different from the usual approach of using coefficient weights. You can also do a correlation plot.
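A sketch of the ensembling and analysis calls just described, under the same 1.0-era API:

    # Bagging: wrap a trained decision tree in a bagging classifier.
    dt = create_model('dt')
    bagged_dt = ensemble_model(dt, n_estimators=25)

    # Boosting: wrap the same tree in AdaBoost instead.
    boosted_dt = ensemble_model(dt, method='Boosting')

    # Voting-classifier blend of three base models.
    lr = create_model('lr', verbose=False)
    lda = create_model('lda', verbose=False)
    gbc = create_model('gbc', verbose=False)
    blender = blend_models(estimator_list=[lr, lda, gbc])

    # Analyze performance on the hold-out data.
    plot_model(blender)                           # AUC curve (default)
    plot_model(blender, plot='confusion_matrix')
    evaluate_model(blender)                       # interactive widget, notebook only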
Or you can check a reason plot: here I'm passing the same model with plot equals 'reason' and observation equals 1, which means: show me the features that influence the prediction of observation one, a single record in my test set. You can see the features in blue pushing towards the lower side while the features in red push the other way, and these are additive, so if you add them up you arrive at this value. Very useful; again, if you don't know SHAP values, I encourage you to read about them. There's also the interaction-effects plot; if you understand it well, it's very useful for analyzing interaction effects in your model.

The last function before we look at 2.0 is deploy_model, which is very useful: it takes your model and ships it to AWS S3. Let me run this one line: model successfully deployed. What happened in the background: if I go to my S3 account, this is my Amazon portal, I have a bucket here, and this is the model we just uploaded, a pickle file. And what is in that file? If I save the model with save_model, call it xgboost-abc, load it back, and print it, you can see it is not just the model but the entire pipeline, with the model as the last step. When you deploy or save, PyCaret takes the pipeline you constructed when you defined setup. Normally, without this, you would perform those steps yourself and then have to manage all the dependencies: if you imputed values first and then created interaction features, you have to repeat that in the same order, orchestrated the same way, on the new dataset, otherwise you'll make wrong predictions or no predictions at all. PyCaret does this automatically: whatever you chose in setup, it orchestrates those dependencies into a sequential pipeline, and you can just use predict_model, a native PyCaret function. Note that when you create a model, say an lr, the returned lr is a scikit-learn object, so if you want, you can call its methods directly; but what we suggest is using predict_model, so your data flows through the entire pipeline. If you just want to take the raw object and ship it to another environment, you can do that too.
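A hedged sketch of the save/deploy/predict calls just described; the bucket and model names are placeholders, not the ones from the talk, and AWS credentials must already be configured for deploy_model:

    # Assume the XGBoost model from the interpretation demo.
    xgb = create_model('xgboost')

    # Persist the whole preprocessing + model pipeline as a pickle.
    save_model(xgb, 'xgboost_abc')
    loaded = load_model('xgboost_abc')   # returns the full pipeline

    # Ship the pipeline to an S3 bucket.
    deploy_model(xgb, model_name='xgboost_aws',
                 platform='aws',
                 authentication={'bucket': 'my-pycaret-models'})

    # Score unseen rows through the entire pipeline, not just the estimator;
    # new_data here is a placeholder built from the demo data.
    new_data = data.drop('Purchase', axis=1).tail(10)
    predictions = predict_model(loaded, data=new_data)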
Going back to the presentation: let's now talk about some new features in PyCaret 2.0. You have seen PyCaret 1.0, and it was more of an in-person modeling experience: when you used compare_models you saw baseline models, and based on that you knew, okay, logistic regression, LDA, and QDA were my top three candidates, let's tune their hyperparameters, or ensemble them, or blend or stack them. But you, as a human, were still involved; you had to eyeball the numbers. So even though 1.0 automates a lot, it is not completely automated, because a human has to take part in model selection; compare_models wasn't returning anything, it was just presenting baselines. One big change that connects the dots and closes this loop in 2.0 is the new behavior of compare_models and tune_model, together with a new function called automl; and, as I mentioned, there is much more flexibility in 2.0 than in 1.0. Instead of just talking, let me open a notebook and show you.

This is the 2.0 notebook, PyCaret 2.0 new features, and notice I am in a different environment with a nightly version installed, so instead of 1.0 the version shows the nightly build. If you're interested in trying it, you can pip install pycaret-nightly; once this is completed, in two or three weeks, we'll ship it as PyCaret 2.0. Let me use the same dataset, and this time I'm doing the same thing but with a few new arguments, log_experiment and experiment_name; I'll explain those shortly, but for now let me just run it and press Enter. The information grid is a little bigger this time: you can see we've added fix_imbalance and fix_imbalance_method; it's False by default, but you can set it to True if you have imbalanced data.

The number one difference: when you run compare_models now, notice that I'm storing the result in a variable called best_model. The reason is that by default it not only prepares the grid and shows you the baseline models, it also takes the top model from the grid, trained, and returns it, stored here in the best_model variable. By default it returns only the number one model, but if you want more, say the top five candidates, you can pass the n_select parameter and get the top three, top five, bottom three, whatever you want. If you think about it, this closes a loop, because if you can always get the best model out of the whole library, you can automate the process: you can write your own AutoML script, because with the top n models you can build a workflow, pass them to a blender or a stacker, run a loop to tune them, run a loop to ensemble them, keep checking the results, and at the end you get the best model. So it's much more automated in that sense. I'll give it a minute to complete. Now, if you look at best_model, it's an LDA model, because LDA is doing best right now. If instead you had passed sort equals 'Recall', because all you care about is recall, you would have gotten Naive Bayes, since Naive Bayes is best on recall. In the interest of time I won't run that, or the next cell, but essentially this gives you the top model, and with n_select equals 5 you get the top five. And this is how you run only specific models: here I want just decision tree, random forest, XGBoost, and LightGBM, so I pass a whitelist, and it quickly trains only those four and returns the grid; the winner, if you check, is LightGBM, because LightGBM was best of the four.
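A sketch of the changed compare_models behavior, with parameter names as used in the 2.0 nightly shown in the demo (whitelist was renamed in later releases):

    # 2.0: compare_models returns the trained top model...
    best_model = compare_models()

    # ...or the top N, ready to feed into blenders, stackers, or tuning loops.
    top5 = compare_models(n_select=5)

    # Rank by recall instead of accuracy.
    best_by_recall = compare_models(sort='Recall')

    # Train only a chosen subset of the model library.
    best_of_four = compare_models(whitelist=['dt', 'rf', 'xgboost', 'lightgbm'])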
The behavior of tune_model has also changed: previously you passed a string when tuning a model; it wouldn't take your trained object. Now you can pass the model itself and build loops around it. In this case I've done tune_model of best_model: whatever the best model was, it doesn't matter which, it gets tuned, so it's much more dynamic now. My LDA was doing 80.21 in precision, and it improved after tuning.

I mentioned earlier that 2.0 is much more flexible for developers, because we want people to build things on top of it. You are no longer limited to the models in the model library; you can use PyCaret as an environment for all your modeling, and train any model as long as it fulfills the fit/predict API style of scikit-learn, which I believe 99 percent of them do. In this case I'm importing SymbolicClassifier, genetic programming from the gplearn library, and passing it into create_model, and it works exactly the same way as any pre-built model, so all of PyCaret's functionality is available. Here it comes from gplearn, but it could also be a Python file you wrote yourself; as long as it's a class that follows the fit/predict API, you can use PyCaret to analyze it through plot_model, deploy it on AWS, evaluate its cross-validation results, or promote it to production with predict_model. I'm going to stop it in the interest of time and skip ahead.

Another flexibility in 2.0: here I'm creating a LightGBM model, and these are all its default parameters. In 1.0 there's no flexibility if you want to change a parameter: you either use the defaults or whatever is in the predefined tuning grid. In 2.0, what we think is really valuable is that you can pass any parameter, as long as it exists in that particular algorithm. Here I'm writing a simple loop to iterate over the learning rate of LightGBM: for i in np.arange from 0.1 to 1 in increments of 0.1, create a LightGBM with learning_rate equal to i and verbose equal to False, and append it to a list called lgbms. This simple code fits LightGBM models with different learning rates. For now it's just a static range, but in practice you would have some dynamic list or variables, a grid you zoom in on or search through with some logic of your own. That's the flexibility. Just ten more seconds, and if I look at lgbms now, these are all LightGBM models with learning rate 0.1, 0.2, 0.3, and so on. Essentially you write your own loops and create whatever you want; PyCaret is just giving you the environment to do it.
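A sketch of both flexibilities, assuming gplearn is installed; note np.arange(0.1, 1, 0.1) yields nine learning rates:

    import numpy as np
    from gplearn.genetic import SymbolicClassifier

    # Any estimator with scikit-learn's fit/predict API can go through
    # create_model, and then plot_model, deploy_model, predict_model, etc.
    sc = create_model(SymbolicClassifier())

    # 2.0: pass algorithm-specific keyword arguments straight through;
    # here, sweep LightGBM's learning rate in a plain Python loop.
    lgbms = []
    for i in np.arange(0.1, 1, 0.1):
        lgbms.append(create_model('lightgbm', learning_rate=i, verbose=False))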
Now there's another function called automl, and it's just automl, no parameters, nothing. Whatever you do in PyCaret generates a blueprint behind the scenes. For now we are in a notebook, so we can see the numbers and take decisions as we go, but if you are running this as an unattended script, there's no eyeballing; your model selection process is simply that you want the best model, which is what AutoML tools do. When you run this automl function at the end of your script, it goes into the background, takes the blueprint of everything you've done today, whether sitting at the computer or running a Python script on the command line, scans the environment, and returns the best model. In this case I'm saying give me the best model, and the best model was this one. You can also ask it to return the best model using the hold-out set: remember, 30 percent of the dataset was kept aside as the test, or hold-out, data and wasn't used. If you do that, it goes back into the environment, takes all the models you created today, predicts them on the hold-out, and, based on whatever metric you define in the optimize parameter, selects and returns the best model. So think about it: you write this long script in PyCaret, and at the end, if you just write automl, it fetches the best model for you. That's it; it's not retraining or doing anything else, so it's very quick in that sense.

The other exciting part of 2.0 is that whatever you do in PyCaret gets logged, and we are using an MLflow backend. At the end of your script, if you run mlflow ui (in a notebook you have to prefix it with an exclamation mark), it starts an MLflow server on your local computer. And remember those arguments, log_experiment equals True and experiment_name: by default nothing is logged, because we don't want to increase your processing time without you knowing, but if you do want logging, which is really helpful, you use those parameters. Once the server is running you can go to localhost:5000 and you get the MLflow UI; you can see the experiment, pycaret2-juice, was created, and it gives you this very functional, very powerful UI. It's like a leaderboard for AutoML, but genuinely functional: you can see all the metrics, training times, tags, and parameters, and you can export to CSV. If we go into one of the models, there are the parameters and metrics; a few identifiers that are helpful if you are using the APIs; the entire training pipeline stored along with the model; the table you see in the Jupyter notebook, stored as HTML; and the hold-out results. MLflow is a really good, handy tool, so you can do things like: let me compare all those LightGBMs, remember the loop over learning rate; select them, compare them, put learning rate on the x-axis, and see the impact on AUC. You can see the decline. Very powerful.
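A sketch of the logging and automl flow; the experiment name matches the one shown in the demo UI, and everything else is standard PyCaret 2.0 / MLflow usage:

    # Opt in to MLflow logging at setup time (off by default).
    clf = setup(data, target='Purchase',
                log_experiment=True, experiment_name='pycaret2-juice')

    # ... compare_models / create_model / tune_model calls, all logged ...

    # Pick the best model from everything trained in this session,
    # scored on the 30% hold-out by the chosen metric. No retraining.
    best = automl(optimize='AUC', use_holdout=True)

    # Then inspect the runs at http://localhost:5000 after launching:
    #   mlflow ui     (from a shell)
    #   !mlflow ui    (from a notebook cell)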
MLflow is natively integrated with Databricks. Here's an example: this is my free Databricks community account, and in Databricks I performed exactly the same experiment; you can see it's the same thing we've just seen, except that now my username appears as well, because I'm on a Databricks cloud server. So whether you're using Azure Databricks or the Databricks community edition, it works, and it's really helpful. A few other things you can do: along with log_experiment, if you also pass log_plots equals True, and I rerun it and create an lr model, then back on my server I have a logistic regression run with this timestamp, and alongside the parameters, metrics, trained model, and hold-out results, I now also have the AUC plot and the confusion matrix stored, whichever matter for you. At the moment the MLflow backend is on my computer, but in practice the way you'd use it is: ten people doing modeling on a team, possibly across multiple countries, sharing a common backend, and that's how it becomes a very good collaboration tool. You can create your own tags and use it as a collaboration tool between data scientists and the data engineering team, who can simply pick the models from this location and deploy them, or do A/B testing with different models. It's very useful in that sense.

I think we're left with five minutes, so let me cover two more points before we open up for Q&A. With PyCaret 1.0, one big problem was that it was optimized for notebook environments: if you wanted to use Spyder or the Python command line, it wouldn't work, because we were using HTML functionality only available in notebooks. After the 1.0 release I got a lot of requests for non-notebook support, so in 2.0 you can run a PyCaret script just like any Python script, on the command line or in any environment. For example, I've created a CLI script that takes two arguments, the dataset and the target variable name, and runs the same code; the only difference is that I'm passing html equals False, so it doesn't rely on notebook functionality. Then I run compare_models to return the top five models, blend them, tune the best model, call automl, and save the best model; if I want, I can also deploy it to AWS. If I run it on my command line like any Python script... okay, something is messed up, but this would work on the command line and execute like any normal script.

The way I use this: if you are on Kaggle kernels or any remote execution environment and you can't access the MLflow server, there is also a function, get_logs. When you call it with save equals True, it saves the log file, essentially one giant grid with all the parameters of all the models and all their metrics. Very useful and handy.
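A minimal sketch of such a command-line script; the sys.argv handling is an assumption for illustration, and silent=True is used here to skip the interactive dtype confirmation in an unattended run:

    # cli_demo.py -- run as: python cli_demo.py juice Purchase
    import sys
    from pycaret.datasets import get_data
    from pycaret.classification import (setup, compare_models, blend_models,
                                        tune_model, automl, save_model, get_logs)

    dataset, target = sys.argv[1], sys.argv[2]

    data = get_data(dataset)
    # html=False drops the notebook-only display; silent=True skips the
    # interactive confirmation of inferred data types.
    setup(data, target=target, html=False, silent=True,
          log_experiment=True, experiment_name=dataset + '-cli')

    top5 = compare_models(n_select=5)        # top five baseline models
    blender = blend_models(estimator_list=top5)
    tuned = tune_model(top5[0])
    best = automl(optimize='AUC')

    save_model(best, 'best_model')
    get_logs(save=True)                      # dump the experiment log to a file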
It becomes even more helpful when you're running on Kaggle: for example, in this Kaggle notebook, at the end I'm just calling get_logs with save equals True, and when the execution completes I come back and collect my CSV file from the output. Similarly, you can create a GitHub Action: here I've built one using an action we've published for our library; it spins up an Ubuntu machine, runs the entire AutoML, and at the end you get three things back: the best model, the experiment logs, which is essentially the CSV file, and the system logs. If you plan to develop things on top of PyCaret, the system logs are really helpful: they show each step as the script executes in the background, because a functionally written package is otherwise very difficult to troubleshoot, and these logs help with that. I think with this we hit 1 pm, so I'm open to questions now.

Abhishek: Great talk, Moez, really great talk. I'm also a fan of AutoML, so it was very interesting for me. Frankly, I haven't used PyCaret yet, but I'm planning to do so now. There are many questions, so let's see if we can take a few, because we're a little over time. The most popular question is: how much time should a newbie invest in the data science life cycle when we have so many automated libraries available in the market?

Moez: It's a very good, very generic question. Data science is a glorified computer science problem, a software development problem as well, but if you're just starting out, I'd say you shouldn't be using all the AutoML tools and fancy things on day one. If you do, you'll still achieve the output, but later in your career, if you have no background in how things work, that will be a problem. So if you're just starting out, my advice is to get basic knowledge of how the math behind decision trees and linear regression works, and then start implementing with automated tools; it doesn't matter which you use, as long as you're solving a business problem. But don't jump straight into AutoML, because if you don't understand something as basic as linear regression, that will be a problem going forward.

Abhishek: True. Another question: do you have any plans for extensions, if somebody wants to extend it or add new features?

Moez: One very big objective was to design PyCaret so it can be extended easily. And with 2.0, as you noticed, you don't need an algorithm to be in PyCaret's library to use it; you can import it from any other library and just use it. As a library or piece of software grows, you get a lot of governance and control issues; for example, to include an algorithm in scikit-learn today, it has to be a certain number of years old, or
have a certain number of citations; you have to meet that bar. We don't want to be in the business of certifying algorithms. What we're going to do is provide a platform where people can write scripts outside PyCaret and just bring them in, as long as they satisfy the fit/predict condition.

Abhishek: True, so a kind of generic approach. I think this next question has already been answered; it was whether you're planning to add oversampling and undersampling approaches like SMOTE, and you've covered that. Another question: all of the experiments are running on CPU, right? Do you have any plans to make it run on a GPU, or multiple GPUs?

Moez: If the algorithm supports GPU, for example CatBoost, you can pass the GPU parameters or set the environment variables; but if an algorithm doesn't support GPU, we're not rewriting algorithms from scratch to support it. So if a model natively supports GPU, like CatBoost, you can use it.

Abhishek: A similar question, well, related to an earlier one: what about custom metrics?

Moez: At this point there are no custom metrics, but it's high on our priority list; we've had a lot of requests to let people pass their own custom scorers, so it will show up soon.

Abhishek: And are you using the default parameters for the scikit-learn models?

Moez: Yes, when you just call create_model or compare_models, it uses the default parameters.

Abhishek: There are some questions about dataset size: what do you think will work on large datasets, for example more than 100,000 samples?

Moez: That's where your own knowledge helps you. Four days ago I ran a multiclass problem on a poker dataset and it took 4 hours and 30 minutes on a GitHub Ubuntu machine, and when I investigated why compare_models took that long, I found that 4 of those 4.5 hours were taken by KNN. If you're aware of such situations, you can blacklist certain models; but obviously, if you're dealing with millions of rows and you want to run everything, it will take time, the same amount of time it would take using scikit-learn or anything else directly. PyCaret isn't doing distributed processing for you, and it isn't improving anything on top of scikit-learn in that respect: if the models in scikit-learn support multiprocessing, it uses multiprocessing; otherwise it doesn't.

Abhishek: Right. And what is the relevance of session_id?

Moez: I didn't cover that, but session_id is a seed, for when you want to reproduce your experiments, which you always do. In an experiment there are many places where a random state has to be defined, starting from the train/test split, then in the ensembling methods; there are random states everywhere. The scikit-learn website notes that setting a seed in your environment doesn't guarantee reproducibility, because scikit-learn uses the NumPy randomizer and, due to joblib, unless you pass random_state into each function there is no guarantee. What we do with session_id is take that number and distribute it to each individual function; in a normal experiment there are over 20 to 25 places where you would have to take care of random states, and we do that for you. So it's for reproducibility.
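A one-line sketch of that; the seed value is illustrative:

    # One seed, distributed to the train/test split, CV folds, ensembling,
    # and every other call that accepts a random_state: rerunning with the
    # same session_id reproduces the experiment.
    clf = setup(data, target='Purchase', session_id=123)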
Abhishek: Okay, so just a different name for a seed. And some people have been asking: do you have any plans to include time series?

Moez: Yes, actually. We haven't started reviewing it yet, but somebody from the community has already built an awesome time series module, so in a couple of months, maybe, we'll have time series.

Abhishek: There are many more questions. Is it possible to capture the steps as a recipe, then save and load them, so we can replay them or share them with others?

Moez: Yes, that's the whole idea: you write your code, which effectively becomes your recipe, and it's shareable and transferable in the form of normal Python code.

Abhishek: That's awesome. There are many, many more questions, but I think we're out of time. So thank you everyone for joining, and thank you Moez for taking the time out; it was a really interesting and amazing talk. I hope you'll share the presentation?

Moez: Yes, the presentation is in the GitHub repository, github.com/pycaret/pycaret-demo-at, "at" for Abhishek Thakur if you want to remember it; the presentation, the notebooks, and everything else is there. Also, if we couldn't answer your question here, feel free to ask it on LinkedIn or ask me offline, and I'll be happy to answer.

Abhishek: I will gather all the useful links and share them as a pinned comment, and his LinkedIn profile is in the description, so feel free to reach out to him. If your question wasn't answered and you think it should be, add it as a comment on the video and I can ask Moez to take a look later. Thank you very much to everyone who joined, and we'll see you in two weeks. Thank you, Moez.

Moez: All right, thank you.
Info
Channel: Abhishek Thakur
Views: 4,195
Rating: 4.9595962 out of 5
Id: jlW5kRBwcb0
Length: 65min 44sec (3944 seconds)
Published: Fri Jul 17 2020