End to End Machine Learning Pipeline Creation Using #dvc | Live Demo #dataVersion #mlops #mlpipeline

Captions
Hello guys, welcome back to my channel. Today we'll talk about how to create an end-to-end machine learning pipeline using DVC. As you can see in my terminal, the pipeline has several stages: data split, data processing, train, and evaluate. Once the complete pipeline has run, it generates some visualizations. Here you can see a generated ROC curve with two revisions plotted, the workspace revision (the current version in my working directory) and the HEAD revision (the previous version), each for both train and test data. The dark red curve is the current workspace revision on the test data, and the other is the HEAD revision on the test data; the workspace revision shows a clearly improved ROC value, because we tuned certain parameters between the two revisions and got better accuracy. The same two-revision comparison is available for the confusion matrix and for precision, so we can compare performance directly.

You will also see that a lot of code gets generated, but don't worry: roughly 80 percent of these files are auto-generated. You only have to start with the src folder, which contains a few Python files, and as we move further the remaining folders will be generated automatically.

So let's start fresh, and I suggest you execute every line of code along with me so this becomes a good hands-on session. I begin with a blank, empty folder; there is nothing in it yet. I have already put the starter files in my git repository: the source folder with the Python files, plus a params.yaml file which I am going to explain. So first, clone the repository along with me: `git clone <repository path>`, and hit enter. It has cloned; move inside the dvc folder, and if you expand it you will see all the starter files. Don't worry, I'll explain each and every file in detail.

But first, let me set up the environment. The repository contains a requirements.txt listing the main libraries you need to install. The very first step is to create a Python environment, and I'll create it one level above the project folder: go back one level and run `python -m venv dvc-demo-env`. The moment you hit enter it creates the environment folder, so where before we had only the cloned dvc folder, now a dvc-demo-env folder sits next to it.
Next, activate the environment: make sure you are in the folder that contains dvc-demo-env, then run `dvc-demo-env\Scripts\activate`. Inside the activated environment we need to install certain libraries. You can install them all at once from requirements.txt, but I'll do it one by one: `pip install pandas pyyaml scikit-learn scipy matplotlib`. matplotlib I'll use for visualization, scikit-learn and scipy are the machine learning frameworks, PyYAML is for reading the YAML configuration, and pandas you already know. Make sure you install everything after activating your environment, so the packages land inside the virtual environment rather than at the system level.

Two more things to install. First, `pip install dvc`. Then `pip install "dvclive>=2.0"`; I'll explain shortly what DVC Live is used for. If you don't pin the version it will simply install the latest one, and if pip says the requirement is already satisfied because DVC pulled it in as a dependency, no problem.

Everything is installed now, so move back inside the dvc project folder, with the dvc-demo-env environment still active, and clear the screen. Before building anything, let's go through the files a little, and then I'll show you how to create the end-to-end machine learning pipeline. One more thing: we will also talk about reproducibility, a very important concept in machine learning. Suppose there are four steps in the pipeline and you have changed only the third and fourth: when you execute the pipeline again, it should not re-run the first and second steps, because nothing there changed; it should execute only the third and fourth. That is where reproducibility comes into play, and we will see all of it.

The very first step is the data split: we have some data and we need to split it. I'll be using the red wine quality dataset and building a random forest model that predicts the quality of a wine, whether it is good or bad. Guys, I will go through the model-building code fairly quickly, because the scope here is not to explain how the model is built but how to build the end-to-end pipeline, which you can later deploy; still, for the newcomers, let's quickly understand these pieces too.

The first method I've defined is data_split, and it takes the path of the params.yaml file. Let's talk about that file first. It defines all the configurable parameters, so that nothing is hard-coded inside your code: if you want to change any parameter, you change it here and simply re-run the pipeline, and your code stays intact.
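To make this concrete, here is a minimal sketch of what such a params file might look like. The exact key and file names (data_source, split, process, train, model, the CSV names) are assumptions reconstructed from this walkthrough, not the repository's literal contents:

```yaml
base:
  random_state: 42        # assumed seed value; any fixed seed works

data_source:
  local_path: data_given/winequality-red.csv   # assumed location of the raw CSV

split:
  dir: data/split
  train_file: train.csv
  test_file: test.csv
  split_ratio: 0.3        # assumed test share

process:
  dir: data/processed
  train_file: final_train.csv
  test_file: final_test.csv

train:
  target: quality
  n_estimators: 10

model:
  dir: model
```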
For example, n_estimators is defined as 10; if you want to tune the model you can change it to 20 or whatever you like and re-run the pipeline, without touching any code block. The file also defines my input data source. You can put any path here: a GS bucket, an S3 bucket, any other bucket, or a local path. In my previous video I already covered data versioning with a GS bucket as the remote backend storage, so if you've watched that and understood it, you can define your bucket path here instead, for example gs://dvc_project/<whichever file you want to read>. I'm going to use a local path; strictly speaking the data comes from my GitHub repository, so "git path" might have been a better key name than local path, but let's leave it as it is so I don't have to change the code block.

Now go back to data_split.py. It receives that params.yaml path, opens the file with `with open(...)`, and loads the YAML, so the CSV path from the params file lands in a local variable and is then loaded into a dataframe. Next it transforms the quality column; let me open the data file and show you, that will make it easier to understand. There are several independent variables and a last column, quality, which is numeric (5, 6, 7, and so on). What I've done is: if quality is greater than 6.5, the wine is good quality, denoted by 1; if it is less, it is bad quality, denoted by 0. So there are two classes, it becomes a binary classification problem, and that's why we build a random forest model.

Then random_state and split_ratio: as you know, while splitting you need a random state and a split ratio, and both are read from the params file. The YAML is simple key-value pairs, so random_state comes from the base key and split_ratio from the split key, controlling the ratio on which we split into train and test. The code then calls train_test_split, passing the dataframe, the split ratio, and the random state; fixing the random state means that if you run the code again, it gives back exactly the same records in the train and test samples.

Finally it creates a directory. The output directory, data/split, is also defined in the params file, so whatever train and test data the split produces will be kept under data/split inside this project folder. The code creates the directory, joins it with the configured train and test file names (again key-value pairs under the split key), and saves train.csv and test.csv there with to_csv. At the bottom is the main method, which simply calls data_split with the params.yaml path, so that's where execution starts. Guys, if you already know these things, fine; if you're having a little difficulty, take a pause, digest everything, and then proceed.
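Here is a runnable sketch of the data_split.py just described, assuming the params.yaml keys from the sketch above; the file and helper names are mine, not necessarily the repository's:

```python
import os

import pandas as pd
import yaml
from sklearn.model_selection import train_test_split


def data_split(params_path):
    # Every configurable value comes from params.yaml; nothing is hard-coded.
    with open(params_path) as f:
        params = yaml.safe_load(f)

    df = pd.read_csv(params["data_source"]["local_path"])

    # Quality above 6.5 is a good wine (1), the rest bad (0):
    # a binary classification target.
    df["quality"] = (df["quality"] > 6.5).astype(int)

    train, test = train_test_split(
        df,
        test_size=params["split"]["split_ratio"],
        random_state=params["base"]["random_state"],  # same records on every run
    )

    split_dir = params["split"]["dir"]
    os.makedirs(split_dir, exist_ok=True)
    train.to_csv(os.path.join(split_dir, params["split"]["train_file"]), index=False)
    test.to_csv(os.path.join(split_dir, params["split"]["test_file"]), index=False)


if __name__ == "__main__":
    data_split("params.yaml")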
With the split stage done, the next step is data processing, so there is a data_processing.py file. After the basic imports, it defines a generic data_processing method that takes a data path, because the training and testing data must be passed individually: if you process the whole dataset together and only afterwards do the train-test split, information from the test set can bleed into the training set, the classic data leakage problem, which I hope you know to avoid as a machine learning practitioner. Here the processing is very basic, simply dropping NA rows, but this is where your full, lengthy data-processing code would go; again, the idea is not to explain model building but end-to-end pipeline building.

In the main method, the training data path is built by joining the split directory and the training file name, whatever we generated in the previous step, and data_processing is called with that path to get the processed training data; the same is done for the testing data. The final processed outputs need to be stored somewhere, so if you go back to the params file there is a process section with a directory path plus individual training and testing file names. The code creates that processed directory with os.makedirs, passing exist_ok so that an existing directory is simply skipped rather than raising an error, then generates the final training data path and writes final_train.csv with to_csv, and does the same for the testing data. So this block produces the final processed train and test files, kept inside the processed directory.
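A sketch of data_processing.py along the same lines; the dropna call is the stand-in the walkthrough uses for real cleaning, and the names and keys are again assumptions:

```python
import os

import pandas as pd
import yaml


def data_processing(data_path):
    # Called once per split so nothing computed on the full dataset can
    # leak from the test rows into the training rows.
    df = pd.read_csv(data_path)
    return df.dropna()  # placeholder for the real cleaning / feature engineering


if __name__ == "__main__":
    with open("params.yaml") as f:
        params = yaml.safe_load(f)

    split, process = params["split"], params["process"]
    os.makedirs(process["dir"], exist_ok=True)  # skip silently if it exists

    for src_name, dst_name in [
        (split["train_file"], process["train_file"]),
        (split["test_file"], process["test_file"]),
    ]:
        df = data_processing(os.path.join(split["dir"], src_name))
        df.to_csv(os.path.join(process["dir"], dst_name), index=False)
```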
Our data is now ready, so the next step is to train the model, for which I created train.py. The training function again takes the params.yaml path, which has all the parameters configured, and builds the paths of the final processed training and testing data by joining the configured directory and file names. It reads random_state from the base section, and the target from the train section, which is quality. Everything lives in the params file because if you later want to run the same pipeline on a different dataset with a different target variable, you define it there and the rest of the code stays untouched.

From the train set we build X and y: y_train is the target column taken from the training data, y_test likewise from the test data, and dropping the target from the training data gives all the independent variables in X_train (similarly for the test data). It then reads n_estimators from the train section, 10 for now, constructs a RandomForestClassifier with the random state and the number of estimators, and calls fit, which is the training part. The model directory, where the model will be saved, again comes from the params file; if the directory doesn't exist the code creates it, so nothing throws an error, and finally the trained model is simply dumped to disk. The main method calls train with the params.yaml path, and with that the training stage is done.
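A sketch of train.py under the same assumed params layout; pickle is one reasonable way to dump the model, though the repository may use joblib instead:

```python
import os
import pickle

import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier


def train(params_path):
    with open(params_path) as f:
        params = yaml.safe_load(f)

    process = params["process"]
    data = pd.read_csv(os.path.join(process["dir"], process["train_file"]))

    target = params["train"]["target"]     # "quality"; swap it for another dataset
    y_train = data[target]
    X_train = data.drop(columns=[target])  # all the independent variables

    model = RandomForestClassifier(
        n_estimators=params["train"]["n_estimators"],
        random_state=params["base"]["random_state"],
    )
    model.fit(X_train, y_train)

    model_dir = params["model"]["dir"]
    os.makedirs(model_dir, exist_ok=True)  # create it so nothing throws
    with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
        pickle.dump(model, f)


if __name__ == "__main__":
    train("params.yaml")
```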
Now, evaluate.py. After the basic imports, the evaluate method generates a metrics JSON file holding the ROC AUC and average precision scores, and it also generates certain plots. What does it need? The params.yaml path, from where it reads the configured parameters; the model; the data; the data category, meaning whether we're measuring performance on the training data or the testing data; and a live object. That live object comes from DVC Live, which you'll remember we installed earlier: with it you can log everything you generate, plots, metrics, any JSON file. I create the Live instance inside the main method, and I'll come back to it; for now just understand that live is for logging purposes.

Inside evaluate, the params file is opened just like in the other .py files and the target is read. For whatever data was passed in, training or testing, the target and independent variables are separated, and model.predict_proba gives the predicted probabilities for each class; if you simply wanted the classes directly, 0 or 1, you would call model.predict, the normal machine learning world. Since our positive class, 1, is good wine, indexing the probability array for that class gives the predictions with respect to the positive class, which is what I do here.

With the predictions done, what's left is to record the results, and that's where DVC Live logs a few sample metrics. For average precision, sklearn's metrics.average_precision_score takes the true values, y.values, and the positive-class predictions, and returns the average precision score; similarly metrics.roc_auc_score takes the true values and the positive-class predictions and returns the ROC AUC. The scores go into live.summary as nested keys: if the summary is empty, a blank structure is initialized first, and then under avg_prec and roc_auc a sub-key for the data category, train or test, holds the corresponding score; call the method first for training data and then for testing data and both sides get filled in. Next, the plots. DVC Live provides log_sklearn_plot: to plot the ROC curve I pass the key "roc", the true values, and the positive-class predictions, plus a name of the form roc/train or roc/test depending on the data category, and that name determines the folder structure of the logged files. So far this is all normal code; the main work starts when we build the pipeline.
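The metric-logging core of evaluate.py might look roughly like this; live.summary and log_sklearn_plot are real DVC Live 2.x APIs, but the key nesting simply mirrors this walkthrough rather than any required schema:

```python
from sklearn import metrics


def log_metrics(live, y, preds_by_class, data_category):
    # preds_by_class holds the predict_proba scores for the positive class (1).
    avg_prec = metrics.average_precision_score(y.values, preds_by_class)
    roc_auc = metrics.roc_auc_score(y.values, preds_by_class)

    # Nest each score under the data category so train and test sit
    # side by side in the summary JSON.
    if not live.summary:
        live.summary = {"avg_prec": {}, "roc_auc": {}}
    live.summary["avg_prec"][data_category] = avg_prec
    live.summary["roc_auc"][data_category] = roc_auc

    # DVC Live renders sklearn plots directly; the name controls the
    # on-disk layout (plots/sklearn/roc/train.json, .../test.json).
    live.log_sklearn_plot("roc", y, preds_by_class, name=f"roc/{data_category}")
```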
Next I want the precision, recall, and threshold values, and those come from sklearn's metrics.precision_recall_curve: pass the true values and the predictions and it returns all three arrays. I then thin them out, zipping precision, recall, and thresholds together and keeping every nth point. For storage, an eval/prc directory is built with os.path.join and created, and inside it a JSON file is written, train.json or test.json depending on which data category was passed in. json.dump writes a prc parent key whose value is a list of entries, each holding a precision, a recall, and a threshold value, built by iterating `for p, r, t` over those thinned points; the file handle passed to json.dump is the open JSON file, so everything is written there, and an indent of 4 keeps it readable. Similarly, the confusion matrix is logged with live.log_sklearn_plot, passing the "confusion_matrix" key, the true values, and the predicted classes, the data required to plot it, and Live logs everything.

Then comes the main method, which first gathers all the inputs the evaluate method needs from the params file. It opens the params file, reads the model directory from it, and loads model.pkl, the file we created in train.py. It builds the final train data path and reads the train CSV, and similarly reads the test CSV. An eval path is defined so that all evaluation-related output lands under eval/, and the Live instance is created with os.path.join(eval_path, "live"), so that whatever you log through live is written inside eval/live. evaluate is then called twice, first for the train data and then for the test data, and finally live.make_summary() summarizes everything. Lastly, I dump a feature importance image to show alongside the plots: the importances come from this random forest model, the same familiar machine learning world, they are plotted, and the figure is saved with fig.savefig into the eval folder. With that, the whole code block is done.
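Continuing the previous sketch, here is the precision JSON writing plus a main that wires everything together. evaluate() below is my assumed glue (it reuses log_metrics from the sketch above), the thinning factor of 1000 is a guess, and the feature-importance plotting is left as a comment:

```python
import json
import math
import os
import pickle

import pandas as pd
import yaml
from dvclive import Live
from sklearn import metrics

EVAL_PATH = "eval"


def log_precision_recall(y, preds_by_class, data_category):
    precision, recall, thresholds = metrics.precision_recall_curve(
        y.values, preds_by_class
    )
    nth = math.ceil(len(thresholds) / 1000)          # keep the JSON small
    points = list(zip(precision, recall, thresholds))[::nth]

    prc_dir = os.path.join(EVAL_PATH, "prc")
    os.makedirs(prc_dir, exist_ok=True)
    with open(os.path.join(prc_dir, f"{data_category}.json"), "w") as fd:
        json.dump(
            {"prc": [{"precision": float(p), "recall": float(r), "threshold": float(t)}
                     for p, r, t in points]},
            fd,
            indent=4,
        )


def evaluate(params, model, df, data_category, live):
    target = params["train"]["target"]
    y = df[target]
    X = df.drop(columns=[target])
    preds_by_class = model.predict_proba(X)[:, 1]    # positive-class probabilities

    log_metrics(live, y, preds_by_class, data_category)  # helper from the sketch above
    log_precision_recall(y, preds_by_class, data_category)
    live.log_sklearn_plot(
        "confusion_matrix", y, model.predict(X), name=f"cm/{data_category}"
    )


if __name__ == "__main__":
    with open("params.yaml") as f:
        params = yaml.safe_load(f)

    with open(os.path.join(params["model"]["dir"], "model.pkl"), "rb") as f:
        model = pickle.load(f)

    process = params["process"]
    train_df = pd.read_csv(os.path.join(process["dir"], process["train_file"]))
    test_df = pd.read_csv(os.path.join(process["dir"], process["test_file"]))

    # Everything logged through this Live instance lands under eval/live.
    live = Live(os.path.join(EVAL_PATH, "live"))
    evaluate(params, model, train_df, "train", live)
    evaluate(params, model, test_df, "test", live)
    live.make_summary()
    # (The feature-importance bar chart saved to eval/importance.png is
    # omitted here for brevity.)
```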
So, very quickly, what have we done so far? We have defined data_split.py, data_processing.py, train.py, and evaluate.py. Hyperparameter tuning I will talk about in the next video, otherwise this one becomes too lengthy, too much to absorb.

The next thing is to create the data pipeline across these steps. As I showed at the very beginning, this is how the pipeline should look, with four different stages, and the place to define them is a dvc.yaml file, similar in spirit to params.yaml, where stages are simply the different tasks within the pipeline. This dvc.yaml is also placed in the git repository, so you could take it from there, but let me explain how it is built. Instead of typing everything manually, DVC can generate it: the command is `dvc stage add`, where -n gives the stage name, -p the parameters, -d the dependencies, and -o the outputs. For the data_split stage, the parameter dependencies are the local data path under data_source, random_state under base, and split_ratio under split, all read from params.yaml; the file dependency is the data_split.py file inside src; the output it generates is the data/split folder; and the command to run is `python src/data_split.py`.

Let me run that quickly, because from here, guys, the main work starts. Immediately it says we are not inside a DVC repository: the clone initialized git, but we have not initialized DVC, so run `dvc init`. Now DVC is initialized, and you can see a .dvc folder got created. I've made a separate video on data versioning where I explain this folder and its config file; in short, config holds the remote path where the data versions are kept, and you define it with `dvc remote add -d <name> <path>`. I'm going to use a local folder as the backend storage, so I name the remote local; in my previous video I did the same thing with a GS bucket, so refer to that if you'd rather use a bucket. I have already created an empty dvc_remote folder, one level back and one more level back from the project, and I point the remote at it; the moment the remote is added, you can see the config file gets updated.
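In commands, that one-time setup looks like this; the relative path to dvc_remote is my guess at the layout described, and the config excerpt is approximately what DVC writes, not a verbatim copy:

```sh
# One-time DVC setup inside the cloned repo; ../../dvc_remote is my guess
# at the "two levels up" local folder standing in for a GS/S3 bucket.
dvc init
dvc remote add -d local ../../dvc_remote

# .dvc/config is updated immediately; it ends up looking roughly like:
#   [core]
#       remote = local
#   ['remote "local"']
#       url = ../../dvc_remote
```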
Now that DVC is initialized and the remote is configured, we can run the stage command. Clear the screen, paste the `dvc stage add` command for the data split, and run it. You'll see a new file, dvc.yaml, got created, containing a stage called data_split: the cmd it needs to run (`python src/data_split.py`), the deps (src/data_split.py inside the src folder; if that file is missing, the stage won't execute), the params it reads from params.yaml, and the outs (data/split). Why define all of this? Because DVC will execute the stage only if something changed, either the data_split.py file or one of the listed parameters; run it a second time with no changes and it will not execute, reproducing the output from cache instead. That is the reproducibility we talked about.

So the data_split stage is created; second, we add the data_processing stage. Its name is data_processing; its parameter dependencies are split.dir (where the split train and test data live), split.train_file, split.test_file, process.dir, process.train_file, and process.test_file, all coming from the params.yaml we started with (remember, we began with only src and params.yaml; everything else is being created step by step); its file dependencies are data_processing.py plus the pre-existing data/split folder; its output is data/processed, the directory where all the processed files will be kept; and last comes the command it will execute. Run the whole thing; it has run, and if you look into dvc.yaml, the data_processing stage is there with its command, its dependencies (the data/split folder must exist and data_processing.py must exist), the parameters read from params.yaml, and the stage output. The exact shape of the two commands so far is sketched below.
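For reference, the first two `dvc stage add` calls might look roughly like this; Unix-style line continuations are added for readability, and the script paths and parameter lists are reconstructed from the narration rather than copied from the repository:

```sh
dvc stage add -n data_split \
    -p data_source.local_path,base.random_state,split.split_ratio \
    -d src/data_split.py \
    -o data/split \
    python src/data_split.py

dvc stage add -n data_processing \
    -p split.dir,split.train_file,split.test_file,process.dir,process.train_file,process.test_file \
    -d src/data_processing.py -d data/split \
    -o data/processed \
    python src/data_processing.py
```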
Similarly, the next stage is train. I define the name as train; the parameter dependencies are whatever the training code reads from the params file (the random state, the number of estimators, the target, the model directory); the file dependencies are train.py plus the data/processed directory, which is the outcome of the previous stage; and the output is model/model.pkl, since train.py creates the model directory and produces the model file inside it, with the training command last. Run it, and the train stage appears in dvc.yaml with its command, its dependencies, its parameters, and its output.

The last one is evaluate. Its name is evaluate; it requires no parameter dependencies; its file dependencies are evaluate.py, model/model.pkl (which must exist), and data/processed; a -M flag registers eval/live/metrics.json as the metrics file; and the outputs are eval/live/plots, eval/prc, and eval/importance.png. This whole folder structure will be created the moment we run the pipeline, and I'll map everything for you then. Run this as well, and you see evaluate also got created in dvc.yaml, with its outputs and its metrics-related entries. So far, nothing new exists on the left-hand side except the .dvc folder and dvc.yaml. For reference, the complete stages section these four commands generate should look roughly like the sketch below.
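A sketch of the generated dvc.yaml, putting the four stages together under the path and key assumptions used throughout; the repository's actual file may differ in details:

```yaml
stages:
  data_split:
    cmd: python src/data_split.py
    deps:
      - src/data_split.py
    params:
      - data_source.local_path
      - base.random_state
      - split.split_ratio
    outs:
      - data/split
  data_processing:
    cmd: python src/data_processing.py
    deps:
      - src/data_processing.py
      - data/split
    params:
      - split.dir
      - split.train_file
      - split.test_file
      - process.dir
      - process.train_file
      - process.test_file
    outs:
      - data/processed
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed
    params:
      - base.random_state
      - train.n_estimators
      - train.target
      - model.dir
    outs:
      - model/model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - model/model.pkl
      - data/processed
    metrics:
      - eval/live/metrics.json:
          cache: false
    outs:
      - eval/live/plots
      - eval/prc
      - eval/importance.png
```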
Now it is time to run the pipeline. We started with n_estimators at 10, and all these parameters feed into the scripts. How do you run the whole pipeline? `dvc repro`. Hit enter and the stages run one by one: it runs python src/data_split.py and generates dvc.lock, a file similar to dvc.yaml that holds the cached state; then it runs data_processing, then the train stage, then evaluate. Once everything has run, you see many new folders (the green ones): a data folder containing split and processed, where data_split created the train and test CSVs and data_processing created the final train and final test CSVs; an eval folder holding everything logged through Live, plus the precision train.json and test.json we generated; and a model folder with the model.pkl file. A .gitignore also got created, because the large files will be tracked by the backend storage rather than by git.

So the pipeline has been reproduced. Next, clear the screen and run `dvc dag`: it prints the pipeline graph, exactly the four-stage flow I showed at the start. Then, clear the screen again so we have some space, and look at the metrics: `dvc metrics show` lists what was recorded, average precision of 65 percent on the test data and 99 percent on the train data, and ROC AUC of 88 percent on test and 99 percent on train.

Now, if I run `dvc plots show`, it says no plots were loaded and no visualization file will be created. There is a reason: nothing has been added for plotting in the dvc.yaml yet, so you need to add some more configuration. Just as dvc.yaml has a stages section, it can have a plots section, and DVC defines several plot templates; I'll give you the documentation link where you can find them. For the ROC plot, the x axis is the fpr values and the y axis the tpr values, and everything is read from the JSON files Live logged: go inside eval/live/plots/sklearn/roc and you'll find train.json and test.json with the fpr and tpr values for each split. For the confusion matrix, the template is confusion, with actual values on the x axis and predicted values on the y axis, again for train and test. The precision-recall curve reads precision and recall from the files under eval/prc, and the feature importance is simply the eval/importance.png image, which also gets plotted (it takes a moment, but it appears). You just need to define all of this in the YAML so the files can be read properly; the plots section to append looks roughly like the sketch below.
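A plots section in that spirit, patterned on DVC's documented plot templates; the JSON paths assume the Live plot names used earlier (roc/…, cm/…) and the eval layout from evaluate.py:

```yaml
plots:
  - ROC:
      template: simple
      x: fpr
      y:
        eval/live/plots/sklearn/roc/train.json: tpr
        eval/live/plots/sklearn/roc/test.json: tpr
  - Confusion-Matrix:
      template: confusion
      x: actual
      y:
        eval/live/plots/sklearn/cm/train.json: predicted
        eval/live/plots/sklearn/cm/test.json: predicted
  - Precision-Recall:
      template: simple
      x: recall
      y:
        eval/prc/train.json: precision
        eval/prc/test.json: precision
  - eval/importance.png
```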
Paste that whole plots section at the end of dvc.yaml and save. Now run `dvc plots show` again: it creates a dvc_plots folder with an index.html file inside. Copy the path of the index file, paste it into the browser, and everything gets plotted for train and test: the ROC curve (the blue one is the test accuracy, the other the train, and of course the closer the curve hugs the top corner the better your model), the configured confusion matrices (I'm not going to explain how to read a confusion matrix, I assume you all know, since you're here learning end-to-end pipeline building), the precision-recall curve, and the feature importance.

What's next, guys? Next you change certain parameters, see how your model behaves, and track the different versions of the model. For that, we first track the existing state. Run `dvc push`: it pushes all the data and models to the remote backend. In fact, right after `dvc repro` finished, it said at the end to use dvc push to track all the data and models. Before running dvc push, make sure you have already added a remote backend in your DVC config file, this remote we set up at the start. After the push, open the dvc_remote folder and you'll see files got created there: DVC tracks data and models in terms of binary large objects, that is, blob storage. Now everything is tracked, so `dvc status` shows nothing, everything is up to date.

Then the git side. `git status` shows the metadata files as untracked, so run `git add --all` to add everything, then `git commit -m "first version"`, define a tag with `git tag v1.0`, then `git push`, and push the tag as well with `git push --tags`. Now go to the git repository and refresh: inside data there is no data file, only the .gitignore; inside model, no model file either, only .gitignore. All the big files, the data and model files, are tracked in the backend storage, while all the metadata files are tracked in the git repository, and you can see the tags created. If you ever want to clone just the original starter files, the source folder and params.yaml, you can check out the base state; for this version, you check out v1.0.

Now the next very important thing comes into play. In params.yaml I'm going to change the estimators, say to 25. Before that, let me show the metrics again with `dvc metrics show`: our average precision score on the test data is 65 percent.
There is a chance to improve that, right? I'm not going into deep hyperparameter tuning, that's for the next video; here it's the most basic move, just increasing n_estimators from 10 to 25 and saving. Now run `dvc repro` again. Guys, I have changed only a train parameter, so ideally it should not run the data split and data processing stages, and let's see what it does: data_split didn't change, so it skips it; data_processing didn't change, so it skips that one as well; and it runs the train stage, because we changed its parameter, and then evaluate does its updates, because there is a newly generated model. And see the goodness here: you never went inside train.py, you simply changed params.yaml, and everything else ran smoothly.

At the end it again says to use dvc push to send your updates to remote storage. At this point, run `dvc status`: the pipelines are up to date, because dvc repro caches everything as it goes, much like git status shows nothing after a git commit, but nothing has been pushed to your remote storage yet, so run `dvc push`. Then the git side again: `git status`, `git add --all`, `git commit -m "second version"` (after fixing my spelling mistake), then `git tag v2.0`, `git push`, and `git push --tags`. Everything is pushed and tracked. Now you want to see the difference, so run `dvc metrics show` (not git, DVC): where average precision on test was around 65 before, it has improved to 74.
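For reference, here is that whole per-revision cycle in one place; the tag name and commit message are just this walkthrough's choices:

```sh
# 1. Edit params.yaml: train.n_estimators 10 -> 25
dvc repro            # re-runs only the train and evaluate stages
dvc push             # ship the new data/model blobs to the DVC remote
git add --all
git commit -m "second version"
git tag v2.0
git push
git push --tags
dvc metrics show     # confirm the improvement
```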
The workspace currently holds the latest version, so if you run `dvc metrics diff` right now it shows nothing: because we pushed and committed everything, the current working directory and HEAD are identical. No worries, let me make one more revision: set n_estimators to 35 for a third revision, so that HEAD now corresponds to the 25-estimator version and the workspace to 35, and run `dvc repro` again; once more, only train and evaluate execute.

Now `dvc metrics show` reports about 75, not much of an improvement, but `dvc metrics diff` is where it gets interesting. Run before dvc push, it shows the differences: average precision on test at HEAD is 0.74109, the 25-estimator revision, against the workspace value for 35, a change in the positive direction; for train it stays at 1.0 with no change; and ROC AUC on test moves from 92.0 percent to 92.6 percent, an improvement. Similarly, `dvc params diff` shows the parameter difference, 25 at HEAD versus 35 in the workspace, so whatever HEAD metrics you saw correspond to n_estimators 25 and the workspace metrics to 35, and you can connect the improved performance to the parameter change. And `dvc plots diff` shows the plot differences as well: it generates a new index.html, so copy the path and open it in the browser. Compared to the earlier page, which had only train and test for n_estimators equal to 10, there are now multiple revisions: the workspace revision with 35 estimators and the HEAD revision with 25. The ROC curves slightly overlap, because there is not much improvement going to 35, but there is still a slight one, and multiple graphs are plotted, this one for 25 and the red one for 35. Looking at the test curves only (you can ignore the train ones, which sit at almost 100 percent), you can compare the HEAD confusion matrix, n_estimators equal to 25, against the one for 35, and similarly the precision-recall and feature importance plots. That's how you generate the plots and do the comparison.

If you run `dvc status` again you won't see anything, because as I said dvc repro already stages everything into the cache, but we still need `dvc push`, and the git metadata files still have to be committed, since everything points through the metadata files.
The moment I commit and git push, you will no longer see the differences locally; after that, you have to check out a particular version. So: `git add --all` adds everything, then `git commit -m "third version"`, `git tag v3.0`, `git push`, and `git push --tags`. Now, I'm sure that if you run `dvc metrics diff` it shows nothing, because there is no difference any more, you have only the latest version in the workspace; `dvc metrics show` still works, showing that latest version.

But now, if you want to go back to the first version, what do you do? `git checkout v1.0` switches to the very first version, then `dvc pull` fetches the matching data and model. Now `dvc metrics show` displays the metrics for the first version, the 65 percent we got when n_estimators was 10, and you'll see the parameter has changed back to 10. Then `git checkout v2.0` followed by `dvc pull` brings back the 25-estimator revision, and `dvc metrics show` displays the metrics for n_estimators equal to 25. This is how you switch between different tags, different revisions.

So guys, I hope you enjoyed this session. Let me also tell you a troubleshooting step: suppose you're having a problem with a particular stage, for example you made a mistake in the `dvc stage add` for data_split and want to remove it and add it freshly; run `dvc remove data_split` and it removes that stage, and then you can execute the corrected command to add it again.

I hope you liked the video, and if yes, please don't forget to subscribe to the channel and share it with the ML community; that's how you can support me. And be ready for the next video, where I'll show you how to do hyperparameter tuning: if you look in the source folder, there is already a hyperparameter tuning file, and I'll cover that whole thing and show you how to track the different revisions again. That's it, guys, thank you very much.
Info
Channel: Ashutosh Tripathi
Views: 3,387
Keywords: machine learning pipeline creating using dvc, experiment tracking using dvc, data versioning using dvc, dvc, dvc tutorial, dvc pipeline tutorial, machine learning pipeline building, mlops, mlops tutorial, mlops pipeline building, end to end ml pipeline building with dvc, ashutosh tripathi mlops, data science, machine learning, data version control, end to end machine learning project deployment, end to end machine learning, machine learning tutorial, ashutosh sir
Id: KjEkn5qz5zM
Length: 53min 28sec (3208 seconds)
Published: Mon May 01 2023