Revealing My New AI-Powered Data Science Workflow

Video Statistics and Information

Captions
Okay, so my data science workflow has completely changed over the past couple of weeks with the introduction of tools like ChatGPT, the GPT-4 model, and GitHub Copilot. It's just not the same anymore, and working with these new AI-powered tools is really a skill on its own. I've been practicing with them, putting them to the test, and trying to be creative with them in my data science projects. In this video I'm going to review my new workflow: basically everything I've learned and how I interact with these tools to complete my data science projects as fast and efficiently as possible. This really is a complete game changer, and these tools aren't going anywhere; they will only get better over time. So if you want to stay relevant as a data scientist, engineer, or analyst, you really have to learn how to use these tools effectively, and this video will help you with just that. Let's get into it.

To keep this as realistic as possible, I've looked up an old Kaggle competition that we can download a dataset from. It's about a multi-stage continuous-flow manufacturing process, and the page describes the primary goal: to predict some measurements given the data. This project and dataset are new to me; I've only downloaded the dataset and looked at it briefly to see if it's relevant for what I want to do. Other than that, I'm completely new to this data science project and this dataset. What I want to do is go through it brand new, like I would start any new project, and then go step by step and show you my thought process and how I interact with ChatGPT and GitHub Copilot to deliver on the primary goal: predict measurements of the output from the first stage. We'll get into that in a bit, but that is basically the setup. I'll share this document with you in the description, where you can also look up the Kaggle competition and download the dataset if you want to follow along.

Throughout this video I'll be using my data science project template, which you can download as a zip file; if you want to learn more about that, check out my video on the best way to organize your data science projects. We'll be using VS Code for data science, and if you want to learn how to set that up, check out that video as well. For our AI tools we'll use GitHub Copilot and GitHub Copilot Labs, and if you don't have those set up, check out my video on how to use GitHub Copilot for data science, where I cover everything. Finally, we'll be using ChatGPT. I'm on the paid subscription, so I'll be using GPT-4, their newest model, which is a lot more powerful for coding than GPT-3.5, but if you're on the free version you can still follow along just fine.

Alright, I'm now in a brand new VS Code workspace. You can see we're using the data science project template, and the only thing I've done is download the data, so the CSV file, along with a notes-on-dataset txt file with more info about the dataset; that will be useful in a bit. Before we start, as with any data science project, we first have to create an environment to work in, and that is the first step where I'm going to ask ChatGPT for help, because I don't know about you, but I always tend to forget the specific syntax and commands for conda to create and work with environments, so I always have to look those up.
The goal is to create a basic data science Python environment from a YAML file. So I come back to ChatGPT and create the first prompt: create a conda YAML file called environment.yaml with the most common packages for data science. I specify that I want to use Python 3.10 and that I want to call the environment manufacturing-process. GPT-4 is a little slower than GPT-3.5, so while this is loading I come back to VS Code, open a new terminal, and do a quick touch environment.yaml. Then back in ChatGPT we can see the result: a nice setup for a YAML file, plus the specifics of how to install it, and this is the command I also always forget (where does the -f flag go again?), so this is just really useful. I copy that over, and I can already tell there is quite a lot in there, probably a lot of packages we don't need for now, so I get rid of most of them. Here you can also see that I'm not copying everything blindly: I look at my own specific situation, start with the basics, and we can always add more libraries later. I also copy the command to create the environment. First I save the file, and since the terminal is already in the same directory as environment.yaml, I can run conda env create; it loads everything up and asks if we want to proceed.

The environment is now installing. I got rid of some more dependencies because it was stuck on solving the requirements for a long time, and now it's installing correctly. Once it's done, we can come to the source folder, where I already have a file called make_dataset.py, and since we're working in a Python file we can now select our brand new environment: Python 3.10, manufacturing-process. Then let's see if we can fire up an interactive Python session and load NumPy, pandas, and matplotlib. The imports went fine, so we now have a brand new environment to work in. Let's load the data frame and see if that works as well. All good, so the first step is out of the way. I'm also quickly going to install black, as VS Code is asking me for it, because I use black as my code formatter, and I make sure to add it to the environment file for reuse later. If you don't know what black is: it automatically formats your code following the PEP 8 style guide, and you can set it up to run whenever you save a file in VS Code, which is really helpful to keep your code nice and organized.
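For reference, a minimal sketch of what such an environment.yaml might look like. The exact package list ChatGPT produced isn't shown in the video, so the contents below are an assumption, trimmed to the basics mentioned (Python 3.10, NumPy, pandas, matplotlib, plus black as the formatter); the hyphenated environment name is also assumed, since conda names can't contain spaces.

```yaml
# environment.yaml -- create the environment with: conda env create -f environment.yaml
name: manufacturing-process
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - pandas
  - matplotlib
  - black
```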
Now that we are up and running, let's come back to the original problem statement and figure out what it is that we actually have to do. Again, this is also new for me; I've only looked at it briefly, so I know there are two goals, a primary and a secondary, and for now I just want to focus on the primary goal: predicting measurements of the output from the first stage. We are dealing with a manufacturing process, and for the first stage, which we'll be focusing on, there are machines one, two, and three. They operate in parallel and feed their outputs into a step that combines the flows, so this is probably some kind of chemical process. Eventually all the flows from machines one, two, and three come together, and once they are mixed, measurements are taken in 15 locations; those are the primary measurements to predict. So we have three machines, they do all kinds of stuff, the flows come together, and then we have 15 measurements that we want to predict. This company is probably trying to build some kind of simulation or process optimization model to figure out what should happen in machines one, two, and three in order to optimize the output coming out of them. That's all I know for now, so let's look at the data and try to figure out what these machines are, what they measure, and what the outputs are.

Let's come back to our data frame and first run df.info(). Since there are a lot of columns, 116 in total, it doesn't show us the column list, but we can set verbose=True, and here you can already see GitHub Copilot doing its thing. Copilot will be more subtle within this project, and I'll illustrate when and how I use it, but the big things will happen in ChatGPT. Running info with verbose set to true shows us all 116 columns. We have a timestamp, so this is time series data. Let me also quickly grab the notes, which are a description provided by the company: the sample rate is 1 Hz, meaning there is a measurement every second; the machines operate in parallel, as we've seen; measurements could be noisy; we have setpoints and actual values; and we have ambient conditions, so probably the room these machines are in: temperature and humidity, which makes sense, and these are actual values. Then we have machines one, two, and three with raw material temperatures and all kinds of measurements, and finally the 15 outputs that we want to predict, each with a setpoint and an actual value. I think we can first trim this dataset down to only the columns that are relevant for the primary goal, meaning predicting the values at those 15 locations, and use the actual values rather than the setpoints. So let's come back to our dataset and say selected_columns equals a list that I build from the data frame's columns, taking them up until column 71.
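A minimal sketch of that first inspection and column selection; the file name and path are assumptions based on the Kaggle download, since the transcript doesn't show them.

```python
import pandas as pd

# File name assumed from the Kaggle dataset; adjust the path to your project layout
df = pd.read_csv("../../data/raw/continuous_factory_process.csv")

# With 116 columns the default summary truncates the column list,
# so verbose=True prints every column with its dtype and non-null count
df.info(verbose=True)

# Keep only the columns relevant to the primary goal: everything up to and
# including the first-stage output actuals (71 columns, as in the video)
selected_columns = list(df.columns[:71])
df = df[selected_columns]
```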
Does that make sense? I think that is correct, so those are the columns we want in there, and we'll probably want to trim further. We don't even have to create the list manually; we can do it like this and then say df equals df with the selected columns. Now we have a subset with 71 columns in total. If we run the same info command again, now with fewer columns, we still have a lot of setpoints in there that I'm not particularly interested in, and here I can show you how we can use GitHub Copilot. I create a comment that says "get rid of columns that contain setpoint in their name", and when I start typing df =, Copilot gets to work: it suggests taking the location of the data frame where the column string contains "Setpoint" and negating that. Let's see if that works: 57 columns. Save that, run the info again, and now we only have the actual values in here, so it worked. Here you can see how I work with GitHub Copilot by writing a comment first; Copilot then already knows what I want to do. It's basically a prompt that you could also put into ChatGPT, but it's faster to do it like this, so you really have to figure out the nuances of when to use GitHub Copilot and when to use ChatGPT.

The next thing I see is that the timestamp is currently stored as an object, meaning pandas does not know it's an actual timestamp. We want to convert that, and look, GitHub Copilot is already suggesting the conversion, so yes, we definitely want that. If we run this and look at the timestamp again, pandas now understands that we're dealing with a timestamp. Since this is a time series dataset, we typically want to set the timestamp as the index, because that unlocks some of the timestamp functionality within pandas. So let's do that right now; I start typing df and Copilot already knows I probably want to set it as the index, and now we have the timestamp index. The dataset looks pretty good for now.

You can see that within this code we've worked kind of messily, in the sense that we just dumped a bunch of code in here; one line has a comment, the others don't. So what I'll show you now is what I would use ChatGPT for at this point: we read in the pandas data frame, I copy and paste all the code we've created, come back to ChatGPT, and ask it to create a nice Python function out of it. I just dump it in, and you can see it creating this nice function for us. Let's come back, get rid of all of this, put in our neat little function, and store it; it also tells us how to run it, so I copy that as well. Now let's see: I start a completely fresh interactive session, load the data frame (so here we have the data frame with all 116 columns), store the function in memory, and then just run it, and boom, we now have our neat and tidied-up data frame, complete with documentation.
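Putting the steps from this section together, here is a sketch of the kind of tidy-up function ChatGPT produced; the timestamp column name ("time_stamp") and the "Setpoint" substring in the column names are assumptions based on the Kaggle dataset's naming, and the exact generated code may differ.

```python
import pandas as pd


def process_data(df: pd.DataFrame) -> pd.DataFrame:
    """Trim the raw manufacturing data down to what the primary goal needs."""
    # Keep only the columns up to and including the first-stage outputs
    df = df[df.columns[:71]]

    # Drop every setpoint column; we only model against the actual values
    df = df.loc[:, ~df.columns.str.contains("Setpoint")]

    # Parse the timestamp and use it as the index to unlock pandas'
    # time series functionality
    df["time_stamp"] = pd.to_datetime(df["time_stamp"])
    df = df.set_index("time_stamp")

    return df


df = process_data(pd.read_csv("../../data/raw/continuous_factory_process.csv"))
```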
Now, once you come back to this project later or hand it over to a colleague, you actually understand what's going on, and also the rationale behind the steps we are taking, so that's a beautiful example. We are dealing with a data frame with all numerical values, and there don't seem to be any missing values in any of the columns, so that's all good.

Our goal for this project, if I'm understanding it correctly from a machine learning perspective, is to create a model that can take in all of these input values and predict the outputs of stage 1, of which there are 15 in total. We could come up with an approach that predicts all 15 at a time, or we could create a model for each specific value, starting with Measurement0 and building a model for just that. It probably makes more sense to start simple, so let's try to come up with a model that, given all of this prior information, makes a prediction for the stage 1 output for Measurement0.

Whenever I do a data science project, I follow the data science lifecycle to some degree, so I really start with the business understanding. Since we cannot talk to the client in this case, we can't know for sure why they want to predict this, but it's probably, like I said, from a simulation or process optimization point of view. We are now in the data understanding and data preparation phase, and to better understand the data we are going to create some plots. As of right now it's pretty vague, right? We've looked at the brief description on the Kaggle page, we have some info in the notes, we know we have some machines, but what are we actually dealing with? So let's briefly export this dataset and then load it into a new Python file in the visualization folder, where we can create some plots. Did you see that? This is actually pretty freaky: I typed "plots" and Copilot suggested creating a function called plot_dataframe. Is GitHub Copilot also listening, or is it just the logical next step? I don't know. For now, let's store the data as a pickle file, so we call to_pickle and store it in the data/interim folder as data_processed.pickle. Now we have a pickle file in interim, we can save this file and come over to visualize.py, where I run my data science imports snippet again and read the data back in. Now we have a file that we can use to create plots of the data frame.

We have a lot of columns to visualize, and I have an idea here; I'm not sure if it will work out, but there are a lot of similar values: we have machine one with all the raw materials, machine two, machine three, all the outputs, and also the first-stage temperatures. I think it would be nice to create plots where we group certain values together. It's a time series, so they all share the same timestamp, and we can use that to combine multiple line graphs in a single plot. Let's ask ChatGPT to come up with that, because coding all of that by hand would take quite some time.
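Before getting to the plots, the hand-off between make_dataset.py and visualize.py described above looks roughly like this; paths follow the project template's data/interim convention mentioned in the video.

```python
# make_dataset.py: persist the tidied data frame for the other scripts
df.to_pickle("../../data/interim/data_processed.pickle")

# visualize.py: read it back in without repeating the preprocessing
import pandas as pd

df = pd.read_pickle("../../data/interim/data_processed.pickle")
```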
We need to figure out what we should combine and how to specify, for example, the multiple labels we want to include within each plot so we can understand what's going on. Let me just copy all of this, the columns and even the index, come over to ChatGPT, paste it in, scroll all the way to the top, and make some space for the prompt. This is my prompt: I want to better understand this dataset; can you provide me with code to create line plots for all of the columns and group similar data together; make sure to clearly label everything based on the column names. And then this is something I've added, and I'm not sure if it will work out, but I say: don't group by the machines, but rather by the properties, because I can imagine that if you group everything from machine 2, properties 1, 2, and 3 might be on completely different scales, whereas if you combine, for example, property one from machine 1 with property one from machine 2, those will probably be on the same scale. That's my rationale behind this prompt, so let's see what it comes up with and then we'll go from there. This is a little trick: provide it with lots of input. Especially with GPT-4 you can, I believe, insert up to 25,000 tokens, so you can give it lots of information, even documentation.

Here you can see what it's coming up with: we're going to use matplotlib and seaborn, so let me already import seaborn; I think that was installed in the environment, right? Yes, seaborn was in there. Back to ChatGPT to see what we're dealing with. It groups the columns based on their property, so we have ambient, machine, combiner, and stage. It seems like it is combining the machine columns, but let's just see what it does. We already have the data frame, so we can start from here. I don't expect this to work straight out of the box, but let's see what it's grouping: we have ambient, which should work, although humidity and temperature will probably be on different scales; then we have all the machines; then the combiner columns; and then the stage outputs. That makes sense, and now it creates a function to plot these lines. I like to make plots a little wider, that's just my personal preference, and then let's see what happens if we plot the ambient columns. Here you can see the humidity and the temperature: temperature is the line above the orange line, and the humidity is the other one. This actually works quite well, since they are somewhat on the same scale even though they don't share the same units. But with the machine properties we are going to run into trouble, so let's first get a brief overview of the mess this will create. I interrupted it because it was taking really long, but here you can see the problem: lots of columns, all on different scales. We don't want to combine all the machines like this, so let's come back to ChatGPT and ask it to change this. Let's update the prompt and see what it can come up with. I always try to be pretty verbose within my prompts, and you don't really have to be correct from a grammatical point of view, which is really nice, but sometimes it just helps to really explain what you're trying to do.
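Before refining the prompt, a minimal sketch of the kind of grouped plotting function this produces for the ambient columns; the "AmbientConditions" prefix is an assumption based on the dataset's dotted column naming, and the function ChatGPT actually generated may differ.

```python
import matplotlib.pyplot as plt


def plot_columns(df, columns, title):
    """Plot a group of related time series columns as lines in one figure."""
    fig, ax = plt.subplots(figsize=(14, 5))
    for col in columns:
        ax.plot(df.index, df[col], label=col)
    ax.set_title(title)
    ax.set_xlabel("Time")
    ax.legend()
    plt.show()


# Ambient humidity and temperature share roughly the same scale,
# so they work reasonably well together in a single plot
ambient_columns = [c for c in df.columns if c.startswith("AmbientConditions")]
plot_columns(df, ambient_columns, "Ambient conditions")
```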
So I say: don't group all the machine data, but split it up so that you group, and then I give an example, raw material property one from machines one, two, and three, and so on, and do this for all the machine data, creating groups of three columns. Let's see what's going on here: a function to extract properties from the machine columns, which is pretty interesting. We have this new function and it looks pretty promising, in the sense that it goes through all the machine columns, splits them on the dot, and then creates a set. Let's actually see what's going on: we have the machine properties, and now we can see, for all of the columns within all of the machines, the unique ones that we have, so the exit zone temperature, the material properties, and so on; this is basically everything. That is looking really good. And did it actually change the plotting function? Yes, it did: there is a new, updated function used specifically to plot the machine columns, so for this we don't use plot_columns but plot_machine_columns. Let's store that and see what happens.

And there they are: first up we have a temperature for machines one, two, and three, and this is exactly what I had in mind. This is already so amazing; look at the title, these are the machine properties, and now we're looking at the zone 2 temperature, then a pressure. Every time I use this it's unbelievable; look at what we have accomplished in so little time. And I know, because I've been working a lot with process industry companies, that plots like this are really valuable; this could even be a whole project on its own, just creating plots like this from the raw data, because now an operator can really look at how the machines are operating and the little differences between them. It's just really, really amazing, and we can already see some interesting stuff going on.

Alright, we're done, and we have some beautiful plots, literally like I imagined: all the units and scales match up, so now we can look at all of the data and get a brief sense of what's going on. Let me also quickly run the other two: the first-stage combiner operations and the outputs. The outputs will probably also cause some trouble, because we are dealing with 15 values and I can imagine they measure various different parameters that are all on different scales, but let's look at what it comes up with. Again, this is really amazing: we created a simple prompt, it generated the code, and then with two extra sentences to make it more specific, it worked. Coming back to how you work with these tools effectively as a data scientist: it's first of all really knowing what you want, understanding the underlying data, and then creating really specific prompts for that. That is a skill on its own, and it's really the main message I want to convey and teach in this video: how you interact with these tools.
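The property-based grouping ChatGPT came up with boils down to stripping the "MachineX." prefix and plotting the same property for all three machines together. A sketch, reusing the plot_columns helper from the previous snippet and assuming the dataset's dotted column naming:

```python
def get_machine_properties(df):
    """Collect the unique property names shared by Machine1, Machine2 and Machine3."""
    properties = set()
    for col in df.columns:
        if col.startswith("Machine"):
            # 'Machine1.RawMaterial.Property1.U.Actual' -> 'RawMaterial.Property1.U.Actual'
            properties.add(col.split(".", 1)[1])
    return sorted(properties)


def plot_machine_columns(df):
    """One figure per property, with machines 1-3 as three comparable lines."""
    for prop in get_machine_properties(df):
        columns = [f"Machine{i}.{prop}" for i in (1, 2, 3) if f"Machine{i}.{prop}" in df.columns]
        plot_columns(df, columns, prop)


plot_machine_columns(df)
```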
I thought the only way to really teach this is to show you how I actually do it in the moment and be really creative with it; it just takes practice. So let's see. Okay, the combiner operation looks really good, that is a temperature, and here you can see that this one is quite messy. For the outputs we again have all kinds of different data, and for the sake of convenience, since we are probably going to look at the outputs individually anyway (we potentially have to create a model for each of them), I'm just going to ask ChatGPT to split those up into separate plots. So I simply ask it to adjust the code to create separate plots for all the stage output columns, and I add: provide me with just the new code, because GPT-4 also tends to be pretty verbose, in the sense that it could output everything again and simply adjust a few lines, but still show you all of the code; since that can take some time, I just ask for the new code. Here is the new function to plot the individual columns, so we add it to our visualization script, save it, and see what happens. Now, as you can see, we get a different plot for each of the stage 1 outputs, and here you can really start to see that we have a lot of noise and potential outliers in this data, which is something you typically see when working with process industry data. But look at how beautiful this is: we have a nice little script, visualize.py, with four functions and then five lines to call those functions and create plots for the whole system. Again, this could be a project on its own; I've done plenty of freelance data projects where the goal was just to visualize the data properly, and now we've done it literally in a matter of minutes.

Next, the goal is of course to create prediction models for these stage 1 outputs, and I can already tell that we have to do some cleaning first and pick a column to start with. Most of them appear to be pretty flat, but that is mainly due to the outliers; once we get rid of the outliers and zoom in on the y-axis range, we can probably see some patterns. Basically I am scouting right now for a parameter to look at first, and why not just start with the first one, so I'm going to look at Measurement0. Let's come over to another file where we build the features, build_features.py. As far as I can tell, when I look at the other parameters, for example the machine properties, the data seems to be all right in the sense that there are not many outliers. This one is kind of messy, a raw material feeder parameter, but it could also just be how the process operates, and this one looks pretty alright; we have some spikes, but nothing really crazy, which again could just be how the process actually works. So for now let's focus on a method to clean up this output data, starting with Measurement0. In the new file, build_features.py, I read in the data frame, this is the measurement we're looking at, and I do a quick plot, just calling .plot(), to see the extreme values that we want to get rid of.
So let's ask ChatGPT to come up with a function that can help us with that. Here we go: create a function that can clean up the data, i.e. remove extreme values; the input should be a Series and it should return the cleaned Series. This is really important: figure out what you want as input. Do you want to clean the whole data frame at once, or do you want to be specific and go at it column by column? So a Series it is, that is what I've defined, and then I also explain that we are dealing with time series data where the value should not increase drastically. I know from experience that with time series data like this you can really tell by visually inspecting it that this bump doesn't make sense, because you see a stable line and then all of a sudden a big jump. But it is important to state specifically that it is time series data, because if you use other, more traditional ways of detecting outliers, a sudden bump within a signal could still fall within the normal range of the data. For example, if the data were going up and down with big increases, following a sine or cosine signal, then a value of around 20 or 21 wouldn't look that extreme when you consider the data as a whole; it's the sudden increase that matters. So I'm really curious to see how ChatGPT deals with that.

It's creating a nice function: we have a window size and a number of standard deviations, so it's doing rolling outlier detection, basically, with a window size of 10 seconds in this case. That's good, so let's define the function and have a quick look at what it does: clean a time series by removing extreme values, filling in missing values using linear interpolation. That last part is something I added; coming back to my request, I know the signal is pretty stable, so after removing an outlier you can fill up the missing values by linear interpolation, which basically means that we just want to get rid of these values and let the line continue straight, and then it is up to us to figure out if that is correct. What's actually quite funny: I didn't mention the column name, but its example already uses the first sensor value, Measurement0; sometimes this stuff is so weird, because that is exactly what I'm doing. We can input the whole Series, which is nothing more than a bunch of numbers with a timestamp, and we clean it using the moving average and the moving standard deviation, identify the outliers using three standard deviations, and then do the interpolation. Now I guess the trick is to see whether a window size of 10 seconds is enough; we can't really tell from the plot how long this period lasts, but let's have a look. We define this, run it, plot the cleaned series, and you can see that it hasn't been sufficient in getting rid of all of the bad data. What we can then do is either increase or decrease the window size.
Let's see, for example, what happens if we use 5: we still have the bad data in there. Now let's bump it up to something like 25, and maybe we have to go a lot higher. Okay, so with a window size of 100 seconds, almost two minutes, you can see that we got rid of the data over here, but we still have these values over there, so let's bump it up really high and see what's going on. You can see that somewhere in the beginning the value just drops to zero, and we can simply get rid of that as well by saying: if there is a sudden zero value, which from a process or sensor perspective usually just means an error, drop it. So we can add that to the function. This is a really straightforward one, and we could probably also have done it with GitHub Copilot; you can see that we can set the zero values to np.nan as well. Coming back to the function: we define the outliers, set them to NaN, and then we also want the clean series, so, as it states, that is the clean series, and we put the zero handling in there as well, using the Copilot comment "replace zero values with NaN values". Now if we run this and have a look, we can probably decrease the window size again, and it might even get rid of everything all at once. Okay, so now we have a nice signal, and I'm actually quite curious what happens if we just do it like this, because it seems from the data that, apart from that one outlier going up, most of the outliers going down were actually zero values. We got rid of those with the zero handling, and now we have to use the window size to get rid of the peak around 22.
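Putting the pieces from this cleaning discussion together, a sketch of what such a function could look like, assuming the rolling mean/std approach described above; the target column name follows the Kaggle dataset's naming, and the window size and threshold are the values settled on in the video.

```python
import numpy as np
import pandas as pd


def clean_series(series: pd.Series, window_size: int = 100, num_std: float = 3.0) -> pd.Series:
    """Remove extreme values from a stable time series signal.

    Flags points that deviate more than `num_std` rolling standard deviations
    from the rolling mean, treats exact zeros as sensor errors, and fills the
    resulting gaps with linear interpolation.
    """
    series = series.copy().astype(float)

    # Zero readings are treated as sensor errors for this signal
    series[series == 0] = np.nan

    rolling_mean = series.rolling(window=window_size, min_periods=1).mean()
    rolling_std = series.rolling(window=window_size, min_periods=1).std()

    # Anything outside mean +/- num_std * std counts as an outlier
    outliers = (series - rolling_mean).abs() > num_std * rolling_std
    series[outliers] = np.nan

    # The signal is stable, so linear interpolation is a reasonable fill
    return series.interpolate(method="linear")


target = "Stage1.Output.Measurement0.U.Actual"  # column name as in the Kaggle data
df[target] = clean_series(df[target], window_size=100)
```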
Let's experiment with values to get rid of that; once again, I think it was at 100 that we got rid of it, so let's make sure, and leave it at that for now. This now seems like a pretty clean signal that we can work with, so let's store it in the data frame itself, overriding the original column, and now we can have a look at the data again and we're good to go. Continuing, for this first part we are going to get rid of everything other than Measurement0 for the outputs, so we take the data frame up until column 42, because that is the Measurement0 value, and store that. The reason I do this is to make sure we only have the predictor columns that we are going to use, plus the target variable.

Now that the data is clean and we don't have any missing values, we are going to ask ChatGPT for another function that can add additional features. Again, I copy everything, put it in, and explain what I want to do. So, another prompt: I want to create a prediction model for Stage1 Output Measurement0 Actual; provide me with a function that engineers features that could potentially improve the predictive performance, based on all the other columns. The funny thing is you don't have to be grammatically correct here; it doesn't really matter. Then, and this is really interesting, I ask it to figure out the strategy that makes sense given the column names, and to provide me with one function to add all the features to the data frame. This again comes back to being really specific about what you want: I want a function, I want to input the data frame, and I want to get back a data frame with all the features. It knows what the target is, so don't just tell it "create a function for feature engineering"; be specific. Let's see what we get.

This is looking interesting: it's creating lags and rolling-window statistics. It creates a copy, creates lags for specified columns, rolling window statistics for specified columns, rolling means, and so on; so it's doing some basic time series feature engineering with rolling mean, standard deviation, min, and max, plus lag features with a window size. We have to define which columns we want to use, and it says you can customize this list to include only specific columns. Just for the sake of demonstration, let's see how this works if we use all of the columns, except that we don't want to use the target column. If I'm correct, it loops over the columns in the lag features, and if I pass everything in, that will be all columns, so what we can do, probably with GitHub Copilot's help, is say: skip if the column is the target. We put in an if statement with a continue, and I slightly restructure the loop, indenting the body, so that only if the column is not the target does it go into the loop. That means we can pass in everything and it will just skip the target. The lag features use a window size of 10 seconds, which is pretty short, so I'm going to bump that up to a minute, and then we're going to engineer everything.
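A sketch of the kind of feature engineering function described here: lags 1..60 for every predictor plus rolling statistics over a one-minute window, skipping the target. The generated code in the video may differ in its details.

```python
import pandas as pd


def engineer_features(df: pd.DataFrame,
                      target: str = "Stage1.Output.Measurement0.U.Actual",
                      window_size: int = 60,
                      max_lag: int = 60) -> pd.DataFrame:
    """Add lag and rolling-window features for every column except the target."""
    df = df.copy()
    new_features = {}

    for col in df.columns:
        if col == target:
            continue  # never derive features from the variable we want to predict

        for lag in range(1, max_lag + 1):
            new_features[f"{col}_lag_{lag}"] = df[col].shift(lag)

        rolling = df[col].rolling(window=window_size)
        new_features[f"{col}_rolling_mean_{window_size}"] = rolling.mean()
        new_features[f"{col}_rolling_std_{window_size}"] = rolling.std()
        new_features[f"{col}_rolling_min_{window_size}"] = rolling.min()
        new_features[f"{col}_rolling_max_{window_size}"] = rolling.max()

    # Lags and rolling windows introduce NaNs at the start of the series
    df = pd.concat([df, pd.DataFrame(new_features, index=df.index)], axis=1)
    return df.dropna()


df_engineered = engineer_features(df)
```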
Alright, and we're done, so what do we have? We now have a lot of columns, actually way too many, probably. Ah, I see what it's doing: we defined the window size and updated it to 60, and as you can see it's looping over a range, so it's not just using 60, it's looping from 1 up to 60, creating all of those lags, and then for all of those lags it's creating all of these statistics, which is why we have an enormous number of columns right now. But we can use another method to figure out which columns are actually the best, so this is not even that bad; let's just focus on feature selection. Let's quickly export this data frame as data_engineered, sure, that's a proper name for now, so we have the full data frame with all of the features in there, and then we can close it, come back to the features folder, and create a new file, feature_selection.py. We're good to go again, so now let's ask ChatGPT to create a feature selection function for us. The prompt: we now have a lot of features; create a function that takes the whole data frame and selects the 10 best features with regard to the target (and again I repeat what the target is); return only the best features. What is it going to use? It's suggesting SelectKBest from the scikit-learn library, okay, this is going to be interesting. And we're done; we have a nice little function. Let's come over here, we already have pandas, switch this up, make sure we can import everything, and it says: you can use this function to get the 10 best features for your data frame, just pass your data frame and the target column. This is so awesome: we can put in our whole data frame, which has all the features and also our target. We have k, which is an int set to 10, so we can increase or decrease the number of features we want back, and it takes X and y. Let's do it; I'm also quite curious how long this will take. And it's finished, so it works, and we now have a data frame: if we look at the best-features data frame, it returns only the best features as a selection of the data frame, and if I look at the columns you can see that it's actually all of the rolling statistics that we created with our feature engineering function.

Just to make a correction, because I only figured this out afterwards: the feature engineering function is creating lags from 1 up to 60 for all of the specified columns, but it's creating the rolling statistics only for the window size, not for all of the lags in between; I noticed it's just the rolling min, rolling max, and so on for those values. You can also see how it's using data from machines one, two, and three, which is really nice, and also the output measurement, and since it's doing that, I also notice a problem: we're cheating a little, because we are feeding the target variable into the prediction as well. You can see how we specified that it should skip the target for the lags, but we did not do that for the rolling statistics, so let's quickly add that line. We should now run everything again. A quick check: if we load the data frame with all of the features and filter where the name contains "stage1 output measurement0", we only get the target variable itself.
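A sketch of the SelectKBest-based helper described above; f_regression as the score function is an assumption, since the transcript only mentions SelectKBest from scikit-learn, and the df_engineered name carries over from the earlier sketch.

```python
from sklearn.feature_selection import SelectKBest, f_regression


def select_best_features(df, target, k=10):
    """Return a DataFrame with only the k columns most related to the target."""
    X = df.drop(columns=[target])
    y = df[target]

    selector = SelectKBest(score_func=f_regression, k=k)
    selector.fit(X, y)

    best_columns = X.columns[selector.get_support()]
    return df[best_columns]


best_features_df = select_best_features(df_engineered, "Stage1.Output.Measurement0.U.Actual", k=10)
```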
Alright, so that was a quick fix I had to do, and now I have to run this again. And we're done again: we have the new 10 best-performing columns, and we can again see a mix of machines one, two, and three, and now we also have the motor amperage in there, but again it's all the rolling statistics. It's pretty interesting to see that the features we added are actually the ones with the most predictive power with regard to Measurement0.

Now we can actually start using this subset of the data frame to create a model, so let's first export it again. I'll call it best_features_df and export it to the processed folder; in the processed folder I like to put datasets that are ready for modeling. Then we come over to the models folder and train_model.py, do the data science imports again, and read the data back in. Boom, now we have our data frame with all of the best features, and I forgot, I also have to add the target variable back. This was also pretty interesting: I wrote a comment saying "add the target variable", with the column name, to the data frame, and Copilot suggested using pd.concat, which takes a list with the data frame and another data frame or a Series, and you specify axis=1. Those are the really tricky situations where you'd normally have to remember the syntax, and GitHub Copilot can really help you with that.

So with our new data frame of 11 parameters in total, 10 input variables and one output variable, we are going to ask ChatGPT to set up a modeling framework for us to play around with and test different models. One thing we have to keep in mind here is that we have not shifted the data: you have to imagine that this is a process where chemicals and flows are moving through the machines, and what happens at machine 1 at a given moment doesn't directly affect the output at the other end at that same moment. There could be a delay, and depending on the kind of process this could be a matter of seconds, minutes, or even hours before what happens at the start influences what comes out at the end. But for now let's assume that everything is lined up correctly and see what we can come up with. Again, let's copy this whole data frame and ask ChatGPT to set this up. So: this is the final data frame; can you create a function that experiments with different models to predict Measurement0; pick a selection of models that make the most sense in this case given the data and the column names. I'm pretty curious to see what it comes up with, and we can already see linear regression and random forest, so it's using some tree-based methods, which will be interesting. Then: compare the models using the mean absolute error and the R² score, and visualize the results. Please note this is a time series problem, so keep that in mind while creating the train/test split. This is something I know from experience: if you use a random split on a time series problem, it will shuffle everything, whereas with a time series problem you want to keep the data chronological, so your train split should come before your test split. That's why I added that as well.
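To make those two practical points concrete, a small sketch: re-attaching the target with pd.concat (the Copilot suggestion mentioned above) and a chronological 80/20 split. The variable names are just the ones used in the earlier sketches, and the target column name follows the Kaggle data.

```python
import pandas as pd

target = "Stage1.Output.Measurement0.U.Actual"

# Re-attach the target column to the 10 selected features before modelling
df_model = pd.concat([best_features_df, df_engineered[target]], axis=1)

# Chronological split: no shuffling, the training data strictly precedes the test data
split = int(len(df_model) * 0.8)
train, test = df_model.iloc[:split], df_model.iloc[split:]
```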
Let's see what we got. It is coding away while I am talking to you and explaining my thought process; ChatGPT is just coding along, and this is also what I really do within my work nowadays: I come up with something, I ask a query, and then I go do other things that I have to do, or continue working on some code, and once it's finished I copy and paste it. Like I said, my workflow has completely changed over the past few weeks now that this is available, because look at all the code it is providing; this would literally take me well over half an hour to set up properly and test, and with the plots maybe even longer. Let's come back: we have the function and all the imports, so let's move those up, and we have to install XGBoost, so a quick pip install xgboost. I also know that the LightGBM model sometimes acts a bit weird on Mac, but let's see; okay, we're all good. If you're on a Mac, you should install LightGBM using conda, because installing it with pip can introduce some errors. So: data frame, all the dependencies, let's store this function and quickly check it. It's using a test size of 20%, and for the train/test split it's using the index, so not a random split or the standard train_test_split but simply slicing on the index, which is what we want. The target column is right, then we have all the models with default settings, a results dictionary, training and evaluation, and it also visualizes the results; this is pretty exciting. What I really like is that it also shows you how to run it, so we can put that in: it takes the data frame and the target, so this is also a function you can reuse elsewhere; it's already set up dynamically.

So let's see how this works. The first one is not looking too good; okay, results... I probably messed that up, but we have the predictions, and like I said, they're not looking too good, but hey, this is our initial result. These results are definitely not great: if we look at the mean absolute error, we can see that the simple linear regression model actually performs best in this case, and you can see how in the beginning it kind of follows the trend. You do have to keep in mind that we are really zoomed in on the y-axis, so this is actually pretty close, but then something clearly happens in the process that is not captured in the data, because the model isn't aware of it. So what I tried is: I came back to my feature selection script and increased the 10 features to 50.
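For reference, a condensed sketch of the kind of comparison function described above, with the chronological index-based split, default model settings, MAE/R² scoring, and a prediction plot per model. The model list follows what is named in the video; everything else (parameters, plot layout, variable names) is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor


def compare_models(df, target, test_size=0.2):
    """Train several regressors on a chronological split and compare them."""
    X = df.drop(columns=[target])
    y = df[target]

    # Chronological split on the index: train on the first 80%, test on the last 20%
    split = int(len(df) * (1 - test_size))
    X_train, X_test = X.iloc[:split], X.iloc[split:]
    y_train, y_test = y.iloc[:split], y.iloc[split:]

    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(random_state=42),
        "XGBoost": XGBRegressor(random_state=42),
        "LightGBM": LGBMRegressor(random_state=42),
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        results[name] = {
            "MAE": mean_absolute_error(y_test, predictions),
            "R2": r2_score(y_test, predictions),
        }

        # Plot predictions against the actual test values
        fig, ax = plt.subplots(figsize=(14, 4))
        ax.plot(y_test.index, y_test.values, label="Actual")
        ax.plot(y_test.index, predictions, label=name)
        ax.set_title(f"{name} | MAE: {results[name]['MAE']:.3f}")
        ax.legend()
        plt.show()

    return results


results = compare_models(df_model, "Stage1.Output.Measurement0.U.Actual")
```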
So now we're using the 50 best-performing features, and you can see that a lot of lags are now included as well. Let's see if we can capture the data better and run this again. Okay, that is not looking better; this is not really promising. Again, the models weren't really able to capture what is going on here that suddenly makes Measurement0 drop.

A quick little addition: I added all the basic features, so all the original columns as well, because I am looking for the event that makes the target variable drop all of a sudden, and it just doesn't seem to be present in the engineered data alone. But by combining all of the basic features with the best-performing engineered features, we do get much better predictions. If I come over to the results, the linear regression is doing something here that messes it up completely, I don't know what, but you can see that the mean absolute error for random forest, XGBoost, and LightGBM is starting to get smaller. We are still definitely not there, we have really weird, highly negative R² scores, but you can see how it is somewhat starting to follow the data, and I must say it is a pretty noisy signal, so it's probably going to be hard to predict, but we are getting closer. I had another look at the data and noticed that a lot of the values are very different in scale: we have columns ranging in the order of tens, hundreds, and even thousands. So I think what could also help, probably not drastically, but it could improve some of the performance, is to add a scaler. Let's make another request, a simple one: add a scaler to the time series model comparison function, and we'll just use the StandardScaler from sklearn.preprocessing.

While we are waiting for the output from ChatGPT, I think it's also a good time to look at GitHub Copilot Labs and see what we can do with that. I have gotten mixed results while using it, especially with the custom brush, which seems to me the most interesting one: it's basically similar to ChatGPT in that you can ask anything and it will directly update your code, also considering the context of your code, which is really neat. So I want to scale this data, and we have the function over here, so I select it, and if you haven't seen my last video on what you can do with it: you can say, for example, "make this readable". This is probably already pretty readable because it was generated by GPT, but let's see what it comes up with. Now it has made the code a little more readable, and we can also ask it to make the code robust, or to document it, so now it's adding some comments, although in theory it should put them into a docstring to make it nicer; it's not perfect yet, it should have recognized that this is a function and used a docstring. But let's try the same request using the custom brush. How that works is you select the code, choose custom, and then I say: add a StandardScaler in here to scale the data prior to training; only scale the X. What sometimes happens is that it replaces all of your code with just the new prompt, which is pretty weird, so let me actually first copy this code, because otherwise it's pretty annoying to get back.
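For reference, the change being requested boils down to something like this inside the comparison function: fit the scaler on the training data only and apply the same transformation to both splits, scaling only X and not y. This is a sketch of the intent of the prompt, not the exact code either tool produced.

```python
from sklearn.preprocessing import StandardScaler

# Inside compare_models, after the chronological split and before fitting the models
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit on the training data only
X_test = scaler.transform(X_test)         # reuse the same scaling for the test data
```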
Now let's come over here and say: add the StandardScaler, and let's see what it's doing, whether it can add it into this code. Okay, it added the scaler, and it added a Boolean; that is cool: if scale is true, we initialize the StandardScaler and scale the data, and we only do it on the X variables. That is awesome, I think that worked. Now let's see what ChatGPT came up with: it basically did the same thing, it transformed these variables, but Copilot Labs added the nice touch of including it as a Boolean, yes or no, and we can set it to True by default so we don't have to update the call to the function. So this is pretty interesting; let's see if this still runs. Okay, linear regression still sucks, and this is now loading, but this was actually a really good example of how you can use GitHub Copilot Labs and the custom brush, because this is what it's designed for. Like I said, it's not bulletproof yet, it's not there yet, but it is working really well in some scenarios, and I really like how you can interact with your current code and just ask it to insert or change something, and it will do that.

Okay, the new results are in, and like I said, it did not affect the final results that much; we still have a very bad model here. But this also clearly illustrates why AI won't be replacing your job as a data professional anytime soon, because you need that domain-specific knowledge, and this is a pretty hard setup right now, since we are using data without much context: we don't really know what Measurement0 is, and we don't really know what happened here, at least from the quick look at the data that I've had. I'm going to leave it at this for now because this video is getting pretty long. I will probably continue with this modeling problem and create future videos about it, because we still have 14 or so other measurements to look into, and I really want to explore the possibilities of ChatGPT and GitHub Copilot Labs further with regard to tackling a complete project like this. I could have just taken the Titanic dataset or another dataset that I'm already familiar with, where I know the model will end up with good accuracy, but that's not really representative of what it's like to work with data in the real world. This is actually far more common: you have no idea what's going on, you get a dataset, you don't know how the variables are correlated, and you really don't know whether there is any predictive power in this data at all; I don't know whether Measurement0 can actually be predicted from the data we have. So I think this has been a really interesting approach to show you my AI-assisted workflow. I don't see a lot of people using and explaining it like this on YouTube; most advice is far more generic, but this should give you a complete overview of how I currently tackle my day-to-day projects. If you've been following along, I would really appreciate it if you like this video and subscribe to the channel so you don't miss future videos, and if you want to learn even more about how I use ChatGPT to tackle data science projects, then definitely check out this video next, where I use another service to collect a web dataset; you can use that as well to basically do data science portfolio projects within a day or so. That's really awesome, go check it out.
Info
Channel: Dave Ebbelaar
Views: 22,599
Keywords: Data Science, Machine Learning, Python, gpt4, gpt-4, chatgpt, github copilot, vs code, github copilot x, ai data science, artificial intelligence, data scientist, microsoft ai, data analytics, data engineer, data science project, machine learning project, chat gpt, artificial intelligence and machine learning, artificial intelligence today, data scientist day in the life, future of data science
Id: th4j9JxWGko
Length: 62min 53sec (3773 seconds)
Published: Thu Apr 06 2023