Code Interpreter in ChatGPT - A Comprehensive First Look

Captions
I'm going to show you how you can use Code Interpreter in ChatGPT to do some cluster analysis. This is my first time trying this. We're going to go through all of the steps together, and let's see if it actually works.

So the first thing that I'm going to do is go into my account. I do have the ChatGPT Plus account, and if you do and you're paying for that, what you can do is go down to where you see your name in the bottom left corner, click on the three little dots, and go to Settings. In there you'll see Beta features. This is first of all an opportunity to get some different plugins, but here is the core one, Code Interpreter. So let's give that a shot. Go ahead and select it.

Now, to get it to work you need to go into GPT-4. You see how I select that, and right now it's just set to the normal GPT-4 version. But if I select Code Interpreter, you notice what happens at the bottom here: you get this plus sign, which gives you the ability to add a file. So I'm going to do that.

The data set that I have, see here, I'm going to bring it over, it's right here. This is data that we're using for a project at Belmont University where we are looking at hypertension data. What you're going to see in this data is that it is community-level data at the ZIP code level. As you see here, it's all these different factors. That's the American Community Survey data from the Census as well as the CDC PLACES data, plus we went and found some stuff that, as you can see, is a little bit sparse. Here's your CDC data, and then we added a couple of other things.

I don't want to clean it, because I want to see what happens with it. It does have things, like outlying and the Knoxville one, that normally I would turn into dummy variables, and we'll see if it gives us that opportunity. So this is the data set. I'm going to go in here, hit the plus sign, and my data
set is right there. Perfect. Go ahead and select. Okay, it brings it in. Now I'm going to just say what I want it to do: create five groups, because I want five, using k-means clustering. That's all I'm going to say. See what happens. Great if this works.

Okay: "Great, you've loaded a file." What's interesting is what it's showing me. It imported pandas, which is how I would have done that. It's starting to read it in, you see that here. It's even telling me I have 103 columns, which is correct, and it's showing me some of those. And it says, you know, are you sure you want to use all of those? The answer is no, I don't.

So what you'll see is I've got another file over here where I've already selected the ones that I want to keep. See here, it is a total of 74 of those. I'm going to go ahead and copy those. It also asks, are there any specific pre-processing steps? So I'm going to say, "Use the following columns," and I'm going to say "standardize." Since it does mention it, I'd like to standardize the data. Now, I'm not going to go through any data science concepts; I'm just going to do this stuff right now. Go ahead and add my columns, and that's all 74 columns right there. See what happens. I'm seeing what happens along with you all at the exact same time.

Okay, let's see what happens when it's showing its work. You can see how it's selecting the columns. What it's basically doing is creating a variable, and that variable is going to have 74 different quoted column names in there. That's what it's doing right now. This is so cool. So cool, or is it actually scary? That's the part that we need to figure out here.

All right, it's going through it. It's got my food assist there. There are my CDC ones: binge, high blood pressure, do they take the medicine, cigarettes. These are percentages, so percentage of people that smoke, things like that. So what it's seeing now is there are things
where they're seeing some... let's see. "Now let's standardize the data." Okay, thank you. This will transform the data so that the distribution will be between 0 and 1 [more precisely, StandardScaler rescales each column to mean 0 and standard deviation 1], which is what I would do. That's probably StandardScaler. Let's go up, and it says it standardizes using StandardScaler. There you go, I was correct. Ha ha, I might actually know what I'm doing here. I, by the way, am somebody that does use Python to do my data science work, so I should be pretty familiar with this.

"The data has been standardized. Now let's proceed." We will create the five groups. Now, what it did not do was let me know; I'm assuming that it filled in the missing values, probably with a median. I'd have to go and see that. And there's how you create it. Here are your clusters. Look at this: number of clusters, five. It even used the random seed that I would have probably used, 42. Why 42? We all know why you use the number 42. I'll just leave it at that. It fit the model, and it then went through and created the model. See how I've got cluster two, one, two, one, and so on? "Each row in the data set now belongs to one of these. You can analyze these clusters to identify patterns and common characteristics among the data points." Wow. Okay, so it goes zero to four, which is normal. It always starts with zero and then goes to four.

"Let's analyze the results," and I have bad grammatical errors in doing so. "To analyze results we can compute the mean of each feature." Okay, and it will give us a centroid. Let's see if this works. "Cluster 1 has relatively low median..." All right, this is ridiculous. It's literally giving me a description of each one of these. Are you kidding me? "Remember, these are general observations. Would you like to proceed to further analysis?" No, because it's literally making me sick at this point. Yes! Oh my gosh, do I sound like I'm a little kid playing around with this stuff? This is ridiculous. "Cluster size: we can count..." Yeah, I want to see that. "Feature importance: we could try to determine which
features are the most important." Wow. All right, I want to start with cluster size and see what that is going to do. I'm assuming that's going to list out the four, I'm sorry, the five, 0 to 4, and it should tell me the counts. Okay, there you go. That's the number of ZIP codes that I had, and it shows how many are in each one. Excellent.

What else can I do? Cluster profiles, let's go ahead and do that. "Would you like to proceed with any others?" Yes, cluster profiles. That should tell me, I'm assuming, means, that type of stuff for each one, maybe some descriptives. It's going ahead and creating the centroids. Centroids are basically, it's like if you had this big thing of dots, it's trying to find that middle point to do its best to give you an approximation. As you can see, if it says "Stop generating" and you see these three little dots moving like this, that's how you know it's still working. I'm going to be quiet, because I might cut and then come back, just in case this is something that takes long.

Okay, that was about 20 seconds, I would say, maybe a little bit longer, and it looks like it timed out. That's something that'll happen when you have 74 different features. If I had like three or four, it probably would have come out okay, but creating centroids with 74 features is a monumental task. So I'm just going to say "create visualizations," and let's see what comes up. All right, we're testing it again. I have no idea if this is going to work.

It says one common approach to visualization is dimension reduction techniques, and it says principal component analysis. If you think about cluster analysis, cluster analysis reduces the number of observations, or rows, down into different groups. What PCA, or dimension reduction, does is take the 74 columns and reduce those down. So, would I like to do PCA? Yes: "use PCA and 2D," oops, "2D." Let's go. All right, I feel like I'm watching a football game and seeing what the next play is. All right, it's not that exciting, but it is to me. I'm telling you, this is
so cool. It's finished working. Let's see what that is. All right, that is way too freaking cool. So basically here's what it did: it created two different, uh, clusters within that, and then it merged those into those different sets. So then what I could do is see which of the columns fit into PCA1 and PCA2 to be able to create this visualization. In the world of data science this is cool stuff. This is cool.

"Would you like to proceed with any others?" What I really want is to see the rows of data and the cluster number. Let's just see what this does. I'm assuming this is going to show me code. And here we go: "Here are a few of the rows." That's good enough, that's all I asked, and you can see it's showing them there. Oh, it's just showing me a few of the rows and columns. It's actually going through it, and when you are cutting it, it's going to keep going for a while here. So basically what it did, I've got a feeling, is a .head(). Yep, there you go. So .head() means it'll give you the top five rows in your data, and you can see the two, the one, the two, the one, and they're different ones. It shows you some of the different features for each one of those.

Now the only thing I don't know how to do is download this data. How do I download a CSV file? Maybe I don't, I don't know. Oh, it gives me that. Look, very nice. "Each row represents..." let's see what it's saying. "I have saved the data set with the added cluster com..." All right, seriously? Seriously? It created the file for me. All right, is this the part where the old guy starts crying? This is so amazing. It allows me to do that.

All right, we're just going to throw it on the desktop. Let's go ahead and open it, bring it over, and there's median age, which was the first one I asked for, and there are the clusters. Now, just to do it, let's go and do a quick little pivot table with cluster. There we go. Let's go look at median age, and we'll make that an average. How about household size, household income. There we go. Go ahead and make
that an average. I want to see, since I've been using this mostly for hypertension, if anything came out. BP high, go ahead and look at average. Let's even go to diabetes, and then there's lack of physical activity. Let's go ahead and get rid of all of those decimal places. And average BP... oh, you know what I'm missing here? Let's go ahead, and I'm just going to add this one back in, and that'll be our count. There we go, move it. So now we have a count.

Again, I'm not trying to be fancy here, but my biggest one is here: I've got average blood pressure of 42 percent. I've got one down here at 28, and that's only nine of my ZIP codes. I've got two here at 51; those look like middle of the road. We've got one at 46, 26 different ones. That is considered high, folks, for sure, if it's that high. Average income, you see you've got one at 84,000 and another one at 64. Here you go, you've got low income. Median age isn't anything particular. But interestingly, what do you see when you see low income? You see high blood pressure, high diabetes, high lack of physical activity. Very, very similar to a lot of the stuff that I see.

So let's talk about this here for a second. First and foremost, this was amazing. This, literally, if I was to go in and look at the code, is pretty similar to what I would have done. Again, I would go way more in depth here. Here's where I differ from this: I would do more with feature engineering. I would go in and make more features, either based on ratios or by taking some of the categorical variables. So, for example, I had basically urban and rural, and I had that as a label, and I could have made that a dummy variable: make one called Rural and make it just a one or a zero. I also had the city, and I think the city does have maybe a little bit, and the city as in the metropolitan region. So in Tennessee you've got Knoxville, Chattanooga, Clarksville, Memphis, and then Nashville, and so we could have seen some different things there. That's a couple
of the things. I probably would have used StandardScaler. I would have also gone in and done some analysis using inertia models to determine the number of clusters that I would have used; maybe it would be five, maybe four. I also probably would have used hierarchical clustering, which by the way I could have done. I was just not sure if that would have worked, and gosh, I might do that just to see what happens. That could have been a way to identify how many I could have had as well.

The analysis part, I think I have to keep playing with that. I think I'm going to do things, especially with 74 dimensions, where it would probably keep timing out. What I probably would have liked to have done, which now is probably a little late, is gone in and done feature performance, or feature importance. That would let me know which of those had the highest influence on what made the different ones, and just make sure that we don't have any overfitting in that way.

But overall I'm pretty impressed. It's scary if you really think about it. But what I see here is not the coder or the data scientist going away. Your skills have got to change. Your skills have got to be about how you take models and put them into action. You have to stop thinking, "How do I just code? I just created a model based on math only." No, now it's your opportunity to truly use data science in a real situation, in reality, that goes from the true dilemma to the insights and maybe even to action that is good for a business, good for an organization, or in my case, good for a community. So there's your Code Interpreter 101 for cluster analysis.
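The steps walked through in the captions above (select columns, standardize, run k-means with five clusters and seed 42, count cluster sizes, compute cluster profiles, reduce to two dimensions with PCA, export the labeled rows) can be sketched in Python roughly as follows. This is a minimal sketch and not the code ChatGPT generated in the video: the synthetic data, the four column names, and the median imputation are all assumptions made for illustration, since the real 74-column data set is not shown. The inertia loop at the end illustrates the kind of check the speaker mentions for choosing the number of clusters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ZIP-code data set (the real file is not available).
rng = np.random.default_rng(42)
columns = ["median_age", "household_income", "bp_high_pct", "diabetes_pct"]  # hypothetical names
df = pd.DataFrame(rng.normal(size=(200, len(columns))), columns=columns)
df.iloc[::25, 1] = np.nan  # inject a few missing values to mimic sparse columns

# Assumption: fill missing values with each column's median
# (the speaker only guesses that this is what happened).
features = df[columns].fillna(df[columns].median())

# StandardScaler rescales each column to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(features)

# Five clusters with random seed 42, as in the video; labels run from 0 to 4.
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(scaled)

# Cluster sizes, and cluster profiles as per-cluster feature means in original units.
cluster_sizes = df["cluster"].value_counts().sort_index()
cluster_profiles = df.groupby("cluster")[columns].mean()

# Reduce the features to two principal components for a 2D scatter plot.
components = PCA(n_components=2).fit_transform(scaled)
df["pca1"], df["pca2"] = components[:, 0], components[:, 1]

# Serialize the rows with their cluster labels as CSV text;
# pass a file path to to_csv() instead to write a file.
csv_text = df.to_csv(index=False)

# Inertia for several values of k, the basis of an "elbow" check.
inertias = {
    k: KMeans(n_clusters=k, random_state=42, n_init=10).fit(scaled).inertia_
    for k in range(2, 8)
}

print(cluster_sizes)
print(df.head())
```

Plotting `pca1` against `pca2` colored by `cluster` would give a 2D picture similar in kind to the one described in the captions. With many more columns the same calls apply unchanged; only the `columns` list grows.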
Info
Channel: Data 4 All Podcast
Views: 1,172
Id: s1Bi_B8kHR0
Length: 17min 0sec (1020 seconds)
Published: Wed Jul 12 2023