With ChatGPT, you can do machine learning on your data without knowing how to code

Captions

Hi, I'm Chad Skelton, and this is another video on how to use ChatGPT and the Noteable plugin for data analysis. If you do not know the basics of how to use ChatGPT and the Noteable plugin, please look at my very first video; I'll put a little link at the top here. It explains the basics of how you set it up. Essentially, you need to have the paid version of ChatGPT at the moment, ChatGPT Plus, then you install the Noteable notebook plugin and apply for a free Noteable account, and then you should be good to go. But check out that video if you need some basics on how to get set up.

One of the things that I found kind of neat about ChatGPT and its ability to integrate with Noteable notebooks is that it opens up a lot of tools that previously have not really been accessible to people like myself who don't have a hardcore coding background. Through years of being a data journalist and teaching data analysis and data visualization, I've been familiar with some of the tools that programmers use, like machine learning and fuzzy matching and natural language toolkits and all that kind of stuff. It all sounded pretty cool and could be quite helpful for my data, but I didn't have the confidence with Python to be able to access those tools and those libraries. One of the things that I've been playing around with recently in ChatGPT is seeing whether I can access some of those more powerful tools without needing to code.

I wanted to give you a neat example of that today, where it allows you to use some of the machine learning capabilities in Python, again without needing to know how to code. To illustrate this, I'm going to use a data set that I've used in a lot of my teaching and training, and that I worked on when I was doing a project at the Vancouver Sun on political donations in British Columbia. This is a data set on all the political contributors to political parties in
BC over the past several years. What I've done here is summarize it by the biggest overall donors, so we have a total list of about 61,000 donors. In the Elections BC data set itself, it classifies each of the donors, and I've included the three biggest types of classifications here, which are a union, a corporation, or an individual. Now in this case, I know this information; I don't need a computer to classify this for me. But it occurred to me this would be a good way to test out some of the machine learning capabilities of ChatGPT, because I could then check it against the real answer.

So out of this list of, as I say, about 60,000 individual donors, I took 100 union donors, 100 corporate donors, and 100 individual donors at random and put them into a document called donors matched, so that's a total of about 300 donors classified. Then I've got a list of all of the donors in a big unmatched document. Again, this is kind of an artificial exercise, but the idea is to see, in a situation where they weren't classified and I had thousands and thousands of donors, could I manually classify 100 corporate donors, 100 union donors, and 100 individual donors and get the computer to do the rest, and then see how good a job it does, because I can check it against the real answer. I know enough about machine learning to know that this is the kind of thing it can do well, but I wanted to see whether we could do that in ChatGPT.

In my Noteable notebook, I have uploaded these files, donors matched and donors unmatched, to my testing project, and then I'm just going to tell ChatGPT again, please set this project as my default project for Noteable. That was important, because that's where the files are that I need Noteable and ChatGPT to go find. My experience has been that once you set a default project, it'll actually be persistent
throughout various chat sessions until you change it again, but I figure it's just good practice to fall into.

Again, I feel that it's helpful, in thinking through what ChatGPT and Noteable can do, to play dumb a little bit. I don't know very much about it, but I know a little bit about machine learning; I know it exists. But I want to pretend that I'm someone who doesn't know that it exists and see if ChatGPT can help me out. So I'm going to ask a question like this: I've got some data on political donations in British Columbia, Canada, and I'm trying to classify the donors into one of three categories: union, corporation, or individual. I've manually matched 300 donations in a file called donors-matched.csv. Can you help me classify the rest of the donations in the donors-unmatched file? Let's see how it goes.

Okay, so ChatGPT gives me a plan. I should say here, one thing I've noticed about ChatGPT, and I'm sure some of you have too if you've played around with it a little bit, is that it's quite idiosyncratic. A lot of these things that I show you in the videos I test out first myself, and sometimes it goes one way and sometimes it goes a different way. So it doesn't always give me this kind of plan, even with an exactly similar prompt; sometimes it'll just launch right into the work itself. But in this case it's saying: we're going to load the data, pre-process the data, extract the features, train a model, evaluate the model, classify the unmatched data, and review and iterate; if the results aren't satisfactory, we need to go back. Before we start, could you please upload the two CSV files, donors-matched.csv and donors-unmatched.csv, to your Noteable project? I'll say I've done that.

As in all these videos, I'll edit out the boring parts where ChatGPT is thinking or working slowly, but a reminder that you can always look at these little boxes here
to get a sense of what requests ChatGPT is making to the Noteable API and what it's getting back in response. It says both files are present in the project, let's create a new notebook, let's load the data from the CSV files. It talks about there being an error because of the file path, fixes that, and shows me the first few rows of the matched donors and the first few rows of the unmatched donors. The matched donors data includes a class name column, which indicates the class of the donor; the unmatched donors data does not have this column, and our goal is to predict these classes. Before we proceed with the next steps, could you please provide more information about the columns in the data set? Specifically, what does each column represent, and are there any other details that might be relevant for the classification task? So I'll say the column contributor name is the name of the contributor, and the amount field is the total amount of money donated by that contributor.

It's coming up with some possibilities of what it could do. Name analysis: we could potentially extract features from the names themselves. Donation amount: we could look at the donation amount. External data: if there's additional data available that could be linked to the donors, that could be useful. So let's do some exploratory data analysis to better understand our data.

ChatGPT looks at the distribution of donation amounts, seeing whether some types of donors give more or less, but it says it's hard to make any definitive conclusions just from the plot. It shows some example names, so these are corporate donations, individuals, unions, and it starts to make some initial conclusions. From the examples, it seems the names of corporations often include Ltd or Inc, while the names of unions often include Union or Association. This could potentially be a useful feature for our classification model. Next, we'll need to pre-process our data and extract features. One common approach is to use TF-IDF, which transforms the text into a numerical representation that can be used by our model. We'll also include the donation amount as a feature. Let's proceed with this pre-processing and feature extraction. And it just stops here, and this always happens in ChatGPT, so I'm just going to say please continue.

It says it successfully pre-processed the data; it includes the TF-IDF values of the words. Again, I don't know what that means, but it seems to be a machine learning technique. It split the data into a training set and a test set: the training set is 241 samples, the test set is 61 samples. We'll next train a model on this data. It talks about trying a logistic regression and says that model didn't perform very well. It gives some statistics: it's only predicting the union class and is not making any predictions for the corporation or individual classes. This is likely because our features aren't providing enough information for the model to distinguish between the classes, so it might be challenging to build a model that can accurately classify the donors. However, we could try a few things to improve the model. Try a more complex model: a more complex model like a random forest or a gradient boosting machine might be able to capture more complex patterns in the data. Feature engineering: we could try to engineer more informative features from the data we have, such as features based on the length of the name or the presence of certain keywords. Collect more data. Text classification techniques: since the main feature is the name of the contributor, we could explore more advanced text classification techniques, including word embeddings. Would you like to proceed with any of these options, or do you have any other ideas?

Again, I don't know very much about machine learning, and I'm trying to pretend that I know even less, but let's say, let's try a more complex model, the first option they give us. So it says let's try that more complex model; we'll train this random forest model on our
data. It says the random forest model performs significantly better than the logistic regression model and gives me a bunch of stats, which I don't fully understand. The model has high precision and recall for the corporation and union classes and reasonable precision and recall for the individual class. The overall accuracy of the model on the test set is 0.89; I'm guessing that's 89 percent. Given this performance, we could use this model to classify the unmatched donors; however, keep in mind the model's performance might not be the same. Would you like to proceed with classifying the unmatched donors using this model? And I'll say yes.

It comes across an error because of some blanks and fixes it. We've successfully classified the unmatched donors using the random forest model; you can check the results in the notebook. Please note that the accuracy is dependent on the accuracy of the model, which is trained on a relatively small data set. If you want to save this classified data, you can export it to a CSV file. Would you like to do that? Yes, please; it's always nice to be polite. And it says the classified data has been successfully exported to a CSV file named classified_donors.csv, and you can download it from your project files in Noteable. So I'm going to go over to Noteable, go to my project, and there we have it, classified_donors.csv. I'll download that to my computer and take a look.

Because this was a bit of an artificial exercise, I can actually check and see how good a job ChatGPT did. The data that I got from ChatGPT is here, and it just has three columns: the contributor name, the amount, and the class name for all 60,000 donors in my data set. I loaded that into a spreadsheet, then loaded in the correct classification from the original data set, and then just did a little VLOOKUP so that for each donor I've got the predicted class according to the machine learning model that
ChatGPT did and the correct class. We can see already some correct but some mistakes. If we look at the data in a pivot table, these are the correct classifications. For the true corporations, of the 10,000-and-change corporations, ChatGPT correctly identified about 9,000 of them as corporations but incorrectly identified 1,800, which isn't great, as individuals and 95 as unions. It did a much better job on individuals: almost all of the individuals were correctly identified as individuals, and only a handful were not. Same thing with unions: of the 469 unions in the data set, 460 were correctly identified as unions and nine were identified as individuals. If we look at that on a percentage basis, it got 83 percent of the corporations right, 98 percent of the unions right, and 100 percent of the individuals right. That's obviously not quite true, but you have to get to, I think, two or three decimal places before you actually see the errors: 99.957 percent correct.

In my playing around with these machine learning models, in a situation like this where there are three classifications and it's just trying to decide what bucket to put something in, usually one bucket is worse than the others, and it's a bit idiosyncratic depending on how the model actually works. So almost by definition, it's being more conservative about identifying individuals, but that means that there are more individuals that end up in corporations. If the model was tuned slightly differently, and I've seen the reverse happen, where some more corporations get identified as individuals, then the corporation success rate is a bit better.

If we go to the pivot table, we can start figuring out why some of these mistakes were made. For example, what individuals did ChatGPT incorrectly identify as unions? If I double click on this right here, it's actually, interestingly, a bunch of people that left money to political parties in
their will. So instead of it just having someone's name, it was Estate of something, Estate of something, and that got ChatGPT a bit confused, and it thought for some reason those were unions. Same thing for the corporations: a lot of those were identified improperly as individuals. If I double click on this, and I can almost guess what this is going to be, well, some of these are a bit more surprising, but a lot of them are corporations that have people's names in them. So it's James M Cody Law Corp, Patricia Taylor Law. I'm less clear why Big Kahuna Sports or the Innovation Resource Center got identified as individuals, but that's what ended up happening.

But overall, a pretty good success rate, if I can find the percentages here, for a classification model on relatively little training data. In my playing around with this, the more training data that you give ChatGPT, not surprisingly, the better job it does of identifying things, and obviously it depends on the subject matter itself, how easy or difficult it is to classify things into different categories. But overall, I think it's pretty impressive that without knowing anything about machine learning, not even, in my hypothetical example here, knowing what machine learning is, I can ask ChatGPT to match some data for me and it starts using some of these more advanced models.

As always, one of the neat things about Noteable is that it puts all of that code into a Jupyter notebook. You may not be super familiar with it, but if you have someone on your team that's more of a coder, they can look at the libraries that were used and the specific code that was used to do some of those classifications. Of course, also, because ChatGPT knows more than just your data, you could ask it something like, can you please explain to me what the random forest model is and how it works? So it talks about how it's a
machine learning model known as an ensemble model. It combines the predictions of several base estimators built with a given learning algorithm in order to improve generalizability and robustness. Here's a high-level overview of how a random forest model works: it takes several bootstrap samples to build a tree, then aggregates the data. Again, this is still pretty advanced, and I'm a little bit lost, but you could then say, can you explain bootstrap sampling to me a little bit more, things like that. So it's one of the nice things about ChatGPT that it can use a tool and then actually explain to you how that tool works.

So hopefully this was helpful. Again, if you are wanting to know a little bit more about how the basics of this work, I'd recommend you look at my first video on how you can do data analysis using Noteable and ChatGPT. You need the paid version of ChatGPT, ChatGPT Plus, and you need to install the Noteable plugin. And yeah, if you like this video, stay tuned; I'll make some more as I go along and find things about ChatGPT and Noteable that I find interesting and helpful. Okay, thanks a lot.
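
The spreadsheet check described in the video (matching each donor's predicted class to its correct class, then summarizing in a pivot table) can also be sketched in a few lines of Python. This is a minimal illustration, not the code ChatGPT generated in the notebook: the function name `per_class_accuracy` is hypothetical, the union counts are the ones quoted in the video, and the corporation counts are the rounded figures as spoken ("about 9,000", 1,800 and 95), so the corporation percentage is only approximate. The individual counts are not stated in the video and are left out.

```python
from collections import Counter


def per_class_accuracy(counts):
    """For each true class, return the share of donors predicted correctly.

    `counts` maps (true_class, predicted_class) pairs to a number of donors,
    which is the same information as the pivot table in the video.
    """
    totals = Counter()
    correct = Counter()
    for (true_class, predicted_class), n in counts.items():
        totals[true_class] += n
        if true_class == predicted_class:
            correct[true_class] += n
    return {cls: correct[cls] / totals[cls] for cls in totals}


# (true class, predicted class) -> number of donors.
# Union figures are quoted in the video; corporation figures are rounded
# as spoken, so treat them as approximate.
counts = Counter({
    ("Union", "Union"): 460,
    ("Union", "Individual"): 9,
    ("Corporation", "Corporation"): 9000,
    ("Corporation", "Individual"): 1800,
    ("Corporation", "Union"): 95,
})

accuracy = per_class_accuracy(counts)
for cls, share in accuracy.items():
    print(f"{cls}: {share:.1%} correctly classified")
```

With these inputs the union share comes out at roughly 98 percent and the corporation share at roughly 83 percent, in line with the percentages read out from the pivot table.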

Info

Channel: Chad Skelton

Views: 7,401

Id: L7jruf_T33E

Length: 17min 28sec (1048 seconds)

Published: Wed Jun 14 2023