ChatGPT for Data Science and Data Analysis

Video Statistics and Information

Captions
Welcome to the ultimate ChatGPT for data science tutorial. In this tutorial, you are going to learn about ChatGPT, how it works, and most importantly, how you can utilize it to make data science easier, faster, and more efficient. We are going to make ChatGPT write SQL queries, analyze data using Python, and even train machine learning models. So what is ChatGPT? ChatGPT is an advanced language model that can understand and generate text. You can use it to create content, write articles and emails, and even write and explain code. We can use it to generate data, write unit tests, and train machine learning models.

Now let's move on and see ChatGPT in action. Head over to chat.openai.com, and if you don't have an account, sign up; it won't take a minute. Once you've logged in, you're going to see the main screen with an input box to talk to ChatGPT. Here's my first prompt: list the top 10 free courses for machine learning. As you can see, it listed some of the best machine learning courses out there. We can also ask it questions about the answers it produced previously. Let's ask it: what were the key takeaways from the third course it listed above? It was able to note that the third course mentioned above was Machine Learning by Andrew Ng. Not only that, it listed the key takeaways correctly; the takeaways it mentioned are exactly what you can expect from this course.

Now let's use ChatGPT for data science. We start by loading a dataset into ChatGPT. We're going to get our dataset from W3Schools, so head over to w3schools.com/sql. Click on the "Try it Yourself" button. It'll bring you to a page where you can write SQL queries. Change the Customers table to the Products table and run the SQL statement. You will get the query results below. Just copy the first few rows, including the headers. Then go to ChatGPT, tell it that this is the Products table, and paste what you've copied. It was able to understand that this is a list of products.
And it was able to detect the column names as well: ProductID, ProductName, SupplierID, CategoryID, Unit, and Price. The columns are self-explanatory, and we will be using them to write SQL queries. Let's tell it to convert the data above to a tabular format so it's easier to read. ChatGPT generated a beautiful-looking table for us. Now we can ask it questions about the dataset. Let's ask it: what is the product that has the highest price? The answer is Mishi Kobe Niku, with a price of 97. Let's see if that's correct. Indeed, it is. Let's also ask: what is the product with the lowest price? It's Aniseed Syrup, with a price of 10. Let's tell it to calculate the average price of the products above. Not only did it show the average, it showed how to calculate it as well. If you ask it again, it might even show you an SQL query.

Speaking of SQL queries, let's make it write some. Let's ask it to write a query that gets the product with the highest price. The query looks good: it orders the products by price in descending order and then limits the results to one, thus effectively getting the product with the highest price. Let's also tell it to get the product with the lowest price. Like the query above, it did the same thing, but it ordered by price in ascending order. Let's also tell it to calculate the average product price. The query is straightforward, and it uses the AVG aggregation function.

Now let's make things a little bit harder on ChatGPT. Let's import a couple more tables and ask it questions that are only possible to solve by joining tables together. Let's head back to W3Schools and get the OrderDetails table, and import it into ChatGPT like we did with the Products table. Now let's do that for a couple more tables: the Orders table, and finally the Suppliers table. Now let's ask it to calculate the average product price per supplier.
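To make these queries concrete, here is a minimal sketch of the three single-table queries, run with Python's built-in sqlite3 module against a tiny in-memory copy of the W3Schools Products table (only a few sample rows are reproduced here):

```python
import sqlite3

# An in-memory stand-in for the W3Schools Products table (three sample rows).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Products (
    ProductID INTEGER, ProductName TEXT, SupplierID INTEGER,
    CategoryID INTEGER, Unit TEXT, Price REAL)""")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "Chais", 1, 1, "10 boxes x 20 bags", 18.0),
     (3, "Aniseed Syrup", 1, 2, "12 - 550 ml bottles", 10.0),
     (9, "Mishi Kobe Niku", 4, 6, "18 - 500 g pkgs.", 97.0)])

# Product with the highest price: order by price descending, keep one row.
highest = conn.execute(
    "SELECT ProductName, Price FROM Products "
    "ORDER BY Price DESC LIMIT 1").fetchone()

# Product with the lowest price: same idea, ascending order.
lowest = conn.execute(
    "SELECT ProductName, Price FROM Products "
    "ORDER BY Price ASC LIMIT 1").fetchone()

# Average price via the AVG aggregation function.
avg_price = conn.execute("SELECT AVG(Price) FROM Products").fetchone()[0]

print(highest)              # ('Mishi Kobe Niku', 97.0)
print(lowest)               # ('Aniseed Syrup', 10.0)
print(round(avg_price, 2))  # 41.67
```

On the full Products table the averages differ, of course; the point is the shape of the three queries.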
This will require ChatGPT to join the Products table and the Suppliers table together. Let's see how it does. It was able to join the Products table and the Suppliers table on the SupplierID column, and it used the AVG aggregation function with a GROUP BY on the supplier name. The query looks correct, but let's copy it and make sure it runs correctly. Just copy the SQL code, then head over to W3Schools again and paste the query. Not only did it run without any bugs, the results look correct.

Now let's make ChatGPT do a small calculation. We can ask it to write an SQL statement that gets the product that achieved the highest revenue. This requires it to join the Products table and the OrderDetails table, and then understand that revenue is price times quantity. You can see that it understood that revenue equals price times quantity, although I never explicitly mentioned that. It was also able to join the Products and OrderDetails tables together correctly.

Now let's make ChatGPT write an SQL statement with three joins. We can ask it to get the employee that made the highest sales from the tables above. This requires joining the Orders table, the OrderDetails table, and the Products table together. It joined all three correctly, and the SQL statement looks syntactically correct, but let's see if it runs. It ran without any bugs, and it actually got the employee ID with the highest sales.

Let's make ChatGPT use window functions and subqueries. We can simply ask it a variation of the question above: get the employee that made the second highest sales. Wow. Not only was it able to use the RANK window function, it was also able to put the subquery into a common table expression (CTE) so that the query looks neat and clean. Now let's use ChatGPT to analyze some data in Python.
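Here is a sketch of the kind of query described above: three joins, revenue computed as price times quantity, and a RANK window function inside a common table expression. The rows are made up for illustration; window functions require SQLite 3.25 or newer (bundled with recent Python releases):

```python
import sqlite3

# Tiny made-up stand-ins for the W3Schools Orders, OrderDetails, and Products tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (ProductID INTEGER, Price REAL);
CREATE TABLE Orders (OrderID INTEGER, EmployeeID INTEGER);
CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);
INSERT INTO Products VALUES (1, 10.0), (2, 20.0);
INSERT INTO Orders VALUES (101, 1), (102, 2), (103, 3);
INSERT INTO OrderDetails VALUES (101, 1, 5), (102, 2, 5), (103, 1, 2);
""")

second_highest = conn.execute("""
WITH sales AS (        -- total revenue per employee: SUM(price * quantity)
    SELECT o.EmployeeID, SUM(p.Price * od.Quantity) AS total
    FROM Orders o
    JOIN OrderDetails od ON o.OrderID = od.OrderID
    JOIN Products p ON od.ProductID = p.ProductID
    GROUP BY o.EmployeeID
),
ranked AS (            -- rank employees by revenue, highest first
    SELECT EmployeeID, total, RANK() OVER (ORDER BY total DESC) AS rnk
    FROM sales
)
SELECT EmployeeID, total FROM ranked WHERE rnk = 2
""").fetchone()

print(second_highest)  # (1, 50.0): employee 1 made the second highest sales
```

Putting the ranking in a CTE rather than a nested subquery is exactly what makes the generated query read so cleanly.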
Let's use Kaggle's Heart Attack Analysis and Prediction dataset for this tutorial; I'll leave the dataset link in the description below. Let's download the dataset by clicking the button in the upper-right corner. You'll get a file called heart.csv. Now let's copy the first few rows to import it into ChatGPT. We can give it a short sentence like "this is a heart attack dataset" and then paste in the data. ChatGPT was able to understand that this is a heart attack dataset, and it was also able to list a couple of the columns. It also understood that the output column indicates whether or not a patient had a heart attack.

Now let's ask ChatGPT to write a Python program that reads the dataset, gets the data types for each column of the dataset, gets the summary statistics for the dataset, and drops any duplicate rows. We can see that ChatGPT used pandas to read the dataset and to do the rest of the tasks. This script tackles all the points that we mentioned above, and it looks syntactically correct. Let's copy the code and paste it into a Python notebook to see if it works. We'll import pandas, then read in the dataset; we'll just need to change the CSV name to read the correct file. ChatGPT used dtypes to present the data types of the columns. Then it used the describe function to print out summary statistics of the dataset. The describe function presents all sorts of statistics: the count, the mean, the standard deviation, the minimum and maximum, as well as the percentiles for each column. This will give you an idea of how each column is distributed. Let's print out the shape of the DataFrame: it has 303 rows and 14 columns. Then let's drop all the duplicate rows and print out the shape again. Now it has 302 rows, which means one duplicate row was removed. ChatGPT is doing well so far, but let's see how it performs when we ask it for some univariate analysis.
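The inspection script described above boils down to a few pandas calls. A minimal sketch, with a tiny stand-in frame (one deliberate duplicate row) in place of heart.csv so it runs anywhere:

```python
import pandas as pd

# Stand-in for the Kaggle heart-attack data; the real script would start with
# df = pd.read_csv("heart.csv"). Column names follow the real dataset.
df = pd.DataFrame({
    "age":    [63, 37, 41, 41],
    "sex":    [1, 1, 0, 0],
    "cp":     [3, 2, 1, 1],
    "trtbps": [145, 130, 130, 130],
    "output": [1, 1, 1, 1],
})

print(df.dtypes)      # data type of each column
print(df.describe())  # count, mean, std, min, percentiles, max per column
print(df.shape)       # (rows, columns) before deduplication -> (4, 5)

df = df.drop_duplicates()
print(df.shape)       # one duplicate row removed -> (3, 5)
```

On the real file the same calls report 303 rows before deduplication and 302 after, exactly as in the video.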
Let's ask it to create a visualization with the proportions for all categorical columns. Let's copy-paste the prompt we used above and change the second and third points. Let's ask it to put all categorical columns in a list manually. This is because I want to choose the categorical columns myself and not depend on the data types, since they are all numerical. Then I'll tell it to plot the proportions of the different values for every categorical column we have. We can see that it generated a list with the correct categorical column names. Then it looped over each column name in the list, and it was able to use the value_counts function and the plot function to produce the proportions as bar charts. Let's copy the code and paste it into the notebook. Let me import matplotlib at the beginning of the file and then run the code. We can see that 70% of our dataset has a sex value of one. We can also scroll down to see the proportions for each column that we have in our dataset. The output column is important to look at: it determines whether your dataset is skewed or not. This is not a skewed dataset, and that makes training a machine learning model a little bit easier.

Next, let's ask it for the distributions of all numerical columns. Let's copy-paste the prompt above and change the categorical columns to numerical columns and the proportions to distributions. You can see that ChatGPT also chose the numerical columns correctly, and it was able to use the plot function with kind="hist" to produce frequency plots. For example, age looks more or less normally distributed, with an average age of 55. Then we can see that resting blood pressure is skewed to the right a little bit. You can take a look at the other columns as well. Instead of histograms, let's say I want box plots; box plots can be easier to interpret. All I have to do is copy-paste the prompt above and change distributions to box plots. Let's copy the code and paste it into the notebook.
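The univariate loops described above can be sketched as follows. The frame and the column lists are illustrative stand-ins for heart.csv, and plots are saved to files rather than shown in a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Tiny stand-in for heart.csv; in the real notebook, df = pd.read_csv("heart.csv").
df = pd.DataFrame({
    "age":    [63, 37, 41, 56, 57],
    "sex":    [1, 1, 0, 1, 0],
    "trtbps": [145, 130, 130, 120, 120],
    "output": [1, 1, 1, 1, 0],
})

categorical_cols = ["sex", "output"]  # chosen manually, as in the video
for col in categorical_cols:
    # proportions of each value, drawn as a bar chart
    proportions = df[col].value_counts(normalize=True)
    proportions.plot(kind="bar", title=f"Proportions of {col}")
    plt.savefig(f"{col}_proportions.png")
    plt.close()

numerical_cols = ["age", "trtbps"]
for col in numerical_cols:
    # frequency plot; swap kind="hist" for kind="box" to get box plots instead
    df[col].plot(kind="hist", title=f"Distribution of {col}")
    plt.savefig(f"{col}_distribution.png")
    plt.close()
```

Note how the histogram-to-box-plot change is literally one keyword argument, which is why the copy-paste-and-tweak prompting works so well here.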
Now we can see the box plots for each column. Next, let's try bivariate analysis. Let's start by telling it to generate a heat map: copy-paste the prompt above and change the box plots to a heat map. We can see that it used seaborn for this visualization, so let's copy the code and paste it into the notebook. There are no strong correlations between the columns. As a rule of thumb, the absolute value of a weak correlation lies between 0 and 0.3, a medium correlation is between 0.3 and 0.7, and a strong correlation is above 0.7. As you can see here, we have no strong correlations between columns.

Let's tell it to generate proportions with respect to the output column. So I want to have the proportions of a column when the output is zero, and another set of proportions when the output is one. When the proportions are different, this might indicate that the column is a valuable feature for predicting the output. We can copy-paste one of the prompts above and change the second point to listing all categorical columns except the output column. Then we want to plot the proportions of all categorical columns for each value of the output column. It used the same code that it used for the proportion graphs, but inside a double for loop, where the inner loop iterates over the output values. Let's copy and paste the code into the notebook. You can see that the first graph shows the proportions of sex when the output is zero, and the second graph shows the proportions of sex when the output is one. But this sort of back-and-forth comparison of two graphs is quite cumbersome, so let's ask it to merge those two graphs into one graph; effectively, we're going to have one graph per column. If we copy-paste the prompt above and add that each categorical column should be in one graph, it produces the correct results. It generated slightly more complicated code, but the output is worth it.
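A sketch of both bivariate steps follows. The video's heat map used seaborn's sns.heatmap(df.corr(), annot=True); plain matplotlib stands in here to keep the example dependency-light, and the frame is again an illustrative stand-in for heart.csv:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Tiny stand-in for heart.csv.
df = pd.DataFrame({
    "age":    [63, 37, 41, 56, 57, 45],
    "sex":    [1, 1, 0, 1, 0, 0],
    "cp":     [3, 2, 1, 1, 0, 0],
    "trtbps": [145, 130, 130, 120, 120, 138],
    "output": [1, 1, 1, 1, 0, 0],
})

# Correlation heat map (sns.heatmap in the video; imshow as a stand-in).
corr = df.corr()
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.savefig("heatmap.png")
plt.close()

# One merged graph per categorical column: proportions of its values
# within each output group, drawn side by side.
for col in ["sex", "cp"]:
    props = (df.groupby("output")[col]
               .value_counts(normalize=True)
               .unstack(fill_value=0))
    props.plot(kind="bar", stacked=False, title=f"{col} proportions by output")
    plt.savefig(f"{col}_by_output.png")
    plt.close()
```

Each row of props sums to 1, so the two output groups can be compared directly within a single chart.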
Now, I don't like the stacked bar graph, so I'm just going to change the stacked attribute to False. There we go; this is way better than what we had before. Now we can compare proportions without scrolling back and forth. We can see that when chest pain is zero, the output is most probably zero; thus, the patient most likely did not have a heart attack. We can also do the same analysis for the rest of the columns.

Now let's create distributions with respect to the output column. It's going to be the same as the proportions, but rather than plotting proportions, we're going to be plotting distributions. Copy-paste the prompt above and change categorical to numerical and proportions to distributions; everything else should stay the same. You can see that the generated code is very familiar. We can see that patients who had a heart attack have a higher resting blood pressure than people who didn't. You can check the other columns as well. Now let's generate box plots with respect to the output column. We'll just copy-paste the prompt above and change distributions to box plots, then copy the code and paste it into the notebook. With box plots, you can compare the two distributions more clearly. Let's also create a pair plot for all numerical columns using seaborn. The code looks easy and straightforward, so just copy it and paste it into the notebook. The pair plot is an efficient way to see scatter plots for all combinations of your columns.

Lastly, let's train a heart attack prediction model. Let's tell ChatGPT to write a Python program that reads the dataset, trains a model that predicts whether a patient had a heart attack, and evaluates the model using scikit-learn's classification report. The code seems on point: it first determined the feature columns, then it separated the inputs and the outputs of the model. Then it split the data into training and testing sets using the train_test_split function.
Then it trained scikit-learn's logistic regression model, and then it evaluated the testing set using the classification report. Let's copy the code and paste it into the notebook. Let's run the imports first, then define the X and y variables, then split our data into training and testing sets, then fit our model and evaluate it. Using the classification report, we received an F1 score of 83%, which is not bad at all. This is actually impressive, considering that we didn't code anything ourselves. So that's it, guys. There are endless ways to get creative and use ChatGPT. I hope you found this video helpful. If you enjoyed it, please give it a like and subscribe for more videos like this.
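For reference, the end-to-end modeling steps walked through above can be sketched as follows. Synthetic data with a made-up label rule stands in for heart.csv so the sketch is runnable on its own; the real script would simply read the CSV instead:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in data shaped like heart.csv; the real script starts with
# df = pd.read_csv("heart.csv"). The label rule below is synthetic,
# purely so the sketch has something learnable.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age":    rng.integers(29, 78, n),
    "sex":    rng.integers(0, 2, n),
    "trtbps": rng.integers(94, 200, n),
    "chol":   rng.integers(126, 564, n),
})
df["output"] = ((df["age"] + df["trtbps"] / 4) < 95).astype(int)

# Separate the feature columns from the target.
X = df.drop(columns="output")
y = df["output"]  # 1 = had a heart attack

# Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit scikit-learn's logistic regression model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Precision, recall, and F1 per class, as in the video.
print(classification_report(y_test, model.predict(X_test)))
```

On the real heart.csv, the same pipeline yields the roughly 83% F1 score reported above.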
Info
Channel: Code In a Jiffy
Views: 86,412
Keywords: chatgpt, python data analysis, analysing data with chatgpt, openai, stable diffusion, chatbot, spacy, google vs chatgpt, using chatgpt to solve programming challenges, jesse jcharis, good ai, data analysis with chatgpt, building ml models, codex, data science, machine learning, natural language, natural language processing, nlp, data analysis
Id: Nfc5XWK9ioQ
Length: 21min 31sec (1291 seconds)
Published: Thu Feb 02 2023