How to do data analysis using AI with ChatGPT and the Noteable plugin

Captions
Hi there, my name is Chad Skelton. I am a former data journalist from the Vancouver Sun. I currently teach journalism and data visualization at Kwantlen Polytechnic University in Surrey, British Columbia, which is just outside of Vancouver, and I teach an online master's-level course in data storytelling and visualization at the University of Florida.

I am recording this video because I wanted to show you what ChatGPT can do when it comes to data analysis and visualization. The main thing that has made this possible is a plugin from the folks at Noteable, which is a Jupyter notebooks website. I've tweeted a little bit about this, and I've consolidated my tweets on a blog post, but I thought it might be helpful for people who are more visual learners to see what ChatGPT Plus can do in a video. I am going to explain in a little bit the steps to set up ChatGPT Plus to do data analysis and visualization, but I thought I would jump right into the exciting part, which is what ChatGPT Plus can do.

In order to show this to you, I'm going to be using a data set on bike thefts. This is a data set that I worked on when I was a reporter at the Vancouver Sun. It's five years of bike thefts from the Vancouver Police Department. It's got the date and time of every bike theft, the location in terms of the hundred block, and the latitude and longitude. It's about 7,800 rows. I have uploaded this data set to Noteable, and I'll talk all about how to do that later. I have set up the plugin in ChatGPT Plus, and we are going to go: "Please analyze the data in the bike thefts.csv file and produce some data analysis and visualization."

One of the things that I've found most useful in getting a sense of what ChatGPT Plus can do is to use a data set that you're really familiar with (this is a data set I did a bunch of stories on, and I have taught it in my classes for a decade) but to play dumb. I'm not telling ChatGPT or Noteable what's interesting in the
data set or what kind of charts I want to produce. I'm basically saying, here's some data, tell me about it, and let's see what it can do. I'll just warn you ahead of time: ChatGPT Plus can be a little bit slow, so at certain points I might edit the video or fast-forward it so that you get to the good parts. But let's start our journey here.

Okay, so it starts by telling me what is in the data set. It's talking about a date-time field, a hundred block field, a lat and long field. Interestingly, it's providing me some metadata here on what the fields are, which does not exist in the data set itself. So it's using some of its broader knowledge about the world, in the large language model that is ChatGPT, to come to the conclusion that date time is a date and time field, hundred block is the block where the theft took place, and district is the district where the theft took place. It shows me a little preview of what the first few rows of the data set look like. In my experience with ChatGPT, this is its routine when it's analyzing data: it often will describe the fields and the data for you and then show you the first five rows of the data set. If you get annoyed with this, I've found you can add to the prompt, "No need to show me the first five rows," because it sometimes takes a little bit of time.

Okay, so it's doing a check for missing values, which again is helpful in terms of data hygiene, knowing if there are any problems with your data. This is a step that a lot of my students sometimes forget to do, so it's neat and interesting that ChatGPT is doing it without any more prompts from me other than "analyze the data, visualize the data."

Now, one of the big things that I've discovered so far in working with ChatGPT is that it often comes up with errors. It stops thinking, or it will stop midway through. In this case it looks like there was some sort
of error. You can just regenerate the response. Sometimes, if it just stops, simply saying "continue" is enough for it to pick up where it left off. Now, in my case (again, these are the growing pains of new technology), clicking "Regenerate response" was just not working, so I started a new chat and pasted the exact same query that I had before. You can see it up here: "Please analyze the data in bike thefts.csv and produce some data analysis and visualization."

You can see it gives me those descriptions of the fields and missing values. It also tells me unique values, which is helpful in terms of getting a sense for things, and some basic statistics: the count, the mean, the minimum, the 25th and 50th percentiles for these different fields. It's also looking for missing values, and now we're starting to get into the visualization. I want to point out again that I've given it no prompt, other than having to restart because of a technical error, beyond "analyze the data, produce a data analysis and visualizations." I've given it nothing else to work with.

As we go down, it starts to do some basic visualization. It starts with the distribution of thefts over different districts, so we can see there are the most in districts one and four, not as many in districts two and three. I know, because I live in Vancouver, that the city of Vancouver is broken into four policing districts. Interestingly, there's some sort of error in the data there with a 99. Then the distribution of thefts over time: you can already see, by hour, that there's a rise in bike thefts around the supper hour. Here it gives me this weird location of thefts, which it does as a scatter plot. So again, it's not perfect. I would have done that as a map; that's not super helpful. But it also is providing me some actual summary of the data, which I find interesting as well. From the first plot we can see that the majority of thefts occur in districts one and two; the
second plot shows that thefts tend to occur more frequently in the afternoon and evening hours; the third plot visualizes the location of the thefts. That's the one that's a bit of a mess.

Now again, let's try to play dumb here. Let's imagine we don't really know much about this data, and so I'm going to simply say, "Do more analysis and more charts." That's it. That's all I'm going to tell ChatGPT and Noteable.

Okay, so it says it's going to look at the trend of bike thefts over time, the distribution of thefts by description, and the correlation between the time of theft and the district. It's showing the number of bike thefts per year in the data, and we see that it goes up quite a bit. Then a pretty simple bar chart showing that, not surprisingly, most of the bikes that are stolen are worth less than five thousand dollars, but there are a few in the data set that are worth more than five thousand dollars. Now it's giving me a heat map showing the thefts by hour and how they vary by district, interestingly allowing me to see that the patterns are pretty similar between districts one and four, in terms of a lot of thefts in that supper hour.

I'm going to add this one thing, where I'm going to say, "Sorry, show me the thefts by hour chart again." Okay, so it shows me the thefts by hour plot. When I look at this, one of the things that strikes me, and we often talk about this in my class, is that there's definitely a spike around noon and a spike around dinner time, then thefts go down, but then there's quite a major spike right at the zero hour: quite a bit more thefts than there are at the 11 p.m. point and way, way more than there are at the 1 p.m. point. You can imagine late-night thefts, but why is there such a huge spike at the zero hour? So I'm going to ask it, "Why are there so many bike thefts right after midnight?" And it starts to give me some theories about why
that may happen, but the part that I find most impressive about this is that ChatGPT actually catches a mistake that most in my class do not, which is that it's not actually the fact that more thefts happen right after midnight, but that it is a data error. Sometimes, if the exact time of the theft is not known, it might be reported as having occurred at the start of the day, which is midnight. Or, and this is sort of the same thing, if the time of the theft was not recorded, it might be entered as 00:00 by default, leading to an overrepresentation of thefts at midnight.

Now, I think one of the most impressive and mind-blowing things about what ChatGPT and the Noteable plugin can do already is that almost all of this analysis came with virtually no direction from me at all. It was "Please analyze the data in the bike thefts.csv file and produce some data analysis and visualizations," to which it did a pretty impressive job. "Do more analysis and more charts," and it gave me thefts over time, how many are over five thousand dollars and how many are under, and a heat map comparing districts and time. And then, when I had a question about the data, where I didn't have any idea what the problem might be but was just like, "Well, that's weird," it came up with some really well-thought-out theories for what might be going on.

Now I'm going to show you how you can then ask more directed questions. In this case it's raised an interesting idea, so: "Could you look at the distribution of thefts in the hour between midnight and 1 a.m. to see if a lot of them are just 00:00?" It's raised this possibility that these might be data entry errors or reporting bias: we don't know when a bike was stolen, so we don't enter the time and it ends up being zero. But presumably we could see that in the data, because if the thefts between midnight and 1 a.m. are spread all throughout midnight to 1 a.m., that
suggests it's a real phenomenon; if they're all right at 00:00, that's a different issue. So, show me that distribution.

Okay, so it's now showing me the distribution just of the thefts between midnight and 1 a.m., and we can see this huge spike right at the zero hour, right? And ChatGPT tells us why that's meaningful. We can see that a significant number of thefts are reported exactly at midnight. This could be due to the reasons mentioned earlier, such as reporting bias or data entry errors. The number of thefts reported at other minutes within this hour is relatively low. This suggests that the spike in thefts right after midnight might not necessarily indicate that more thefts are occurring at this time; rather, it could be an artifact of the data reporting process.

This is pretty impressive, I think. I ask a very basic, naive, I-don't-know-anything-about-data question ("Why are there so many?") and it raises a good theory. I guess if we want to say what ChatGPT version five or six will do, maybe it would jump right into "let's actually see which of these might be true." But with a little bit of extra prompting ("Well, can we look at that?"), it looked at it, and it doesn't just make the chart, it says what the chart means.

Now, as impressed as I am by this, it's not perfect. Instead of mapping the lat and longitudes, it gives me this weird kind of dot plot. I think that might be because the longitude is labelled in my data as "long" as opposed to "lon," which is more standard. And it didn't pick up on what, when I analyzed this data and did stories on it, was one of the most interesting patterns, which is that there's a very strong seasonal pattern to bike thefts. But this is where, up till now, more as a test of what it can do, we've been giving it very minimal prompting. You can give it more prompting, and the prompting can be of a natural language
variety. It doesn't have to be "sum the data like this and group the data like this." We can just ask something like, "Are there any seasonal patterns to bike thefts in Vancouver?"

Okay, and it gives me a very straightforward chart showing me that there is a clear seasonal pattern in bike thefts, with more bike thefts in July and August than other months. The chart is a lot more colorful than I would probably have gone with (I'm more of a minimalist when it comes to charts), but it's pretty. And again it goes that extra step: ChatGPT knows a lot about the world as a whole, so it helps to put that data into a little bit of context. In this case the context is kind of obvious, but it says this pattern could be due to more people biking in the warmer months, leading to more opportunities for bike thefts. So: a pretty impressive analysis with very little prompting, and then some deeper analysis with some pretty light prompting. I think this is pretty impressive.

Now, the other cool thing about this is that this whole time it's been using the Noteable plugin. We can just ignore this "unverified" thing; this is a weird technical thing going on with ChatGPT at the moment. But every time it sends an instruction, you can actually open this up and see what the API call looks like and what the response from Noteable looks like, so you can get a sense of what it's telling Noteable and what response it's getting in return. What's interesting and cool here is that all of the work that you're doing is then saved in this workbook. It created this bike thefts analysis workbook, and this is now in my Noteable account. It contains all of the code segments of what analysis it was doing. You can see these missing values and unique values from earlier in the analysis. It's got a copy of all of the charts, so you've got this kind of record that you
can go back to. And if you're trying to learn code, I think this could be really valuable. I'll confess that my knowledge of Python is extremely limited. During my last few years at the Vancouver Sun, I wrote maybe a dozen very, very basic web scrapers and built a couple of Twitter bots that scraped some data and tweeted stuff out, and that's about as far as I got. So I'm pretty naive when it comes to Python. I absolutely could not code what is in this notebook, but I can start to make sense of it, and I could see myself learning more about what Python can do in terms of data analysis and data visualization through this: giving ChatGPT plain-language instructions and then having the output be in code. You can also share this notebook publicly with people and show them what you've done, all that kind of stuff, which is pretty cool.

So I have to say I'm really impressed at what ChatGPT can do with this Noteable plugin. We're in very early days, and I can only imagine how much better this is going to get. But this, to me, even putting aside the Python side, allows people to have a data set and have ChatGPT do some basic analysis for them. And even if it misses things, it'll probably catch things that you might have missed. So I'm pretty impressed with the ability of this tool to do quite a lot of cool stuff with very little prompting other than "take a look at my data, tell me what's interesting."

I'd like to wrap up with the part that I skipped at the beginning, which is how you do this, because it's kind of the boring part and I wanted to jump into the exciting part. Basically, this is possible if you have the paid version of ChatGPT, which is ChatGPT Plus. It costs about twenty dollars US a month. You also need an account on Noteable, so
you go to the Noteable website and you click "Sign up." It's free to have an account. Once you've got the account, you can just go to app.noteable.io, and that'll take you to the main page of Noteable. You can create a new project in Noteable by clicking "Create project" and giving the project a name. I created this project called "testing."

Then, when you're in ChatGPT, under GPT-4, you want to click on "Plugins," and under this little thing here you've got the plugin store. You just find the Noteable plugin in this list of plugins and simply click "Install." There'll be a simple authorization step, which is explained here, I believe, just to connect ChatGPT with your Noteable account, and then you're pretty much good to go.

The last thing you might want to do, if you've created a test project, is tell ChatGPT and the Noteable plugin that that is the default project you're going to be using. That's where it will put all of your new notebooks and all that kind of stuff, and where it will look for data sets. You just click in here and say, "Please set my default project in Noteable to" and then the project, like that. Okay, and then it says it successfully set your default project, and it comes up with the name. For some reason I was having difficulty saying "Please set my default project to testing," or whatever the name of it was, so pasting in the URL of the project seems to be the best way to work there.

Now, in another video I'm going to talk about different ways you can get data into Noteable. You can sometimes provide a link to a CSV that exists somewhere on the web. Sometimes you can point it to a data set somewhere and it's able to download it. It also seems, in some cases, to be able to use APIs to grab data from other sources. But if you've got a data set that you want to work with, and I do think, in terms of testing out ChatGPT,
testing it out on a data set that you're comfortable with, that you know the ins and outs of, is probably the best way to go, I found the simplest thing to do is just to upload the data sets to your project. So this is the testing data set here: just click on "Upload," upload whatever data set or data sets you want to use, and then make a note of the name of the data set. In my case it was bike thefts.csv. Then, when you get started on your analysis, you can simply say to ChatGPT, "Please analyze the data in the bike thefts.csv file and produce some data analysis and visualizations," or whatever your file is.

Okay, so as I say, I'm going to make another video that talks about how you can use ChatGPT and the Noteable plugin to find data using APIs, which is something I stumbled across and which is kind of neat. This is the very first public YouTube video I've made. I used to be more into Twitter and blogs and stuff like that, but I'm hoping you found this useful, and if you did, let me know in the comments and I'll post some more as I continue to explore ChatGPT and, in particular, its data analysis and data visualization capabilities.

If you are looking for the data that I'm using in these YouTube videos, Google "Chad Skelton GitHub." On my GitHub profile page, in the folder called "Data," there is a bunch of old data sets from when I was a data journalist at the Vancouver Sun, but at the bottom I've got a folder called "ChatGPT" where I'll be uploading the data files that I'm going to be analyzing in these videos. My website is chadskelton.com, where I will occasionally blog, and at the moment I am posting links to all of my Twitter threads on ChatGPT. I'm of course on Twitter at @chadskelton. So if you're finding this useful, or if you've discovered some other cool stuff yourself with ChatGPT, please let me know. And yeah, I hope you found this video interesting and helpful and get a chance to play around with ChatGPT's data analysis
and visualization functions yourself. Thanks very much.
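
The checks described in the captions (missing values, thefts by hour, the minute-level look at the midnight hour, and the monthly pattern) can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the notebook that ChatGPT generated in the video: the column names `datetime` and `district` are guesses at the file's headers, the helper function names are hypothetical, and the eight rows below are made-up sample data standing in for bike thefts.csv.

```python
import io

import pandas as pd

# Made-up sample rows for illustration only; NOT the real Vancouver data.
# The column names "datetime" and "district" are assumptions.
SAMPLE_CSV = """datetime,district
2012-07-01 00:00:00,1
2012-07-01 00:00:00,4
2012-07-02 00:00:00,1
2012-07-03 00:17:00,2
2012-07-04 12:05:00,1
2012-08-10 18:30:00,4
2012-08-11 18:45:00,1
2012-01-15 23:10:00,3
"""


def load_thefts(source):
    """Read the CSV and parse the timestamp column into datetimes."""
    return pd.read_csv(source, parse_dates=["datetime"])


def thefts_by_hour(df):
    """Count thefts for each hour of the day that appears in the data."""
    return df["datetime"].dt.hour.value_counts().sort_index()


def midnight_minute_counts(df):
    """Within the 00:00-00:59 hour, count thefts by minute."""
    midnight = df[df["datetime"].dt.hour == 0]
    return midnight["datetime"].dt.minute.value_counts().sort_index()


def thefts_by_month(df):
    """Count thefts per calendar month (1-12) to look for seasonality."""
    return df["datetime"].dt.month.value_counts().sort_index()


df = load_thefts(io.StringIO(SAMPLE_CSV))

# Data hygiene: number of missing values in each column.
missing = df.isna().sum()

by_hour = thefts_by_hour(df)
minute_counts = midnight_minute_counts(df)

# If most of the midnight-hour thefts sit at minute 0, the spike is more
# likely a default timestamp than a real late-night pattern.
share_at_exact_midnight = minute_counts.get(0, 0) / minute_counts.sum()

by_month = thefts_by_month(df)

print(missing)
print(by_hour)
print(minute_counts)
print(f"Share of midnight-hour thefts at exactly 00:00: {share_at_exact_midnight:.2f}")
print(by_month)
```

With the real file, the path to the CSV would replace the in-memory sample passed to `load_thefts`, and the column name would need adjusting to match the file's actual header.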
Info
Channel: Chad Skelton
Views: 87,512
Id: A1ualvzgJoo
Length: 20min 12sec (1212 seconds)
Published: Thu May 25 2023