@LangChain Pandas Agent and GPT-4 for Data Analysis

Video Statistics and Information

Captions
LangChain is a great framework that makes it very easy to build applications using large language models. Today I'm going to show you how to use the pandas DataFrame agent of LangChain, and with it we're going to analyze and explore a data set.

Before we get started, make sure you have pandas and LangChain installed on your computer, and make sure you have an OpenAI API key. To make it easier on your first project, you can just define the API key in your Jupyter notebook: you can say openai_api_key and then paste your API key in there. Or, if you want, you can also add it to your environment variables and then read them with the python-dotenv library, but we're just going to define it here for now.

If you've used LangChain before, I'm sure you know how easy it is to use. So the first thing we need to do is to import some modules. One of the modules that we're importing is ChatOpenAI, which is the module that helps us communicate with the model. The other is create_pandas_dataframe_agent from langchain.agents; that's the one that's going to help us do alterations or calculations on our data frame.

Just to show you how to set up and use the pandas DataFrame agent, I'm going to start with a very easy data set for now, and then we can go into a more interesting one, which is an Airbnb data set: all the Airbnb listings in Amsterdam. By the way, if you want to follow along with me as I'm coding, you can go to the description and get the notebook from GitHub.

All right, so let's load the famous iris data frame with pandas as an example. I forgot to import pandas, of course. Let's quickly take a look at this data frame, and then I'm going to create a chat variable, where we will start a ChatOpenAI object. The model name will be GPT-4, because we want to use GPT-4. One thing you need to keep in mind, though: to use GPT-4 you need to go to your OpenAI account, set up billing, and add some funds to your account. I added just ten dollars to begin with, and that has been enough
for me to build the code for this video, test that it works, and even go through this video and show you the examples. So if you just want to experiment with it, ten dollars should be enough to begin with.

I think there needs to be a dash in the model name. We can also set the temperature. The temperature value is between 0 and 1, and that is basically the randomness of the results that you're getting back. If it's zero, it's going to be kind of deterministic and you're going to get the same answer to the same question; if you set it to one, there is a high chance that you're going to get varying results.

Then I'm going to call the create_pandas_dataframe_agent function that I've already imported. I need to pass the chat variable to it, I need to pass the data frame, and let's keep it like that for now. This will be my agent. One thing to note here: if you already have your OpenAI API key in your environment variables, you don't have to do this, but if you created it as a normal variable in your notebook to make it easy to use, you also need to pass a named argument, openai_api_key, to make sure that you can authenticate with the OpenAI API.

All right, now our agent is ready for us to ask it questions. So let's ask some simple questions to see that it works, for example: what is the average sepal length? The average sepal length is approximately 5.843. We can actually check if this is correct: we just need to get the sepal length column and take the mean, and yes, that's exactly the correct answer.

I'll show you one other thing. If you want to understand how the model came to this answer, what code it used, and what the thinking process has been, you can set verbose to True. Then let's ask another question, maybe something a bit different. I can
say, what is the max sepal width (and I'm also not going to include the underscore here, so let's see if it understands that I'm talking about the second column) for setosa.

All right, so as you can see, now we have more information coming in. It's telling us what it's doing. The thought is: to find the maximum sepal width of the species setosa (so it understands that setosa is a species), I need to filter the data frame for rows where the species is setosa, then find the maximum value in the sepal width column. It even tells us what code it uses, which is great if you want to make it look like you did it. And it says: I know the final answer, the maximum sepal width for setosa is 4.4. Which is great.

All right, so if you just wanted to learn how to use the pandas DataFrame agent with GPT-4 on LangChain, this is honestly all of it. But if you want to hang around, I will show you some more interesting questions so that we can test the abilities of this agent.

I'm just going to do the same things that we did before: import pandas and all the LangChain modules that we need. But this time, instead of importing a sample data set, I'm going to import the Amsterdam Airbnb data set that I mentioned before. So I'll just call this Amsterdam Airbnb and read the Amsterdam Airbnb CSV file with pandas read_csv.

All right, so let's understand this data set a little bit. It's a data set that has more than 8,000 listings and 10 columns, so let's take a quick look at it and see what we have. The index column is not read correctly, so I'll just fix that with index_col set to zero. All right, that looks better. So we have the listing URL; the name; the description; whether the host's identity is verified, so it's true or false; whether the host is a superhost or not, where we see only t or f, which is great because then we can check if the LangChain agent can actually understand that t stands for true and f stands for false; how many beds there are; the amenities, in a list, which is also great; and then the review scores
rating and the price.

We need to create the agent first again, of course, so let me just copy and paste the code here. I just need to change the data frame from the iris data frame to the Amsterdam Airbnb data frame. Okay, now we have our agent.

Let's start with an easy one. I want to see how fast it works with eight thousand listings, so I'll say: what is the average price? All right, this is actually interesting. I thought I was asking a very simple question, but it turns out I'm not, so let me stop this for a second. It looks like I've run into some rate limit. But before that: I wanted to ask for the average price of all of these listings, but the price is a string object, so it needs to convert it to a float before it can calculate the average. That's what it tells me it's doing. It says: to find the average price, I first need to convert the price column from a string to a numeric type, then I can use the mean function to calculate the average. Which is great. This is what it's doing here: it's replacing the dollar sign, which basically means removing it, then turning the column into float, and then using the mean function to get us the mean value.

But in the meantime, it looks like I reached a rate limit, so let's see if I can run more questions. Before I continue, I just want to make sure that I cast all of the price column into float type, so that we're not going to run into the same problem over and over again. That might be what's causing the problem with the rate limit, because it needs to cast everything in the price column into float and then do the necessary things. But it's good to know that it's able to deal with that problem too.

So let's ask a new question, and this time I want to ask something a bit more vague. I'm going to ask it which listing has the best value for money, and this way you can also see how it's going to define value for money. All right, it says: to determine the best value for money, we
need to consider both the price and the rating score. Okay, so it's quite simple. I might have expected it to include amenities, or the number of beds, or whether the host is a superhost or not, but it decided to use the rating and the price, then sort, and then get the best one. So apparently the townhouse in Amsterdam that has a rating of 4.65 is the best one. We even get the URL for it.

All right, next question. I want to see if it's going to be able to deal with the fact that amenities is a column of lists. So I'm going to say, which listing has the most... oh, even that's too simple. I'm going to ask which listings have Wi-Fi. I'm not even going to mention amenities, so let's see if it can figure out where to look for whether a listing has Wi-Fi or not.

All right, that's very interesting. What it says is: to find the listings that have Wi-Fi, I need to check the amenities column of the data frame. If the string Wi-Fi is present in the amenities list, then the listing has Wi-Fi. I can use the str.contains function in pandas to check if Wi-Fi is present in the amenities column. Which is great. And even after that it's not done. It tells us: the listings that have Wi-Fi are those whose amenities column contains Wi-Fi; however, the output is too large to display completely, so I'm just going to show you the listing URL. It is quite impressive that it was able to understand that Wi-Fi is an amenity, that it's going to be listed in a list of strings, and that it can turn the whole list into a string and then look for the word Wi-Fi. And it looks like nearly all of them have Wi-Fi, which is, you know, not surprising.

What if we look for something that would be less common? Kitchen, maybe; maybe some houses don't have a kitchen. Oh yeah, let me just stop it first. Okay, my last attempt: coffee maker. The main issue is I'm kind of getting spooked by the fact that it is returning nearly all of the listings. Right, I just got it. It's not, actually. Just
because I keep seeing 8,385 here, I thought it was returning nearly all of them, but actually the number it is returning is 6,054, which makes more sense. So it seems like this is working and there are no problems. I just thought maybe I got it, but no.

The last thing I want to try is whether it can distinguish what t and f stand for, so I'm going to ask it what percentage of hosts are superhosts. All right, so it says: to find the percentage of hosts that are superhosts, I need to count the number of t (true) values. It even says what it stands for. That's quite impressive, honestly.

This is great. You can actually get answers to vague questions. I mean, maybe not the answer that you want; especially with the best value for money, maybe you want it to include some other things. But maybe it's even possible to tell it something like, make sure you consider the number of bedrooms, make sure you consider the number of bathrooms, for example. It looks like that might be possible.

I hope you enjoyed this video and I hope you learned how to use the pandas DataFrame agent. If you have any questions, or if you have an interesting way that you can use this agent, let me know, and I will see you in the next video.

Just as a last note, I wanted to mention, as a reference so that you know how much money you might be spending with GPT-4: I started this whole project with ten dollars in my account, and there were the times that I ran the queries on screen and off screen as preparation to make this video, and right now it's at $7.82. Of course, we're not running really big or very complicated queries, and I did run into some rate limits while I was asking the questions. But just so that you have a reference point in terms of how much money this might cost when you're using it for a project.
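
The checks the agent describes in the captions (removing the dollar sign before averaging the price, matching an amenity with str.contains, and counting t values for superhosts) can be reproduced directly in pandas. What follows is a minimal sketch on a tiny made-up table, not the notebook from the video: the column names price, amenities, and host_is_superhost, and every value in the table, are assumptions for illustration only.

```python
import pandas as pd

# Tiny made-up stand-in for the Airbnb listings table.
# Column names and values are assumptions, not the real data set.
listings = pd.DataFrame(
    {
        "price": ["$100.00", "$250.50", "$1,200.00", "$80.00"],
        "amenities": [
            '["Wifi", "Kitchen"]',
            '["Coffee maker", "Wifi"]',
            '["Kitchen"]',
            '["Wifi", "Coffee maker", "Heating"]',
        ],
        "host_is_superhost": ["t", "f", "f", "t"],
    }
)

# 1. Average price: strip "$" and thousands separators, cast to float, take the mean.
listings["price_num"] = (
    listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)
average_price = listings["price_num"].mean()

# 2. Listings with Wi-Fi: case-insensitive substring match on the amenities text.
has_wifi = listings["amenities"].str.contains("wifi", case=False)
wifi_count = int(has_wifi.sum())

# 3. Percentage of superhosts: "t" stands for true, "f" for false.
superhost_pct = (listings["host_is_superhost"] == "t").mean() * 100

print(f"Average price: {average_price:.2f}")
print(f"Listings with Wi-Fi: {wifi_count} of {len(listings)}")
print(f"Superhosts: {superhost_pct:.1f}%")
```

Casting the price column to float once up front, as done in the video before the later questions, means the same conversion does not have to be redone for every question. Running the same checks by hand is also a quick way to confirm an agent's answer, the way the average sepal length was checked earlier.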
Info
Channel: AssemblyAI
Views: 34,117
Id: ZIfzpmO8MdA
Length: 14min 11sec (851 seconds)
Published: Fri Sep 22 2023