Talk to your CSV & Excel with LangChain

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[00:00:00] All right. In this video I'm gonna be going through how you can use LangChain to talk and extract information from CSV files and Excel files just by using natural language to query them. So in this case, we're gonna be going back to the open ai language models. So you will need an open AI key here. You can see the latest version of LangChain that I'm using. And the data set that I'm gonna be using is the Black Friday Sales Data set. So this is available from Kaggle. I've put a link to it here if you want to go and find out more information about it. Really, we're just using it. As a simple way to bring in a CSV file and test some things out. In this case, I'm bringing in the trained csv. There's both a trained CSV and a test csv. It really doesn't matter here because we are not training a model. We're not doing anything like that. We're just using another model to query the CSV file. To just get a sanity check of what's here, what [00:01:00] I've done is basically set up pandas. Which I'm sure many of you will know and then use pandas to read in the csv. And then just have a look at what, the first five records and what the column names are. So you can see that we've got gender, and you'll notice that gender is f and m. We've got age, which looks like it's bucketed age. We've got stay in city, current city number of years. We've got marital status. We've got a bunch of different things in there. We're not gonna use this particular one that I've set up. This is just a sanity check. LangChain will use its own one to load it in. The way we're gonna do this in LangChain is we're gonna use the CSV agent. So the CSV agent , it has a number of things you need to be a little bit careful about because what it's actually doing is running, a Python agent under the hood. So you want to be a bit careful of anything like a prompt injection attacked or something like that. So if you are using it for end users that are not yourself or not the people you trust, [00:02:00] be a little bit careful about the environment that you're gonna run the Python agent in. In this case, we're just running it our. So it's not a big deal at all. Okay, so we're gonna bring in, from LangChain agents, we're gonna begin the create CSV agent. We're going to also bring in the open ai large language model here. And we're gonna create this agent. And remember we always create an agent by passing in a language model. So that's the first thing that's being passed in here. We're setting the temperature to zero. This is to basically, we don't want it randomly changing the facts that are in CSV when it's feeding back, et cetera. We wanna try and reduce hallucination as much as possible. So therefore we set the temperature to zero. We're passing in the CSV file itself, and we're gonna pass in verbose equals true. So this is basically just so that we can see what's going on with the promptsas it goes through. All right, so if we have a look at the agent we can see that, okay, it's certainly got, a number of language models in there and language model chains in there. You can see [00:03:00] that some of the, this has got the prompt in there that we've got, which we'll look at in a second. It's also got things like, it's gonna have an input and a scratch pad. So the ideas of a scratch pad is that it can take some output from One Call to the language model and use it as part of an input for the next call to a language model. Let's come down and look at the, and we can see that, as it goes on, it passes in a bunch of examples of the data and stuff that it's got in the agent. Now, it's not parting all that into each call, right? That's not the idea here. What it we've got here is we are basically passing in a prompt. So let's look at the prompt. I can just copy this prompt and it would be better if I just paste it in there. and now we can read the prompt quite easily. We can see that, okay, this first prompt is basically going to be, you are working with a pandas data frame. So this is the same thing as what I set up, but it's using its own loader for loading it. And we'll look at, you could actually use that loader as a separate component later on [00:04:00] if you wanted to. You are working with a pandas data frame in Python data frame is the name of data frame is df. Very conventional sort of thing. And then it's basically telling it, these are the tools that you can use below. In this case it's just using the python REPL, right read, evaluate, print, loop that's going on here. And then it talks a little bit about the format that it can use. And what it should output for actions for observations and any thoughts. And then we can see that what we do is we eventually, we are passing in the head of the data frame. So this is just like what I printed it out, up at the top of the notebook. We're passing in the input question and we are passing in the scratch pad as this thing goes on. So let's kick it off and start by asking a simple question how many rows are in there? So you'll see that the. agent is doing a similar thing. So actually, just one thing I forgot to point out is when we look at the the actual agent, we can see we're not using the React agent here. We're using a zero shot agent which is slightly different than the one that we used [00:05:00] when in the video I made about agents. All right, but we've still got a similar sort of concept here where we are gonna make a call. It's going to decide, what sort of action to take. So we can see that we've gone into this agent executor chain it's passed in, how many rows are there. And it's thought is, I need to count the number of rows. So then it's gonna use the Python REPL for that. And it's gonna basically just pass in the data frame and say, okay, what's the length of the data frame? The data frame is 550,000. So then it gives us back this final answer. The next thing Is we can ask it something a little bit more complicated. So how many people are female in here? So this is an interesting one because we haven't said, in the actual data frame, it doesn't use the word female or male. It uses f and m. But, and it uses the column name, gender. So we haven't used the word gender. We haven't used F, we haven't used M in the same way. But sure enough, we can see that, okay, I need to count the number of people who are female [00:06:00] and then it works out that okay to do that. Because it's got the head of the data frame, it can work out based on the columns that, ah, okay, it's gonna be in the gender and it's gonna be f and I just need to count that. So it's gone and count, counted all of that up and then it knows the answer and it's giving us this answer back 135,000. We can ask it more complicated questions too, so we can ask it things like how many people have stayed more than three years in the city. So that's one of the fields that they have is stay in current city number of years and can see sure enough it's worked out that it needs to filter the data frame for people who've stayed in the city, in their city for more than three years. And it basically just writes the pandas query here, gets the information, returns that. So gradually as you go along should try it and do more complicated things. So here I'm asking it. Okay, what about, how many people have stayed for more than three years and are female? So again, it's done a nice job of taking input. Writing the [00:07:00] pandas query, which is gonna be, the stay in city greater than three and gender equals f and it's able then to return to us, how many of those people are female? We can ask it, quantitative questions where things like, okay, are there more males than females? Now we know thisbased on it. And we can see that, okay, what it's done is in its scratch pad. In the observation, it's basically outputted that it's counted the number for males and the number for females. And then from that it's able to then work out that, okay, there are more males than female. So that's the basic agent. And this will work for, any CSV that you want to put into it. If you want to actually make a custom agent, you could use the LangChain document loader and use the CSV loader. So this is a very simple sort of thing of just loading up the CSV yourself, and then you would put in some sort of custom chain here for a specific task that you want to do. If you want to use an Excel file, no problem. You just basically put in some code to convert your Excel file. So this is the example of the [00:08:00] Excel file and put it convert it over to be a CSV file and import it that way. So LangChain doesn't have native , it doesn't have Native Excel import. And it's important that you understand that what we're doing here is we are operating on it as if it's a data frame and not necessarily strictly like a spreadsheet where we could look at very specific cells and formulas in those cells, that kind of thing. It's a little bit different than this. So here we've brought in another one that was an Excel file. I've just asked that, okay, what are the column names? It's able to work out, that, okay these are the column names and tell us what they are. Then I said, okay, there's age. So what is the average age? It's able to calculate that out. If we look at some of these that are a little bit more complicated. Okay. Which country appears the most and how many times does it appear? It goes through, it works out that United States appears the most, and you can see that once it's got this information, it's going back to the language model to rewrite that information. So the country that appears most is United States appearing 48 [00:09:00] times. We can even do things like. So here we can basically ask it, okay, what's the ratio of males to females? And this is a different CSV file than we had before, and the other one had a lot more males. This one seems to have a lot more females than than males. But so this gives you a sense of what you can do with this for both CSV files and Excel files. And it allows you to write very simple little apps that allow people who don't know how to use pandas, for example, to query a bunch of data really quickly and find out information that's relevant to them. Hopefully this was helpful to you. If you've a, any questions please put them in the comments. Remember, if you like this video, please click and subscribe. I will see you in the next video. Bye for now.
Info
Channel: Sam Witteveen
Views: 55,564
Rating: undefined out of 5
Keywords: OpenAI, LangChain, LangChain Tools, python, machine learning, artificial intelligence, natural language processing, nlp, Huggingface, langchain, langchain ai, langchain in python, gpt 3 open source, gpt 3.5, gpt 3, gpt 4, openai tutorial, prompt engineering, prompt engineering gpt 3, llm course, large language model, llm, gpt index, gpt 3 chatbot, langchain prompt, gpt 3 tutorial, gpt 3 tutorial python, gpt 3.5 python, gpt 3 explained, LangChain agents, csv, excel, spreadsheet
Id: xQ3mZhw69bc
Channel Id: undefined
Length: 9min 49sec (589 seconds)
Published: Sun Apr 02 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.