LangChain + OpenAI to chat w/ (query) own Database / CSV!

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey welcome to part two of The Lion King tutorial and today we're going to look at one of the most commonly requested workflow which is to use a large language model like GPT for example to ask questions on your own database or you're on the CSV file or even a sqlite database this should be very practical seeing that everyone above the age of 6 has a database or they find ways to persist the data in some form of a tabular format like the CSV if you're an adeo grandmother of free guess where you're storing your gardening inventory or Home recipes exactly in a CSV file so let's continue from our life of in part one we're using the same environment node independencies if you want to follow along I don't watch the first one to get some background of what land chain is or just put a pause go to the go the rapple do a pip install and resume the video if you're ready let's just create a new file just leave me Demo 2. P1 and you can begin with the Imports we know we're gonna need to use Lang chain so let's might as well just bring that in we know we're going to use the open AI model so let's do that and I also want to bring in the database and the SQL database chain and I'm going to explain both of this in a second now this directory is exactly the same as in the last video in part one of this series The only difference is that I had a new directory here called Academy and I'm going to give you a quick preview of the CSV file now I'm not going to be able to show you all of the columns because it would expose the email address lessons and such and these are actually this is a real life data set contains real information my customers information his first name last name addresses and stuff you can think of them as being you know spotted from a CRM or from a waveform right that's that's how you can think of it I'm going to close it now in the academy.db this is a sqlite file converting from a CSV file to a sqlite file takes about three to five lines of code very very simple if you want to have if you have a CSV file and you want to practice SQL on the CSV file then you can either use DB I have a video covering that you should I'll put the link in the description if you go and take a look at it or open up sqlite and just do a conversion from CSV file uh going to a simple light file that takes about three lines of code so I also have a video for that and I'm just gonna put a link in the video description so if you want to check that out feel free to do that okay I'm going to assume that you have your own CSV file and you have your own um you know sqlite file database file or you can have a remote database server running somewhere right all you need to do is to be able to specify your database URI so that would be wherever that is so this could be some Azure DB running this could be some password running somewhere I could any anywhere right another kind of very common workflow is to have something if you use like bash maybe this could be in your Bash RC or D shell DRC and then you would have something like dot OS and then you can say get F and then you would have your I don't know how you name it but you can you might name it database URL for example that is also one common word flow all right so as long as you can specify that DB URI you should be able you should be good right I'm gonna remove all that because I'm not even gonna use that I'm just gonna read it from this if you don't have any sqlite and you don't want to learn how to convert the CSV to a sqlite font go to my GitHub there's a ton of fake dummy SQL I thought I can download and play around with that okay in my case here I'm going to use sqlite and I'm gonna pre slashes and I'm gonna specify the path to this so this is Academy slash academy.db right so the first Academy we first do this folder that's the name of that and the second one of that is academy.db why do you call it Academy because I run a data science school called out with my data science academy and this data is actually exported from that which is why I can share with you I can share with you other sqlits you can use it it doesn't matter okay and then you create a second line to just say okay this is my your URI this class instantiate a class from that so you could go ahead and say something like SQL database then you say from URI and then you pass in your URI and that's it so this is the only line you need to change right now at this stage right you want to you want to either use a relative path you want to pass in a direct path uh ideally you don't want to do that because it's probably very sensitive information unless you're just running this on your local computer that's fine but if you're doing any kind of real database you don't want to do that you want to use something like a uh putting them into your dot EnV file in your dot EMB file and then from here you just do a lot load dot app so I'll show you how to do that so you can see from dot m import and then from here big one call the load.net and just call it so this will load all your environment variables from the dot EnV file so all you need to do is to create a DOT EMV file and you specify the file and then make sure that that is not in your gate commits so put the get enough file in there and then just ignore that file right away right so you don't want to have this persisted anyway you don't want to accidentally commit that want to GitHub and now that you have DB URI and you have DB the next step is actually very very similar to what we did in the first video so when I have llm and we're going to say something like this we're going to say open AI and we're gonna initialize with a few parameters here for example we just do the simplest thing we said temperature equals zero so temperature controls how varied the answers are if you want to set a temperature of one or close to one for example it's gonna try to be a bit more wild a bit more random a bit more unpredictable temperature zero is a bit more determined and stick in a way right so the closer to zero the more deterministic you get and the more conservative the answers get to so in this case here we're not going to try and make it unless you're doing some sort of very creative writing creative prompt then you can have a higher temperature you will try to do something like um you know helping you write a Noel or write a song you can set the temperature to be 0.7 for example uh in our case here I'm going to build a q a I'm going to query my database but in nature language right so I want to query initial language get some answers from the DB so I don't want my temperature to be high I don't want it to try anything too long I just want it to be a bit more deterministic more predictable and that's that's why it's just a temperature zero all right so let's go ahead and now create a chain and this chain would just be SQL database chain and SQL database chain it's going to take three parameters actually just two it's just gonna take two parameters the first one is the LM so in llm we actually already have the llm so what we need to do is you just pass that back in and then you need a database and in this case here it's just the DB they create that's it but I want to see the step by step that is doing each step so I'm going to say we're both equals to true and that will do it and that is kind of how you set up the chain and 10 lines of code actually if you remove the spacers seven lines of code you're done all right so let's go ahead now and maybe uh try to query our database now in nature language try to get some prompt ultimately this is still a SQL database it still needs to execute SQL queries it cannot execute English so GPD has the job of doing a translation of taking English translate that into SQL and then execute the SQL against your database all right so let's go ahead and do that so DB chain um let's do a few one right so let's do the first one let's run that and let's do uh how many let's start with some something simple how many rows is in the responses table so how many rows is in the responses table of this DB something very very simple should be able to generate the SQL so if gbt is not enough it would take this sentence convert that to the select and then if we just select count from responses and just return the comments oh it's there's 25 000 rows of data that's good enough for me right now we're gonna maybe up the difficulty a bit we're gonna try and make something a bit more interpretive a bit more open-ended we can say try to describe the table for me describe the response table for me right something more open-ended we also gonna add a few more maybe one or two more so start another one let's say um let's take a look at the CSP file there is the country column so there is the Indonesia the United States there are uh there's a few more Hong Kong Singapore all right so I'm gonna ask you to summarize that I'm gonna say what are the top three countries where this responses are from so I'm gonna something a bit more tricky you need to be able to summarize that and then find top three right so I'm going to ask the question there and I'm gonna have the last one here so I'm gonna say BB chain run and I wanna maybe have something a little more difficult a bit more special a bit more different I will try to ask GPT to try and generally summary I said Give me a summary of how these customers uh came to hear about us all right and maybe maybe add a bit more content what is the most common ways they hear about us and that should do it four questions um very very simple let's go hit open up autumnal push this to the side this file is called demo 2. py so I'm just going to say Python and demo2.py and let's run it done all right that's pretty fast let's go ahead and take a look at the answers so the first one it says entering new SQL database chain how many rows is in the responses table of this DB this should be easy I I even give you the answer already I say just do a select count from response and you should get the answer and so it says here this is the SQL query it used so it has to do the job of taking my English convert that to SQL run that give me the SQL query generate the result and then finalize with an answer it needs to do a couple of things there right so you couldn't just go ahead and arrive at this answer it couldn't do that it has to derive the answer from the result which is further derived from the simple query all right so there are all together 207 rows in the responses table of this DB okay fair enough let's look at the second question describe the response table so what it does is very interesting instead of using any kind of like describe it does a select all from response limit five it looks at all of them and then it tries to formulate an answer by looking at all of them so I'm not sure um limit five it tries to take a look at the first five rows I will need to censor some of this part because some of the emails are showing right here but uh okay that that's something I'm gonna have to do so I cannot show you my customers email not that I'm not not that I don't trust you anything but it's a violation of YouTube's policy so I have to put a black box around this area when by the time I come to editing okay the next one the response table contains 15 columns so it gave you the number of these columns give you the names of these columns the third question what are the top three countries where these responses are from so this is the SQL query it needs to again translate the English to SQL it's a select country con country s-con from response this looks about right to me and it generates the um SQL result and then it generates the answer the top three countries where these responses are from are Indonesia Vietnam and the United States okay fair enough the last one give me a summary of all these countries or how these customers came to hear about us what is the most common way that they hear about it so I didn't even put that correctly but okay so it first does this it says select how do you hear about this because he realized that to answer this question he needs to figure out the right columns to look at and it figures out all of the different 15 different columns the column that really matters to it is the how do you hear about this con which is quite smart right so there are 15 columns it looks like this it says what which columns would contain the clue to this answer well it could be the uh this this column so this is called a particular column called the how do you hear about this Con and it doesn't generate it doesn't come does the group by and then finally It produced this nice aggregation table search engine 76 social media online adds 54 media publishing 37 email uh eight and then others too we have a partner called coin works so that's two so the most common way customers heard about us is through search engines followed by social media online as media publishing email address columns so that's quite neat because if it has to figure out of all these columns and figure out which one for these answers it generates the SQL to do that for you right so that's pretty nice okay so that's how you use Link Chain against your SQL sqlite or SQL database all right but what about using it with a CSV file so let's go ahead create a new file I'm gonna say demo 3.p1 by now there's a lot of things that are similar so I'm just going to copy of that and I'm going to remove things that I don't need so which is probably off that so I'll keep the dot end I'll keep the length chain maybe not all of them maybe just the first open AI but now I'm using CSV so I'm going to look at my CSV file and try to query do a q a against my CSV file so to do that I'm gonna bring in the load machine dot document lotus.csvloader import CSV loader all right and then I'm gonna just load all my environments and I'm gonna have to specify my file path so let's do my file path let's say Academy this is the name of my folder again remember that and then I'm gonna say this is academy.csv file because that's what I name it then I will just have to specify my loader and my loader is just from the class instance here CSV loader so I'm going to just create a instance of that pass in my file path done now once I have my four path I loaded that I haven't really read the data in there so let me write that in I'm going to collect data I'm going to say loader.load and then if you want to you could just print the data I guess all right if you were to run all of this Let Me Clear My screen let me say python demo3.py you should be able to see all the CSV 4 okay so all that's printed again I would have to maybe censor this out because I can't show you the email addresses of the customers but it turns out that link chain has something even easier for for you and it uses this concept for the agents I'm gonna make a video on it in the future about what agents are for now let's just go ahead and just import that so I'm going to say from linkching dot agents import quick CSV agent and now I no longer need the CSV loader and I will no longer need this loader as well so I'm going to delete that delete this as well all right so let's clean that up as well let's create the llm and this is the same as the last code example that you see here so basically this line just copy that and then let's create an agent so agent equals to create CSV agent and what do we need we need the LM we need the file path we need we said we're both equal to true because we wanna again we want to see the method we want to see breakdown right now I want to mention that the the if you look at the documentation this agent calls the pandas data frame agent under the hood right so this is actually using pandas data frame so what it means is that if uh when you use this agent it calls the panel's data frame which in turns calls the python agent which then execute the llm generated python code so it's going to have to generate the python code for you then you have to execute it for you and then give you the answers for it um this could be potentially bad if the LM generator pattern code is actually harmful so you need to know what the code is doing set the verbose set the set the verbose flag to be true and see this process unfold right and make sure that the python code is not harmful and by harmful what I really mean is that you don't want this python code to be assessing different things on your folder on your directory or looking at your systems looking at your Os or maybe injecting some kind of code or maybe opening up some sort of path you don't want you to do that so you want to assemble both to true so that you can see what it actually is doing under the hood so I like to set it to be versus true I guess that you know maybe if it doesn't matter to you too much um and you have a lot of faith in our future overload then I guess you can set it off to be false right so let's go ahead and ask a few questions agent run so let's ask maybe three four questions as well same thing as the same drill so the first question what percentage of the respondents are students versus professionals that's something that matters to us because we run a coding bootcamp we care about how many of our students are actually students from how many of our students are actually working professionals just trying to upskill themselves so let's ask that first question then the second question would be something like this we could be something at least the top three devices um that the responders use to submit their responses and then finally let's do a third one um and that'll be it so let's say consider IOS and Android this is a bit gonna be a bit harder this is going to be a bit trickier why is that because if you look at the data set in here in the CSV file there is a column here that actually says that you know what is the os's and it has like mac and IOS and Android and stuff so I wanted to basically be smart enough to consider iOS Android as mobile devices and then I want to say something like what are the percentages or what are the percentage what is the percentage not what are what is the percentage of respondents that discovered us through social media uh submitting this from the mobile device and that's about it so we have all the three questions we're good to go we should be able to just run it so let's run over say Python demo3.py and let's see the agent executed chain so the first question and and you know what's really interesting is that it breaks down the thoughts for you so it says the first one right what percentage of the responders or students versus professionals is this I need to know how many respondents or students and how many are professionals this is a thought right so it does an action the action is listed here as well and then it does the input remember how I said that they actually just delegated to python uh the pandas and then pandas will then generate the right python code to just query that right to do generate answers so pandas would do this if you're familiar pandas you see this um you'll probably know how to write this yourself it says look at the occupation and then generate the value cons for example and then here you get the students 25 to 575 so that's pretty good pretty good pretty smart second question list the top three devices that respondents use to submit their responses so again the thought process I need to find out which devices are most commonly used so it start off with that it says take the device column do a value console find the top uh top three and then it gets the answer phone and tablet so what are the top three devices it just gives you two just one and tablet so oh no that's actually desktop as well so yeah desktop 185 phone 17 and tablet 5. so that's those are the top three devices all right cool now the last question consider IOS and Android as mobile devices what is the percentage of respondents that discuss that discovered us through social media submitting this from a mobile device and again it has to do this this is a bit more trickier because it needs to figure out an image to use things like it's in it needs to use things like uh you know maybe a bit of subsetting and it needs to basically change this condition and condition then it finds a shape divided by all of that and then the thought is that okay I know the final answer and that's 18.18 and that's it all right so this is kind of a quick tutorial to how you would use land chain to query your data do q a against your own CSV against your own database but doesn't matter if it's sqlite doesn't matter if it's a remote databases or if your grandma asks you to help with squaring some home recipe you know against the CSV file then now you know how to do it using open Ai and using land chain and that sums up the part two of this series in part three we're gonna go into maybe PDF we're going to look into a few more use cases common use cases but I want to keep the video short um I hope you learned something new and I'll see you again in the next video thank you bye
Info
Channel: Samuel Chan
Views: 35,122
Rating: undefined out of 5
Keywords:
Id: Fz0WJWzfNPI
Channel Id: undefined
Length: 19min 30sec (1170 seconds)
Published: Tue Apr 11 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.