Langchain tutorial. Query a database with OpenAI's ChatGPT

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
I'm happy that LMS are taking over querying databases SQL is not that hard and way too many people were making way too much noise about why you should learn SQL llms thank you this video I'm gonna walk you through how to query your database in this case postgres database with Lang chain framework that will be using chat GPD under the hood this is not a deep dive for the Deep dive watch my next video we'll be working with a sample database that describes a music store which sells albums and it has information on albums artists playlist genres Etc and also has information on the financial side of things so the invoices and then on the customers and employees as well this is a good database because there are many relationships and we can see if if our LM chat GPT in this case manages to figure out that those connections need to be made we'll be going through this in a Jupiter notebook now thing you need to remember that length chain has many many ways to interact with your database and here I chose just two and I don't think it makes sense to cover all of them because the the framework itself is changing literally daily and what you see today will be outdated in a week but you will have a good understanding of what it can do and and this is solid base I encourage you to to watch this video till the end here we simply so all of those helpers they come from blank chain all of what we use in this tutorial here we create connect to our local postgres database and we can take a look at the data here so very simple albums artists genres playlists invoices many to many you know relationship with an invoice line Etc we connect to the database as always with Lang chain we create a chain and then the chain generates queries as an output so it takes a question in and generates a query out as I mentioned before this is not the the Deep dive so we won't be diving under the hood if you want to Deep dive watch my next video here I'll just show you what it can do so if we run that and we ask what is my data about it will print out a query which in chat Jeopardy's opinion should give you a good idea of what this database is about now let's see each it at it suggests us to run a query which basically lists all tables and all columns but it doesn't run it itself so this chain it doesn't rhyme run the the query itself you can run it basically lists of all the tables and columns not very helpful okay but now we can start asking more sophisticated questions like what artists have the highest and lowest numbers of purchases so to answer that question you would need to join multiple tables and let's see if chat Gypsy manages to do so okay so here we have two queries we see that each query does a join of artist one album two track three invoice line four which is pretty cool and then it does the grouping by the artist's name and does the aggregate on the number of purchases so it went through linking four tables and then drawing aggregate in this case in just number of Records it's pretty cool and then I asked for highs and lowest and it prints out two queries I think that is that is helpful if you're building some sort of a front end on top of this you can easily parse those queries out and now I want to walk you through a slightly more sophisticated chain that was a create SQL query chain now we'll we'll move into the agents and the agent will take care of a fetching the query and then be executing the query same thing here we we are being a bit more specific because this kgpt model shows good results if you're online chose good results on explaining the data working with SQL code here we connect to the same database and here we specify that we'll be using open AI functions it doesn't make much of a difference in this high level overview but you will see how that makes a difference in a deep dive okay let's instantiate that well let's ask the same question what is my data about now this time around it will give us most likely something more helpful okay based on the tables and the database the data seems to be about a music store so in my opinion that is much better than just returning this I like it how Lang chain under the hood took that query executed it passed the results on to chat GPT and then chat GPT reasoned from from what it saw from the tables and columns descriptions and said hey it looks like it's about a music store okay now let's let's move further what artists have the highest and the lowest numbers of purchases it takes a while to return the result and here we go we have Aaron Maiden is number one with 140 purchases and this other guy Adrian has only one purchase and now imagine you wanted to double check if this results you know are right so what you could do is you can ask chargpt to show you the queries that it used to to produce the results and I do it that by simply appending this short you know ask in the end of the prompt okay same result which is good because as I mentioned before the the outputs are non-deterministic now let's check if that is really so I'll copy over this query here for the highest number of purchases okay let's take it apart takes the artist table joins it to album makes sense then joins the track and invoice line it may seem weird that it joins it to track but again if we look at our ERD diagram invoice line joins to a track not the album so if you look at the invoice line here we basically charge per track on an album which is which is interesting I guess I guess that allows us to later on do statistics and analytics on a track level so all of that makes sense then we go invoice okay and it decided to calculate number of invoices on which an artist's name appears I think that is sensible cool let's run that so that's Iron Maiden and it appears in 140 invoices now the question I have is can it be that because we're joining two tracks can be the the Iron Maiden tracks appear on 114 voices so not the the artist itself let's double check that by um using the distinct keyword and ah and now it's very different now the actual result that that is correct the status U2 is the artist that appears on the most of invoices right and Iron Maiden it's artist whose tracks appear in most of the invoices so really here it's a space for for discussion but I think that our LM did a fair job on on finding an artist whose tracks appear in the most of the invoices right that's something you would you would usually specify inside of your organization how you calculate the most popular artists for example so it does those kinds of queries and now I'll go and ask something about the customers I'll ask hey how many customers are there so first and how many of them made multiple purchases there are 59 customers in total all of them made multiple purchases which is helpful I played around with it and if if you ask for specifics as in like how many what is what is the highest number of purchase per client it gives you that information we can go double check with our database let's look at the and the customer table so basically we'll go and count everything in the customer table yeah so 15 and customers that's correct also we can ask questions like hey which employees sold the most and how much it sold the least and how much cool so it gives us the result and we we can kind of double check how it did this but let's compare to the the employee who sold the most thing to remember here although the questions that they may look convincingly correct it's always good to double check which but if you're a developer for us developers it means that you need to go under the hood and to understand what exactly is happening so here for example which employees sold the most and how much it says that that's Michael and he sold 49 dollars worth of something and here the employee who sold the lease is Steve and The Tall Man of sales is 720 which doesn't make any sense that's exactly what I mean that you need to be double checking out of curiosity let's ask it to show us the the queried used in one case and in another so that's for so that's the for the employee who sold the most and this is for the employee who sold the least let's compare the two so this one was used for the employee who sold the most of this one was used to determine an employee who sell sold the least so in both table examples would do employee invoice invoice line and then we Group by oh interesting so in the first case it groups by employee ID which is correct and sums the the unit price multiplied by inverse invoice line quantity I just correct and here decides to group by employee ID and first name last name which is not ideal but let's see the results three years 49.62 here's 37.62 you remember the first time it returned us 700 something so that's peculiarial again results you get from LMS are not deterministic so you need to have all sorts of safeguards on your end to make sure that you know your consumer gets only the correct data here with the median um I played with asking child to show me the Midian cells and interestingly enough at first wasn't able to do so it said that hey postgres doesn't have a function to calculate a median and then I told it hey use use this function and then it calculated the median but then once again I asked it to calculate a median again and it told me that it wasn't able to do so because some column was missing oh there you go so you told me exactly this that there is no column related to sales but once again you cannot be sure of the results let's see if this one runs through so this one did run through and it did generate a query because I asked it to generate a query here show me the query used to generate the answer um physician in this case you didn't get confused by the absence of a column related to sales so the lesson here is yeah you should try tune later on I will record a separate video explaining what you should be aware of when using llms wank chain also in context of querying relational databases I hope that was helpful do let me know if you have any questions or feedback and I'll make sure to address it in the next videos take care [Music]
Info
Channel: Denys on Data
Views: 3,869
Rating: undefined out of 5
Keywords: data, data analytics, data engineering, langchain, llms
Id: e8HMHLr4Nk0
Channel Id: undefined
Length: 14min 56sec (896 seconds)
Published: Mon Sep 18 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.