Chat and RAG with Tabular Databases Using Knowledge Graph and LLM Agents

Captions
Hello everyone, and welcome to another video from the LLM Zero-to-Hundred series. As I described in the previous video, this time I want to focus on knowledge graphs, and more specifically I want to show you how to perform Q&A and RAG on tabular data using a knowledge graph. First I will tell you why a knowledge graph should be considered as one of your possible solutions when you are dealing with tabular data. Then I will show the series schema — as you know, this is a sub-series of the LLM Zero-to-Hundred series that we started a couple of months ago. Then I will go through knowledge graph fundamentals: nodes, relationships, and the knowledge we need in order to construct a knowledge graph from our datasets. After that I will explain how to construct the knowledge graph, show the schema of the chatbot we want to design in this video, go through the knowledge graph agent we want to use from LangChain, and then cover the RAG approach — the approach we will use to perform vector search on our tabular data using a graph database. Then I will cover two projects. The first one I developed specifically for this video: we take our tabular datasets, construct the knowledge graph, populate our graph database, and start interacting with it. At the end of the video I will cover a project from Microsoft in which they developed a medical chatbot from unstructured data — a fantastic project that gives you a much broader vision of how you can use knowledge graphs and what the use cases are for this specific approach.

This slide helps you understand when a knowledge graph can be a better choice than a RAG pipeline. First of all, a knowledge graph is suitable for both structured and unstructured data, and in this video we will see both scenarios in two different projects: for the structured data we build a chatbot ourselves, and for the unstructured data I will quickly cover the Microsoft project to show how it works. Next, domain-specific applications: a knowledge graph is very powerful especially if the design — the knowledge you want to encode from your dataset — is already known. For instance, if you have multiple tabular datasets and you know that specific relationships exist within or among them, that knowledge helps you construct a much more effective knowledge graph and leverage the knowledge shared among those databases, giving you a much more powerful chatbot than a RAG pipeline built on top of those tabular datasets. A knowledge graph also brings explainability and traceability to the table: it allows us to ask questions that can only be answered by combining multiple knowledge points within a single database, or even across multiple databases — which is not accessible with normal RAG and direct Q&A — and you can trace exactly where a specific answer came from. So what are the use cases for knowledge graphs? They can have many, but just to give you some idea: the first is, for instance, a chatbot over unstructured medical reports from multiple doctors or hospitals, where you can ask how many patients were diagnosed with a specific disease, or these types of questions that, again, you cannot answer with RAG.
questions that again you cannot use rag to get the answer for them the second scenario is a chatbot for accessing complex relationships between tabular databases but on the other side rag also has some advantages uh compared to Knowledge Graph so rag can be implemented very easily but you just have to keep it in mind that you are just relying on some sort of a semantic search between the question and the content inside your vector databases and that would be it so for instance you want to let's make it simple you want to get the average uh age within your CSP file so you have a column called age and you want to to get the average rag cannot give you that answer but leveraging llm agents and leveraging these type of approaches allow you to interact with your databases and ask exactly these type of questions so Knowledge Graph compared to rag is not as generic as a conventional rag approaches it requires more technical knowledge the slower implementation So based on experience I Knowledge Graph uh to take much more time to develop and to test and the final advantage of rag is in my opinion rag is right now more mature than knowledge graph so deciding whether to use a knowledge graph or conventional rag depends on your data's characteristics your applications ESP specific needs and considerations such as scalability flexibility and explainability so so again one thing to mention about the explainability in case you have a sensitive data again just like the one that I mentioned the medical reports this is a sensitive data and the the answers that you get you are getting that from that data is very important explainability will be super important to those type of pipelines and you can also consider to use a combination of both if it suits your use case so I'm sure I'm going to be asked after this video like can we combine Knowledge Graph and rag yes you can combine them both you can leverage Knowledge Graph rag you can even use SQL agents in order to connect and communicate uh with your databases at the end of the day everything depends on your project specification and what you want to do uh with your project one very important note is the llm model matters to be more specific there are two features of llms that are very important to me the first one is the context length and the the second one is the cipher query knowledge of your llm so when we are connecting these type of Agents like the SQL agent that we saw in the previous video or the knowledge graph agent that we are going to use in this video to our databases there is a great chance that the amount of data that is going to be processed using that that large language model is going to be huge so the context length and the understanding of your large language model from the data that is processing is super important and the second one is the knowledge for Cipher query because again to my personal experience I use GPT 3.5 I use different type of actually GPT 3.5 I use GPT 4 and as I started to use the most recent models of open AI I realized that the performance of the large language model is getting better and better this is the series schema that I I started in the previous video so as you know in the previous video we covered the blue pipelines and we used a SQL agent to start to start interacting with our SQL database in this video I want to go through the green pipeline so here I want to use Knowledge Graph agent to perform Q&A on tabular databases and I will also show you the second approach of uh using rag with tabular databases 
And then the second part — the last part of the video, which is much shorter — where I quickly cover the medical chatbot project.

Here I want to go through some fundamentals you need in order to understand what a knowledge graph is and how to construct one. A knowledge graph is a database that stores information in nodes and relationships. Node properties are called keys and values, and nodes can also have labels that group them together; relationships have a direction, a type, and properties. Let's go through this simple image. Here we have two nodes: one is a person, the other is a YouTube video. The relationship these two have — and keep in mind that a relationship connects two nodes — is that the person produces the YouTube video. That person can have a key such as name with a value such as Farzad, and you can also see the direction, type, and properties of the relationship. This is how we represent that relationship in text: the nodes go inside parentheses and the relationships go inside brackets. Now let's make the graph a little more complex: I have two Person nodes, and both have a relationship with the YouTube video — one produces it and the other watches it. In the text form I can write the relationships and nodes like this: a person produces the YouTube video, with the arrow giving the direction of the relationship, and on the other side a person watches the YouTube video. Again, each of these nodes and relationships can have its own keys and values — its properties, let's say. For instance, this node can have the key name with the value Farzad; for the other one, say the key is name and the value is Mike. The relationship itself can have the property period with the value weekly. Properties can be specific to a relationship or a node; they don't need to match. For example, here I can have a property year with the value 2024, and for the YouTube video itself I can have the topic — let's say AI projects.
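Just to make that notation concrete, here is a minimal sketch of this slide as runnable code, using the LangChain Neo4j wrapper that shows up later in the video — the labels, property keys, and credentials are my own placeholders, not the project's:

```python
from langchain_community.graphs import Neo4jGraph

# Connect to a running local Neo4j instance (credentials are placeholders).
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="12345678")

# Nodes go in parentheses, relationships in brackets, and the arrow gives
# the direction: (:Person)-[:PRODUCES]->(:YouTubeVideo).
graph.query("""
MERGE (f:Person {name: 'Farzad'})
MERGE (m:Person {name: 'Mike'})
MERGE (v:YouTubeVideo {topic: 'AI projects'})
MERGE (f)-[:PRODUCES {period: 'weekly'}]->(v)
MERGE (m)-[:WATCHES {year: 2024}]->(v)
""")

# Read the pattern back: who produces which video?
print(graph.query("MATCH (p:Person)-[:PRODUCES]->(v:YouTubeVideo) RETURN p.name, v.topic"))
```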
So, how do you construct a knowledge graph? The first option is to already have the knowledge of how to construct it yourself. If you already know how to build the knowledge graph, that is the ideal scenario for this technique. The most highlighted positive is a consistent graph structure — I will explain what I mean by consistent in a moment — and the negative is that it is not easy to implement: you need domain expertise to construct the graph. The second approach is to use large language models to build the knowledge graph for you. The advantage is that it is easy to use; the disadvantage is inconsistency in the knowledge graph structure. Now let me explain what I mean by consistency and inconsistency. As you know, large language models are non-deterministic: when you pass them an input, the output is generated based on the probability of the next token, so if you pass the same input multiple times, there is a chance the output differs between runs. So if you pass a text to a large language model and ask it to construct a knowledge graph, there is a great chance that passing the same text multiple times produces different knowledge graphs in the output. That is what I mean by inconsistency, and it is probably the biggest downside of using large language models to build the graph from the ground up — the most important thing to keep in mind. One very famous tool you can explore here is called LLMGraphTransformer — there is a lot of information about it on the internet, so feel free to search, and LangChain supports it, so you can use LangChain to run this transformer; it is just a great tool to use.

However, there is another approach I want to highlight: a hybrid method. The way I personally like to construct knowledge graphs is not to fully rely on large language models, but to use them to come up with a good structure for my datasets and databases. If I don't know exactly what knowledge graph I am looking for, I start communicating with the large language model until we arrive at a structure that suits my project. As soon as I have that structure, I can either construct the knowledge graph myself, or instruct a large language model to construct that specific graph structure I just developed with the model's help. Again, the whole point is consistency. The pros of the hybrid method are a consistent graph structure and needing less expertise to design the structure, since you are essentially consulting with a large language model; the cons are that it still requires some basic knowledge of coding and graph implementation.
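As a rough sketch of that idea — the import path is from the LangChain versions I know, and the allowed node and relationship lists are my placeholders — constraining the transformer's schema is exactly how the hybrid method buys back consistency:

```python
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document

llm = ChatOpenAI(model="gpt-4", temperature=0)

# The hybrid idea: instead of letting the LLM invent a schema on every run,
# pin down the node labels and relationship types you agreed on with it.
transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Movie", "Genre"],
    allowed_relationships=["ACTED_IN", "DIRECTED", "IN_GENRE"],
)

docs = [Document(page_content="Tom Hanks acted in Toy Story, an animated comedy.")]
graph_documents = transformer.convert_to_graph_documents(docs)
print(graph_documents[0].nodes, graph_documents[0].relationships)
```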
This is the schema of the project we want to develop in this video. The green pipeline represents the data preparation for the graph database: as I mentioned, I am going to use tabular data, and the pipeline can handle CSV and Excel files. From that data we construct the knowledge graph and populate our graph database. The yellow pipeline is for interacting with the graph agent and the graph database: as soon as the user asks a question, it is passed to a graph agent, which creates the Cypher query, uses it to query our database, and gets the results — up to this point everything is automated and we don't need to do anything — and what we get at the end is the agent's final answer to our question. The blue pipeline is for interacting with the embedding model, the LLM, and the graph database: as I mentioned in the previous video, I want to show you a second way of performing RAG with tabular data. In this video I will show you how to convert one of the columns into vectors, get the embeddings, and store them in a graph database. When the user asks a question, it is passed to an embedding model to get the question embeddings; we then perform a vector search using those embeddings against the vector index in our graph database, retrieve the matching content, and pass those results, along with the system role and the user's question, to a large language model to get the final answer.

I mentioned that I am going to use a knowledge graph agent in this video. In the previous video I showed you how to use a SQL agent to interact with a SQL database, and I explained the difference between these agents in a Q&A pipeline versus a RAG pipeline. An agent takes the question we pass to it; technically the first step of the agent is an LLM that converts the question into a query our database understands. In the previous video the database was a SQL database, so it converted the question into a SQL query, ran that query on the database to get results, and passed those results to a large language model to produce the final answer. I brought the previous agent's diagram here because, if you watched the previous video, it will be much easier to understand how the graph agent works: the flow is exactly identical. The only difference is that instead of a SQL database we have a graph database, and instead of SQL queries the agent converts the question into a Cypher query. The question is passed to an LLM, a Cypher query is generated and executed on the graph database, and the result is passed to a large language model to get the final answer. I am going to use a LangChain agent for this video as well, just like the previous one — I find these agents a great head start; even if you want to design your own custom agents, it is good to keep these on the side to compare performance and see how they do on your database. For the second part of the pipeline, interacting with the vector index in our graph database, I will show you how to take one of the columns — which we actually create synthetically in this video — get the embeddings, store them in the graph database, and then start performing RAG on them.

Okay, so to construct the knowledge graph for our chatbot I want to use a movie dataset. It contains some information about different movies: the movie ID, the release date, the title, the actors, the director, and the genre of each movie, along with the IMDb rating. And this is the knowledge graph we want to construct from it. Our dataset will have a Person node with two different types of relationships to our Movie node. As you can see, Movie is the central node — somehow the dot that connects all the other dots together. There is no limit on how big your knowledge graph can be; this is just a demo of how you can construct a knowledge graph, effectively from two different datasets, because as you can see I also have some nodes that do not exist in our dataset, so I am going to create them synthetically. Our Person node connects to the Movie node either through the DIRECTED relationship or through the ACTED_IN relationship; those are the two columns that feed the Person node, and we are going to break down each column and extract each individual name from it.
So it is not going to be stored the way we have it here; instead it becomes, for instance, Tim Allen is an actor in, let's say, the Toy Story movie. Our movies also have a relationship with genre — the movie is IN_GENRE — and we connect the Genre node through that relationship. The small tables next to each node represent that node's properties. Our Person node has one property, name, which points to any of the names in the director or actors columns; depending on which name was selected, it gets a different relationship with our Movie node — if, for instance, Forest Whitaker was selected, that name connects to the movie through the DIRECTED relationship. The Genre node again has one property, name, which is simply the movie's genre. Our SimilarMovie node has the property name, the name of the movie; our Location node again has one property, name. The Movie node has multiple properties: the ID, as you can see here, plus the release date, the tagline, the title, and the IMDb rating. Tagline is again a column I am going to create synthetically, but title, IMDb rating, and released you can also see in our dataset. So this is the knowledge graph we want to construct from this dataset, and I will show you how it is done in a few minutes.
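Continuing with the graph connection from the earlier sketch, here is that target schema written out for one movie as concrete Cypher patterns — DIRECTED, ACTED_IN, and IN_GENRE are named in the video, while SIMILAR_TO and TAKEN_IN are placeholder names I chose for the synthetic nodes:

```python
# One movie's worth of the target schema, as concrete Cypher patterns.
# Property values below are illustrative; the tagline is left elided.
example = """
MERGE (m:Movie {id: 1, title: 'Toy Story', released: date('1995-11-22'),
                tagline: '...', imdbRating: 8.3})
MERGE (:Person {name: 'John Lasseter'})-[:DIRECTED]->(m)
MERGE (:Person {name: 'Tim Allen'})-[:ACTED_IN]->(m)
MERGE (m)-[:IN_GENRE]->(:Genre {name: 'Animation'})
MERGE (m)-[:SIMILAR_TO]->(:SimilarMovie {name: 'Finding Nemo'})
MERGE (m)-[:TAKEN_IN]->(:Location {name: 'United States'})
"""
graph.query(example)
```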
This is the second part of the video: as I mentioned, I am going to cover a Microsoft project, and there are two main things to see in it. The first is how to prompt an LLM to construct the knowledge graph for us — in the first part of the video, this one, I construct the knowledge graph myself, but in the second part we let a large language model construct it. What is the use case? If you have unstructured text data from which it is just not easy to create a knowledge graph, you can leverage a large language model to do it for you. The second thing we can learn from that project is how to create a knowledge graph from unstructured text — exactly what I was talking about.

All right, now I want to show you how to implement everything we covered in the presentation. But first let me show you this chatbot — the project we want to develop — just to give you an idea of what we are building while we go through the notebooks. This is the chatbot we design in this project; I have already asked it a couple of questions, and you can see it answered all of them. The chatbot has three functionalities: Q&A with the graph database using an improved agent, Q&A with the graph database using a simple agent, and finally RAG with the graph database. Don't worry about "improved" and "simple" for now — I will explain them when we go through the notebooks — just keep this chatbot in mind as the big picture of what we want to accomplish while we prepare the database.

To prepare the chatbot, the first step is to prepare the graph database. I am going to use Neo4j, and there are two ways to create graph databases with it: remotely, using their web service, or with the desktop version. This is what you see when you open Neo4j Desktop for the first time — they also provide an example project. To make the graph database work there are a couple of steps to go through, including some configuration changes we have to apply manually. The README file of the project lists all the steps necessary to make the vector database work, but in this video I want to go through them one by one so you can actually see how it works — the first time it can be a little confusing, but once you have gone through it a few times everything becomes much clearer. In the execution section I wrote: create and start a graph database in Neo4j, remotely or using the desktop app; upgrade your graph to at least version 5.17; and then install the plugins. Let's go through the first two steps. I have already opened the desktop app; I click New to create a new project and rename it — let's just call it "sample movie". In the next step, click Add and add a local database; you can also use this option to connect your database if you created it remotely using their web service, but I want to create a local one. For the name of the graph database, let's just keep Graph DBMS as it is; I will give it a password — a simple one for this video. There is a great chance that when you create your database for the first time it defaults to version 5.13, I assume, so make sure you choose at least 5.17. The reason is that we want to use the GenAI plugin on the database to perform RAG and the vector search, and as far as I have explored, 5.17 is the minimum version that allows that — on earlier versions you may not be able to perform the vector search. It is already on 5.17 for me, so I just click Create.

All right, my graph database was created. The next step — step three — is to install two plugins. Click on the graph database itself and you will see its details: version, edition, and status (you can also reset the password in case you forgot it), and here you see the plugins. I need two plugins for my graph database: first install this one, and then install the Graph Data Science library. Perfect, my plugins are ready. The next step says to modify the Neo4j config file, and these are the modifications we need to apply — I will explain them in a moment, but first let me show you how to find the file. You can either click on the three dots and go to Settings — that is the config file we want to modify — or choose Open folder → Configuration, which automatically opens the folder on your hard drive containing the configuration files for your graph database. Here you see neo4j.conf; this is the file we want to modify, so let me open it right here. Perfect, this is my file, and these are the modifications I want to apply. The first one: comment out server.directories.import=import.
That is exactly here, right at the top, so I comment it out. Next, uncomment dbms.security.auth_enabled=true — this one is already uncommented for me, which is great. The next step is to make sure this line is set to true, so let me copy the line — and of course it should be uncommented. What we are doing here is giving our graph database the permissions it needs to work with our CSV files locally: by default the graph database will not let you load and read local CSV files, but with these changes we are allowing it and granting the right permissions. In the next step, make sure this line is set to this one — if you pay attention there is one main difference, and that is genai: I just added genai to the plugins line of my graph database so we can use it for the vector search. So I comment this one out and add this line here. Next, make sure this line is set as follows — let me search; it is right here — I need to uncomment this line, or let me just copy the whole line and paste it here; again, this activates the GenAI plugin for my graph database, and these are the two settings required for that. Finally, copy the GenAI plugin from the products folder into the plugins folder. Up to this point, all the changes we needed to apply to the config file are done, so let me save with Ctrl+S, close the config, and go back to the desktop app. I go through the three dots again, Open folder, and you can use either the DBMS folder or Plugins — it doesn't matter; I will jump into my DBMS folder. This is the folder that contains everything for my database. As you can see, I have the products folder here, and inside it the Neo4j GenAI plugin, generated automatically — it is not something you need to download or find yourself; you can just find it in products. That is why I said to upgrade your graph database to at least version 5.17. I copy it, go to plugins, and paste the file there. These are all the steps you need to take to make your graph database functional with the tabular data we want to work with locally.
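For reference, here is roughly what the edited neo4j.conf lines end up looking like; the exact key names are in the project README, so treat this as a sketch of a typical Neo4j 5.x setup:

```
# Commented out so the database can read CSV files outside the default
# import directory (as described above):
#server.directories.import=import

dbms.security.auth_enabled=true

# Assumed key name for allowing local CSV reads; the README has the exact line:
dbms.security.allow_csv_import_from_file_urls=true

# genai.* added to the procedure settings so the GenAI plugin can run
# (keep whatever entries are already on these lines and append genai.*):
dbms.security.procedures.unrestricted=genai.*
dbms.security.procedures.allowlist=genai.*
```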
Also keep in mind that if you want to use the graph database with a different source of data — for instance, data sitting in Blob Storage on Azure or somewhere else — there might be other changes you need to apply to make it work; but for our datasets, which are located in the data folder of my project right here, this is going to be just fine, and we can now start interacting with our database. So that's it: the next step is to open the graph database in the desktop version and start it. Perfect — the graph database is now running, which technically means we can connect to it and start interacting with it. But right now our graph database is empty; there is no data in it, and we are going to populate it in a moment.

Next I want to go through all the necessary knowledge we need to design our project, but first a quick look at the project structure. Just like the previous projects, we have the configs folder with the config file inside, the data folder containing our data — the movie CSV, though I am actually going to download that dataset directly; you will see it in a moment — the explore folder containing all the notebooks I want to cover, and all the code for our chatbot and its backend in the source folder. When you open the explore folder you will see four notebooks: empty graph database, explore movie data, test connection with graph database, and test GPT with graph database. The last two simply let you verify, first, your connection to the graph database, and second, your connection to OpenAI — your GPT model and your embedding model. I am not going to run that one, but if you face an issue with OpenAI, come to this notebook and make sure you can establish a connection from your PC or laptop to the OpenAI models and start interacting with them. As you can see from these three green check marks, I can connect to my graph database without any issue, and from here I can start interacting with it.

Okay, now let's go into the movie RAG graph database folder. Inside you will see a subfolder for Q&A and RAG using OpenAI directly: as you know, I use Microsoft Azure to interact with the graph database, but you can also use OpenAI directly, so I provided some notebooks that use the OpenAI models directly — you just need to establish the OpenAI connection yourself. These are the five notebooks I want to go through to design our chatbot. The first notebook prepares and saves the movie data. This is the URL from which you can download the dataset, and for simplicity I am only loading the first 20 rows. Here is some more information — the number of rows and columns; you can see there are seven columns with these column names — and these are the titles of the movies in our dataset. Next I want to create some synthetic data. The whole point of using a knowledge graph is not only to construct it within a single database, but also to construct it across multiple datasets and databases. I don't have a second database, but this cell right here creates three separate columns along with their values synthetically, and I am just going to treat that as my second database. So imagine I have two datasets — one is the movie data and the other is this one right here — with a mutual column I can use to merge them, or rather to connect all the dots together; for simplicity, say this second dataset also has the original movie titles. For each movie I have a tagline, which is a short description of that movie; I have a location for each movie; and I also create a similar movie. Keep in mind that all of this data is synthetic — the descriptions, the country, or the similar movies might not actually be correct or relevant — but it will do the job and allow us to create the graph database I showed you in the presentation. I create these lists and add them to my data frame, so this is my data frame right now — again, imagine I just merged two different datasets. With the data frame ready, I save it; it is stored in the data folder as the movies CSV file — the file we just created together in this notebook.
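A condensed sketch of what this first notebook does, under my assumptions — the dataset URL is shown in the notebook (a placeholder here), the seven column names are from the video, and the synthetic values are stand-ins:

```python
import pandas as pd

# The real URL is shown in the notebook; placeholder here.
MOVIES_URL = "<movie-dataset-url-from-the-notebook>"

df = pd.read_csv(MOVIES_URL).head(20)  # keep only the first 20 rows for simplicity
print(df.shape)  # (20, 7): movieId, released, title, actors, director, genres, imdbRating

# Three synthetic columns, standing in for a second dataset that shares
# the movie title with the original one. Values here are placeholders.
df["tagline"] = ["A short description of " + t for t in df["title"]]
df["location"] = "United States"
df["similar_movie"] = df["title"].sample(frac=1, random_state=42).values

df.to_csv("data/movies.csv", index=False)  # filename assumed from the video
```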
In the next notebook I prepare the data from the CSV file and populate my graph database, so let's go through it step by step. The first cell just imports some of the libraries we want to use: I am going to use langchain_community, and from there load Neo4jGraph; I use pandas to load my CSV file and verify what we are doing; and I use pyprojroot's here to avoid absolute paths in the project, so you can easily use it across different operating systems. In the next step I connect to my graph database. Here is the description of the knowledge graph we want to create from this database — the same knowledge graph I showed you in the presentation — along with the image of it; I will explain the process while going through the code, but feel free to pause the video and read the description as well. Here I load the data frame and look at the column names: while developing this, I was looking at the columns and thinking about different ways to create the knowledge graph from them, which is why I have this cell. I am also going to use the path — the directory of the data frame — in the Cypher query that populates our graph database. So this is the knowledge graph we want to create, and this is the code that does it for us. Okay, now let me explain how we construct the knowledge graph from our CSV file; I will also print the top three rows of the CSV so it is easier to explain. Keep in mind that to interact with graph databases, and to be able to query them, a basic knowledge of Cypher is essential. I have provided a separate notebook, "query movie database with Cypher", with a bunch of different Cypher queries, just to give you a chance to familiarize yourself with this type of query and understand what happens inside them. In general, if you are familiar with pandas and with SQL queries, it is a very quick catch and you can easily understand what is happening — it takes a few minutes the first time, but after that it is super easy. Okay, let's go through each piece and see what is happening. First of all, to create the knowledge graph I load my CSV file, loop over each row, and take that row's values. This is how I load the CSV: movie_directory is a parameter I defined in my Cypher query, and I pass it this value, which is just the directory of my CSV file — note that it only accepts a string, so keep that in mind. I take the values of each row, store them in a variable called row, and access each column's value with, for instance, row.title — in the first row, row.title returns Toy Story.
This is exactly analogous to querying a pandas row. For each row, I create the nodes I had in mind — those five nodes — assign their properties and property values, and create the relationships between each node and the others it connects to. Let's go through it step by step. Using MERGE, you create a node and assign it a label: the first node is Movie, so I assign it the label Movie; this other node gets the label Person. To the Movie node I assign the values of all its properties — it has an ID, released, title, tagline, and IMDb rating — and again you can see how I pass the value for each property. If you need to convert any of the row's values, you can do it inside Cypher: for example, here I convert released, which is a string, to a date, and I convert the IMDb rating to a float. After creating the nodes, you also create the connections between a node and whatever node it needs to link to. For example, the Person node connects to Movie with two different relationships, either DIRECTED or ACTED_IN, and which relationship it gets depends on the column the name was extracted from: a name extracted from actors connects to the movie with ACTED_IN; a name extracted from director connects with DIRECTED. So for this Person node, which comes from the director column, I connect it to the movie using the DIRECTED relationship: the relationship starts from the person — p represents Person here — ends at the movie — m represents Movie — and the relationship between them is DIRECTED. There is only one last thing to mention: this whole query is one big for-loop over the rows of my data frame, but some of my columns may have multiple values stored in them — actors, director, and genres have this characteristic — and those values can easily be distinguished and separated using this separator right here. Since I want a node for each of those names, or each of those genres, I also loop over each column's values, extract each name, and create a separate node for it. That is what I am doing here: for example, for each actor in row.actors — where you can tell each one apart using the separator — create a node and create the relationship.
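Putting those pieces together, here is a condensed sketch of the construction step. The shape of the Cypher mirrors what the notebook describes — one big LOAD CSV loop, MERGE per node, type conversion inside Cypher, and split() over multi-value columns — but the column names, the '|' separator, and the property casing are my assumptions:

```python
from langchain_community.graphs import Neo4jGraph

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="12345678")

movie_query = """
LOAD CSV WITH HEADERS FROM $movie_directory AS row
MERGE (m:Movie {id: row.movieId})
SET m.released = date(row.released),        // string -> date inside Cypher
    m.title = row.title,
    m.tagline = row.tagline,
    m.imdbRating = toFloat(row.imdbRating)  // string -> float
FOREACH (director IN split(row.director, '|') |
    MERGE (p:Person {name: trim(director)})
    MERGE (p)-[:DIRECTED]->(m))
FOREACH (actor IN split(row.actors, '|') |
    MERGE (p:Person {name: trim(actor)})
    MERGE (p)-[:ACTED_IN]->(m))
FOREACH (genre IN split(row.genres, '|') |
    MERGE (g:Genre {name: trim(genre)})
    MERGE (m)-[:IN_GENRE]->(g))
"""

# The file location parameter only accepts a string, e.g. a file:/// URL.
graph.query(movie_query, params={"movie_directory": "file:///movies.csv"})
print(graph.query("MATCH (n) RETURN count(n) AS nodes"))
```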
We have nine columns, and as you just saw, we also break down each column's value when it contains multiple values. Let's run this cell, then print the schema — perfect: we can see all the relationships, the nodes, and the data types they contain. Let's print the number of nodes: 155. So up to this point we have created a graph database that contains all the information we had in mind, and we can easily start asking it questions — provided we know some Cypher. Right now, for instance, let me quickly ask it to find a Person whose name is Tom Hanks and return it. Here we go: there was indeed a Tom Hanks among the Person nodes in my graph database. So you can easily start interacting with your graph database, and it holds all the information we had in mind.

In the next step my goal is to create embeddings from the tagline column. As you can see, the tagline is essentially a description of each row — technically a description of each movie, since each row belongs to a separate movie. So I am going to create the embeddings of the taglines and store them in a vector index. Let's do it: load the libraries, load the data frame, load the OpenAI credentials you want to use with your embedding model — actually, I am going to use this one — then create an instance of Azure OpenAI, plus a function that takes a text and gives me its embedding: I use the client I just created to generate the embeddings and return the embedding of whatever text is passed to the function. Perfect. For instance, let's print a few of the taglines in the dataset — these are the taglines — and to generate the embeddings I just call that function in a for-loop over each tagline. Now let's see what we did: the number of vectors is 20, because there are 20 rows in the dataset, and the dimension of each vector is 1,536, which is the embedding dimension of this model. That matters because when we create the vector index in our graph database, we need to pass the dimension to it. And here are the top five values of one of the embeddings as an example. I take my data frame, create a new column called tagline embedding, and assign all those vectors to it — if you look here, each row of the tagline embedding column now contains a vector of 1,536 values. In the next step I create a vector index inside my graph database. For now this vector index will be empty. Again, I provided comments explaining what happens on each line, but essentially I am creating a vector index, assigning it the dimension we just saw, and assigning it a similarity function — cosine similarity, as the default. Next, let's print some information about the vector index we created; these are the descriptions of each line here. Keep in mind that the ID needs to be a unique value, and the population percent is the share of values inside your vector index relative to the matching nodes in your graph database — since I have a tagline for every row, my population percentage is 100%.
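Continuing with the df and graph from the sketches above, here is roughly how the embedding and vector index steps look; the index name, property name, and Azure deployment names are placeholders, while the 1,536 dimension and the cosine similarity function are from the video (the population step is sketched here too, and described next):

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-key>",
    api_version="2023-05-15",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

def get_embedding(text: str) -> list[float]:
    # "model" is the Azure deployment name of your embedding model (placeholder).
    response = client.embeddings.create(input=text, model="text-embedding-ada-002")
    return response.data[0].embedding

df["tagline_embedding"] = [get_embedding(t) for t in df["tagline"]]
print(len(df["tagline_embedding"][0]))  # 1536, the model's embedding dimension

# Create an (initially empty) vector index over Movie.taglineEmbedding.
graph.query("""
CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
FOR (m:Movie) ON (m.taglineEmbedding)
OPTIONS {indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
}}
""")

# Populate it: find each movie by its ID and attach the vector as a property.
for _, row in df.iterrows():
    graph.query(
        "MATCH (m:Movie {id: $movie_id}) SET m.taglineEmbedding = $embedding",
        params={"movie_id": row["movieId"], "embedding": row["tagline_embedding"]},
    )
```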
Next I want to populate my index. Just like the way we assigned properties in the earlier Cypher query — you saw how I assign this property, for example — I use the exact same kind of Cypher to assign a new value. I loop over my data frame — this is my data frame — take its rows one by one, extract the movie ID and the embedding, query my graph database to find the node with that same ID, and on that node I set a new property called tagline embedding, assigning it the vector from my data frame. So we just created a vector index, populated it, and now that vector index is queryable and we can perform vector search on it. Let's do one. If you print the schema again, you will see a new property added to Movie, called tagline embedding, and it is a list. Let's extract something randomly: with this Cypher query I go through the graph database, pull out any node that has a tagline where the tagline is not null, return it, and print the value — essentially picking one of those nodes at random. These are the two values it returns: "The adventure takes off", and the tagline embedding for that specific node. Perfect. It is safe to say we just went through the most difficult step of preparing this project, because constructing the knowledge graph and populating the graph database is, in my opinion, the hardest part.

Now, in the next step, I can start asking questions of my graph database using a LangChain graph agent. As a reminder — as I showed you in the chatbot — my goal in this notebook is to design two different agents: one I call the simple agent, which I will show you in a moment, and then we take that agent and improve it a little; I will show you why and how in this notebook. At the end we will have two agents, and I use both in the chatbot. First, let's load some of the libraries we want to use — LangChain and langchain_community — load my OpenAI credentials, create an instance of my large language model, and connect to my graph database. One note I want to add right here: all my notebooks actually have these two models in them. I experimented with two different GPT-3.5 versions, and in the presentation I mentioned that the LLM's capability at writing Cypher queries, and its context length, are very important. Between these two GPT models, I found that this one performs better than this version, GPT-3.5 Turbo 1106 — so even switching between different versions and releases of the same model gave me different performance. That is why I said in the presentation that your model matters; and if you use GPT-4, it is even more powerful than these two — to a great extent, actually, GPT-4 is superior, for two main reasons: its context length and its ability to write code, or let's say to write Cypher queries. I am going to use this model — the weakest one I found, based on my experience — just to give you the starting point: from here, as soon as you start improving your model, the chatbot simply performs better.
So, I have my connection to the graph database; let's print the graph schema again — perfect. I am going to use three questions to show you where a simple agent falls short and why we need to improve it. Question one: what was the cast of the Casino? For this, the agent simply needs to find Casino among the Movie nodes, follow the connections to the actors, and find everyone who played in that movie. Question two: what are the most common genres for movies released in 1995? It needs to go through the movies, filter the ones released in 1995, then find the genres of those specific movies. Question three: what are the similar movies to the ones that Tom Hanks acted in? The agent needs to find Tom Hanks among the actors, find all the movies he acted in, then follow those movies to their similar movies. These are the three questions I will use; let's go through the agents. Just like in the previous video, you can simply load one of the LangChain classes — in this case GraphCypherQAChain. It requires the graph, mainly because it extracts the graph schema from it; it requires the LLM; and I set verbose to true in order to see all the steps it goes through. Then you simply pass the query in a dictionary and invoke the chain. If you do, you will see the chain come back with the LLM's response. The first question was about the cast of the Casino, and these are the cast — the actors who were in Casino. Let's pass the other two questions. The second question: what are the most common genres for movies released in 1995? The answer: "I don't know the answer." The third question: what are the similar movies to the ones that Tom Hanks acted in? The answer: "I don't know the answer." This is why I call it the simple agent — you literally run two lines of code, pass it the graph and the LLM, and you have an agent — but it is clearly not powerful enough to answer questions two and three.
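For reference, the simple agent really is just those couple of lines; a minimal sketch with the GraphCypherQAChain class named above:

```python
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# The chain reads the schema from `graph`, asks the LLM to write a Cypher
# query, runs it, and asks the LLM again to phrase the final answer.
# Note: newer LangChain versions also require allow_dangerous_requests=True.
chain = GraphCypherQAChain.from_llm(graph=graph, llm=llm, verbose=True)

print(chain.invoke({"query": "What was the cast of the Casino?"}))
print(chain.invoke({"query": "What are the most common genres for movies released in 1995?"}))
```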
From here, my main goal is to improve the agent so it can answer the other two questions as well. To give you more context: if I use the other GPT-3.5, that model can answer question one and also question two, but it still cannot answer question three. So just by switching the model we get answers to two of the questions and only need to improve the agent for the third — again, a very important thing to keep in mind: use a powerful large language model in the backend, so you get the full power and the best performance. To make it simpler for you, in my source utils I have this module, import_chain.py; I modularized the improved agent there, which makes it easier to understand the big picture, so let me open it right here. To create the agent, these are the steps I go through — let's scroll down a little. The improved agent contains four main steps: detect the entities in the user input; match those entities to the database — it extracts the entities and also searches to find out where those entities are located in the database; create a custom Cypher prompt that takes the entity-mapping information, along with the schema and the user's question, to construct the Cypher statement — as you can see, we are helping the model understand where it can access those entities in the graph database; and finally generate the answer and give it to us. So the steps we go through are: prepare the entity chain, prepare a Cypher prompt using the detected entities, prepare the Cypher response, prepare the response prompt, and finally assemble a chain from all the previous steps — one that not only sees the user's question but also has access to a Cypher query that can locate those entities in the database. This is very important: these steps give us a much more powerful agent than the one we just saw. Let me close this and go back to the notebook. The first step is to detect the entities. I am not going to walk through every line of the code — that would make the video very long — so I will just skim through it quickly; feel free to go through it yourself, I added a lot of comments and descriptions, and you can also read the LangChain documentation if you want to dig deeper and understand exactly what is happening. For this we need a class called Entities, plus these two prompts, which essentially tell the LLM what to extract: person, movies, and years from the text. You can add or remove entities to match your own dataset — the years one is something I used for personal exploration; for the movie dataset we only need person and movies, so let's extract those two. With the prompt prepared, I put everything into create_structured_output_chain, passing the class we created, the LLM, and the prompt; and as soon as I pass the questions to this chain, you will see it extract the entities within them.
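Here is a condensed sketch of that entity-detection pattern; the video uses create_structured_output_chain, and with_structured_output is the newer equivalent I am using here, with the prompt wording approximated from the description:

```python
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate

class Entities(BaseModel):
    """Identifying information about entities in the question."""
    names: list[str] = Field(description="All the person or movie names appearing in the text")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are extracting person and movie names from the text."),
    ("human", "Use the given format to extract information from: {question}"),
])

# create_structured_output_chain in the video; with_structured_output here.
entity_chain = prompt | llm.with_structured_output(Entities)

print(entity_chain.invoke({"question": "What are the similar movies to the ones that Tom Hanks acted in?"}))
# -> Entities(names=['Tom Hanks'])
```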
In the next step I can tell my agent what to do with the entities it extracted from the question. We extracted person and movies, so I tell the agent: match the Person and Movie nodes, store them in a variable p, and check whether p.name contains the value or p.title contains the value — since the names we extracted are people or movies, they can be found in one of those two places — and if you find them, tell us where. This function wraps everything together; let's create it and run it. We had two questions the simple agent failed on, questions two and three. As you can see, it found the location of Tom Hanks among our nodes: it says Tom Hanks maps to the Tom Hanks Person in the database. For the second question, since a year was involved, it returned nothing, because I did not include the year in this query. I did experiment with the year personally, trying to teach the agent how to also extract years from my database, but it made things a little more complicated, so I removed it for this step. I strongly recommend you do what I did: start playing around with these values and this database, and keep teaching the agent more and more until it can reliably extract the information you are looking for. Another important note: our search currently relies on the correct spelling of the entities. To make it more reliable, apply fuzzy search, so that even with a misspelling in your entities, the Cypher query can still retrieve the locations of the nodes and properties those entities refer to. In the next step I take the knowledge we just found — Tom Hanks maps to the Tom Hanks Person in the database — create a Cypher template, and inform the agent where to look for those specific entities. Again, I won't walk through every step, but the Cypher prompt generated from that information now looks like this: passing the question in, it includes the question, the entity, and the place where that entity lives — a Person — which lets the agent easily find that name and start extracting information about it. Finally, all we need to do is wrap everything up and hand the generated Cypher query to our agent, so it can do everything for us. So let's wrap it up; this is our agent. Let's ask questions one, two, and three. Interesting — the agent was able to answer all three questions, even though we saw that the mapping process failed to extract the information for the second question. That was an interesting observation: in general, introducing that extra information somehow allowed the agent to answer the second question as well. That is actually why I removed the year from my Cypher query here — I wanted to show you that even without handing the agent that exact piece of information, it can still work out the answer to the second question. So at this point we have an agent that can answer everything we are looking for. And here is one more question — how many of the movies have the Action genre? There are five, and when we go through the chatbot in a moment you will see that all these answers are correct.
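And a sketch of the two steps just described — mapping entities to the database, and the Cypher-generation prompt. The CONTAINS match is what makes correct spelling matter (swapping in a fuzzy match is the improvement suggested above), and the exact prompt wording is my approximation:

```python
match_query = """
MATCH (p:Person|Movie)
WHERE p.name CONTAINS $value OR p.title CONTAINS $value
RETURN coalesce(p.name, p.title) AS result, labels(p)[0] AS type
LIMIT 1
"""

def map_to_database(entities: Entities) -> str:
    # Turn each detected entity into a line like
    # "Tom Hanks maps to Tom Hanks Person in database".
    result = ""
    for entity in entities.names:
        response = graph.query(match_query, params={"value": entity})
        if response:
            result += f"{entity} maps to {response[0]['result']} {response[0]['type']} in database\n"
    return result

# The mapping is injected into the prompt that writes the Cypher query,
# alongside the graph schema and the user's question.
cypher_template = """Based on the Neo4j graph schema below, write a Cypher
query that would answer the user's question:
{schema}
Entities in the question map to the following database values:
{entities_list}
Question: {question}
Cypher query:"""
```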
Perfect. Now I want to give you an exercise, this question: from the movies that were taken in the United States, how many had the comedy genre? This is a question we are currently failing to answer. I am not going to explain why it fails — fix it, and open a pull request on the repository with an extra notebook that solves it; I will check your answers and merge the pull request if you are interested in working on it. I wanted to leave this here because, keep in mind, even with the current configuration and the improved agent, there are still scenarios and cases we cannot get an answer for — though even this question is solvable. The main concept I wanted to convey is that implementing a proper graph agent and getting it talking to your graph database takes more time than a RAG pipeline, and this is exactly why I told you at the beginning that knowing what knowledge graph you want to construct helps a lot: if you know what you are looking for in your database, then, just like with these questions, you can keep improving the agent and the whole system until you can extract the information you need. Perfect — that was the fourth notebook; let's jump into the final one.

All right, in this notebook I want to show you how to perform vector search on the vector index we created from the tagline column, whose embeddings we stored in the graph database. First, load the libraries, load the OpenAI credentials, create the embedding function again, load an instance of my Azure model, and connect to our graph database. First of all, the user asks a question, and we convert that question into embeddings. Next we perform the vector search using the embeddings we just created from the question, and for that you can use the db.index.vector.queryNodes procedure.
This procedure receives two parameter values: the question embedding and the top k — as you can see, I am retrieving the top three results, and the question embeddings are these here. Let's print the top three values. What was the question? "What movies are about crime?" — and we got Heat, Tom and Huck, and Balto, if I am pronouncing that correctly. The reason they were retrieved is their taglines: "A Los Angeles crime saga", "The original bad boys", and "Part dog, part wolf, all hero" — and you can also see the vector search scores for each result. In the next step I pass this result, along with the user's question and the system role, to my large language model: you can see how I prepare the prompt — the question, the results, and the system role — and I simply ask the model for the answer: "There are several movies about crime; some of them include these three." If you pay attention, the large language model is not able to apply its own judgment to the retrieved results: these three short sentences are just not enough for it to decide which movie is actually about crime. Maybe this one is not about crime — but how would it know? This one is clearly about crime, and this one sounds like it is — but you get the point: our model relies solely on the information of one single column to reach its conclusion and give me a result. So if you want to perform vector search over the information of one column, make sure that column contains enough information for the large language model to understand what is going on. If I instead ask "what movies are about love?" — let's just change the question — you can see that even then the first hit is the Los Angeles crime saga, the second movie actually does have something about love in it, and so on; the model tells me Heat, Grumpier Old Men, and Tom and Huck are the movies about love. So that's it — and in this cell I just wrap everything together so the whole process runs in one execution, which is exactly how I do it in the backend.
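As a sketch of that wrap-up cell — reusing get_embedding, graph, and the client from the earlier sketches; the index name matches the one created above, and the deployment name and system role text are placeholders:

```python
def rag_with_graphdb(question: str, top_k: int = 3) -> str:
    question_embedding = get_embedding(question)

    # Vector search against the tagline index; yields nodes plus scores.
    results = graph.query(
        """
        CALL db.index.vector.queryNodes('movie_tagline_embeddings', $top_k, $question_embedding)
        YIELD node, score
        RETURN node.title AS title, node.tagline AS tagline, score
        """,
        params={"top_k": top_k, "question_embedding": question_embedding},
    )

    # Pass the retrieved taglines, the user's question, and a system role
    # to the chat model for the final answer.
    prompt = f"Question: {question}\n\nSearch results: {results}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # Azure deployment name (placeholder)
        messages=[
            {"role": "system", "content": "Answer the question using only the search results."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

print(rag_with_graphdb("What movies are about love?"))
```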
Now let's run the chatbot and ask a few questions. This is how I run it; and in my explore folder I open the explore-movie-data notebook, where I have a couple of questions along with their answers taken from our data frame, so we can verify the chatbot's performance. The first question is "What movies did Powers Boothe act in?" Let's ask the improved agent: it gives us Sudden Death and Nixon, and the reference answer is indeed Sudden Death and Nixon. Let's quickly run the same search with the simple agent: it was also able to give us the answer. If you switch to RAG, however, it will not answer correctly, because the vector search runs on the values of the tagline column, and that single column gives the large language model no knowledge to answer from; as you can see, the model simply hallucinates, because it lacks the information it needs to reach the right conclusion.

I will skip the next few questions and ask this one: "What are the most common genres for movies released in 1995?" Let's check the simple agent first: "I'm sorry, I don't have the information to answer that question." Now the improved agent: perfect, the most common genres are comedy with 10, adventure with 6, action and romance with 5 each, and children with 4. Checking against the data frame: comedy 10, adventure 6, romance and action 5, children 4.

Let's ask another question we saw in the notebook: "What movies are similar to the ones Tom Hanks acted in?" Right now even our improved agent is not able to give me the answer... oh, now it did. As you can see, the performance is not as stable as you would like, and the main reason is the large language model itself; using a better model, a stronger GPT-3.5 variant or GPT-4, would definitely fix these small mistakes. The similar movie is Finding Nemo, and the reasoning is that the only movie Tom Hanks appears in is Toy Story, and Finding Nemo is listed as a movie similar to Toy Story.

Finally, let's ask the question I showed you in the notebook: "Of the movies that took place in the United States, how many had the action genre?" This is the one I left for you, so feel free to work on it and let me know if you can fix it. I expect to see an error, and indeed the chatbot cannot get the answer for this question; that is why I left it as an exercise.

So we have seen the performance of our chatbot and all the necessary steps to design it on top of our CSV database. You saw that depending on which information I want to extract from the database, I need to modify the agent and give it enough power to do the job; the sketch after this paragraph shows the kind of query the improved agent ultimately has to produce for the genre question.
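Answering the genre question requires the agent to generate an aggregation query over the graph. Here is a hedged example of what that Cypher could look like, assuming a schema with (:Movie)-[:IN_GENRE]->(:Genre) and a released property on movies (the actual labels and property names in the project may differ), and reusing the driver from the earlier sketch:

```python
# Assumed schema: (:Movie {released})-[:IN_GENRE]->(:Genre {name}); the real
# graph built in the notebooks may use different labels and property names.
cypher = """
MATCH (m:Movie)-[:IN_GENRE]->(g:Genre)
WHERE m.released = 1995
RETURN g.name AS genre, count(m) AS movie_count
ORDER BY movie_count DESC
LIMIT 5
"""
with driver.session() as session:  # `driver` from the vector-search sketch
    for record in session.run(cypher):
        print(record["genre"], record["movie_count"])
```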
relationships. This is a perfect example of how you can ask GPT-4 to construct a knowledge graph, but note that the authors already knew exactly what they were looking for: these are the nodes they wanted the large language model to create from each piece of information, along with these relationship types, all of which you can also see in the knowledge graph on the right.

There are two key points I want to mention here. First, if you run the project with any model other than GPT-4 (any of the older models such as GPT-3.5), the project will not run for you; at some point you will hit an error, mainly because GPT-3.5 cannot handle the context length being passed to it together with the instructions in the prompt. Second, as soon as you run the project multiple times, you will see that GPT-4 comes up with varying knowledge-graph structures: perhaps at least 90 percent the same as what we asked it to create, but at some point it generates different nodes and different relationships as well. That was a very interesting observation while working with this project, because there is an element of inconsistency in GPT-4's performance, and you have to keep it in mind, not just for GPT-4 but for any large language model, especially when you are dealing with such a complicated prompt and this kind of unstructured data. This is a very complex task, and at least for now I believe GPT-4 is among the best models available for it. In the future, as new models gain longer context lengths and become more powerful, I expect these problems to become easier to solve, but at this point these are the two big challenges we are facing.

Now let me show you the code and quickly go through it. The readme file provides all the information you need to execute the project. The project uses Streamlit for the user interface: as soon as you have the graph database, you can start interacting and chatting with it through the UI, and as you can see they emphasize using at least GPT-4 32k, which is the model they used. The part I am personally interested in is the ingestion folder, specifically ingestion.ipynb, because this is where they construct the knowledge graph and store the information inside the graph database. From the top, they import all the necessary libraries; then there is an example of the input text, and another example from the data. First they clean the data and load the OpenAI credentials; then they create a function that takes the text (like the example here), the system role, and all the information the LLM requires, and generates the output for us. This is the function, this is the graph they are trying to construct, and this is the prompt I showed you a moment ago. They then pass a single example, the one I showed earlier, to the function just to test the model. From the result of the GPT model they extract the entities, that is the nodes, and the relationships, so that they can be stored in the graph database.
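Here is a simplified sketch of that extraction step; this is not the exact Microsoft code, and the node labels and relationship types below are illustrative stand-ins for the ones defined in the project's prompt. The idea is to ask GPT-4 to return the graph as JSON so it can be parsed programmatically:

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative system role; the project's real prompt defines its own node
# labels and relationship types for the medical domain.
SYSTEM_ROLE = """Extract a knowledge graph from the medical report below.
Return JSON with two keys:
  "nodes": a list of {"id": ..., "label": one of [Case, Person, Symptom, Disease]}
  "relationships": a list of {"source": ..., "type": ..., "target": ...}
Use only the listed labels, and infer relationship types such as HAS_SYMPTOM
or DIAGNOSED_WITH from the text."""

def extract_graph(report_text: str) -> dict:
    # The authors emphasize GPT-4 32k because of the prompt's context length.
    response = client.chat.completions.create(
        model="gpt-4-32k",
        messages=[
            {"role": "system", "content": SYSTEM_ROLE},
            {"role": "user", "content": report_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```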
That is what happens here, and then it is time for the data ingestion: they connect to the graph database and create all the necessary nodes they had in mind; these are the nodes, and you can see it says "for each unique case", "for each unique person", and so on. Then they write a for loop over all the data to run the GPT model on every record and store the information inside the graph database (I include a minimal sketch of that loop after this closing note). Keep in mind that, as they mention, running this code takes a couple of hours, but this is a fantastic project that you can build using the power of the knowledge graph and a powerful large language model. What this project then lets you do is extract information that can be very beneficial, for medical purposes for example: doctors can start exploring the database, going through different pieces of information, and learning very valuable things from that unstructured data. So that was the second project. Thank you very much for watching the video, and I hope to see you in the next one.
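As promised, here is a minimal sketch of the ingestion loop described above, reusing the extract_graph function and the driver from the earlier sketches; MERGE ensures each unique entity is created only once, mirroring the "for each unique case, for each unique person" logic in the notebook:

```python
def ingest_report(report_text: str) -> None:
    graph = extract_graph(report_text)
    with driver.session() as session:
        # MERGE creates a node only if one with the same id does not exist yet.
        # Labels cannot be parameterized in Cypher, so they are interpolated.
        for node in graph["nodes"]:
            session.run(
                f"MERGE (n:`{node['label']}` {{id: $id}})",
                id=node["id"],
            )
        # Relationship types cannot be parameterized either.
        for rel in graph["relationships"]:
            session.run(
                f"MATCH (a {{id: $source}}), (b {{id: $target}}) "
                f"MERGE (a)-[:`{rel['type']}`]->(b)",
                source=rel["source"],
                target=rel["target"],
            )

reports: list[str] = []  # the cleaned medical reports from the dataset
for report in reports:
    # One GPT-4 call per report is why the full ingestion takes a couple of hours.
    ingest_report(report)
```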
Info
Channel: AI RoundTable
Views: 9,690
Id: 3NP1llvtrbI
Length: 83min 34sec (5014 seconds)
Published: Sat May 25 2024