How To Query OpenAI Embeddings - OpenAI | Langchain | Python | Pinecone

Video Statistics and Information

Captions
Hello everyone! In this video I will show you how you can query the OpenAI embeddings which we already saved in the previous video, so if you don't know how to save them, I would recommend you go back and check that video first. The one thing I am doing differently here is that rather than inserting, we will be using upsert, so if embeddings are already present for the same text, we will just overwrite them; this way we will not create any duplicate values. So without any further delay, let's quickly get started.

What we are doing here is this: I have taken a bunch of text files, and you can see on the left-hand side, under the store directory, I have four text files. Each one holds some text; we will be analyzing these files and then asking ChatGPT to answer a few questions for us. Now, we will not be feeding all of this text directly to ChatGPT; rather, we will be using the embeddings which are already saved in Pinecone. If you are not sure how we did that, I will quickly walk you through it. In this entire process we will not be writing every single piece of code from scratch: I have grabbed a few snippets from the OpenAI Cookbook as well as a few snippets from the Pinecone documentation, so we will plug them in directly and tweak them a bit.

The first thing we need to do is read these text files and push them into a DataFrame. I'm going to import the required packages first: pandas, because we are going to use DataFrames here; tiktoken, because we want to calculate the number of tokens associated with a given text; and openai. Let's quickly execute this. Next, I am going to use tiktoken's get_encoding and pass in "gpt2", which is the encoding we are using. Then we need to read all the text files placed under the store directory: for that I loop with for file in os.listdir(...), frame the complete path with an f-string (the files live under C:\test\store), open each file in read mode ("r"), and read its content with f.read(). Now that we have the file content, we need the tokens. The DataFrame we are constructing needs a few columns: one for the file name, one for the text inside that file, one telling us how many tokens each row of text consists of, one to hold the embeddings, and one for an ID so that Pinecone can handle the search efficiently. So let's first collect the number of tokens: we call encoding.encode(...) on the file content, and the total token count is nothing but the length of the result. Once this is done, we append each result (file name, file content, and total tokens) as one entry in a contents list, as in the sketch below.
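Here is a minimal sketch of this first step, assuming the four text files live under C:\test\store and that the API key shown is a placeholder for your own:

```python
import os

import pandas as pd
import tiktoken
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder, not a real key
STORE_DIR = r"C:\test\store"            # assumed location of the four text files

# The gpt2 encoding is used here, as in the video, to count tokens per file.
encoding = tiktoken.get_encoding("gpt2")

contents = []
for file in os.listdir(STORE_DIR):
    with open(os.path.join(STORE_DIR, file), "r") as f:
        file_content = f.read()
    total_tokens = len(encoding.encode(file_content))
    contents.append((file, file_content, total_tokens))
```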
Once everything is collected in the contents list, we construct the DataFrame out of it with pandas.DataFrame, passing in the contents we just gathered and labeling the columns: the first column is the file name, the next is the file content, and the last one we name tokens. Once this is done, we add another column for the embeddings; you can name it anything you want, vectors, embeddings, whatever. We calculate it with df.file_content.apply(...), because we want the embeddings only for the text written in the files, and we use a lambda. This is something we already did in the previous video, so I won't dwell on it: we pass input=x and the engine, which is the embedding engine we are going to use, text-embedding-ada-002 (though you can use whichever you want), and then we drill into ["data"][0]["embedding"], because that is how the output of the OpenAI embedding create function is structured. If you don't know the syntax, it is clearly written in the documentation, so you can check there to see the exact hierarchy of what resides under which element.

Now that we have the embedding column, the only thing we still need is a unique ID column on which Pinecone can index and search vectors. For that we generate a random UUID using uuid4 and put it in a new column named id. This is simple Python code you can write on your own, or grab from anywhere, not a big deal; and since we want it for the entire column, all the rows, we use a loop. At this stage, if you want, you can retain this data by saving it to a CSV file, say my_embeddings.csv. It is not required, but you can definitely save it. Let's quickly run it. I had initially referenced the wrong variable instead of the one holding our encoding, so after fixing that, it works. Here you can see that we started with just four text files and now have a DataFrame holding each file name, its related content, the embeddings, and the respective ID; see the sketch below. With this DataFrame, what we want is to extract the embeddings and push them into Pinecone, so let me go ahead and import a few more things.
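A sketch of the DataFrame construction under the same assumptions; the column and variable names mirror the video, and the pre-1.0 openai client is assumed, since that is what the video uses:

```python
from uuid import uuid4

# Three columns so far: where the text came from, the text itself, its token count.
df = pd.DataFrame(contents, columns=["file_name", "file_content", "tokens"])

# One embedding per row; the response nests the vector under data[0].embedding.
df["embeddings"] = df.file_content.apply(
    lambda x: openai.Embedding.create(
        input=x, engine="text-embedding-ada-002"
    )["data"][0]["embedding"]
)

# A random UUID per row gives Pinecone a unique id to index and search on.
df["id"] = [str(uuid4()) for _ in range(len(df))]

# Optional: persist the DataFrame locally.
df.to_csv("my_embeddings.csv", index=False)
```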
First we import Pinecone from langchain.vectorstores, and for OpenAI embeddings we import OpenAIEmbeddings from langchain.embeddings.openai; I will import the pinecone package as well, and I also need my configuration, because that is where I keep my OpenAI key. Once these initial things are imported, we can define our embedding model (I'm taking text-embedding-ada-002, but you can take the one of your choice) and the embeddings object, which calls OpenAIEmbeddings and takes a few parameters such as the OpenAI API key, which I grab from my configuration. Let's run this; our initial setup is ready.

Next we need to deal with Pinecone. You can go to pinecone.io, create the index there, and then start working on this part, or, if you prefer, do it directly in code the way I am doing. We need to initialize the Pinecone database, and the first thing it expects is the API key; you can grab it from Pinecone, and if you don't know how, just check out my previous video. Then I set the environment, which for me is us-central1-gcp. Next we set the name of the index: I am calling it "homelessness" based on the data I have (this data is about homelessness and its causes and effects), but you can definitely name it based on your own choice. Once this is done, we check whether the index is already present, and if not, we create it: if the index name is not in pinecone.list_indexes(), we call create_index, passing the index name, the dimension (which is 1536; you can check this in the OpenAI documentation as well), and the metric, where you can use dot product or cosine; I am going with cosine. Once this part is done, we are good with our setup, so let me quickly create an index object, because we will need it for working with Pinecone. Let's execute this; no more errors.

Now we need to upsert this entire thing into our Pinecone database. For that I have already written some lines of code, which you can find on the Pinecone documentation site itself; I just tweaked them a bit for my requirements. First there is a bit for the progress bar, which we will see in a while; then we process the data in batches of 100, with a simple calculation to grab each batch. We take all the embeddings present in our DataFrame and all the IDs present in our DataFrame, and once those are collected, we zip them together so that each insertion record is a single tuple consisting of the ID, the embedding, and the metadata, the metadata being nothing but a combination of the file name and the file content. We assign these to vectors and upsert them, so even if you run this code multiple times it will not create any trouble; that is exactly why we are using upsert. The whole setup and upsert loop is sketched below.
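Here is a sketch of the Pinecone setup and the batched upsert, assuming the pinecone-client v2 API and a langchain version contemporary with the video, with placeholder keys:

```python
import pinecone
from tqdm.auto import tqdm
from langchain.embeddings.openai import OpenAIEmbeddings

# Embedding object from the langchain setup; the key is a placeholder.
embeddings = OpenAIEmbeddings(openai_api_key="YOUR_OPENAI_API_KEY")

# pinecone-client v2 style initialization, matching the video.
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="us-central1-gcp")

index_name = "homelessness"
if index_name not in pinecone.list_indexes():
    # 1536 is the vector dimension of text-embedding-ada-002.
    pinecone.create_index(index_name, dimension=1536, metric="cosine")

index = pinecone.Index(index_name)

batch_size = 100
for i in tqdm(range(0, len(df), batch_size)):
    batch = df.iloc[i:i + batch_size]
    ids = batch["id"].tolist()
    embeds = batch["embeddings"].tolist()
    # Metadata pairs each file name with its content.
    metadata = [
        {"file_name": fn, "file_content": fc}
        for fn, fc in zip(batch["file_name"], batch["file_content"])
    ]
    # upsert (rather than insert) overwrites existing ids, so re-running
    # this cell never creates duplicate vectors.
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```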
So let me run it, and you can see the small progress bar coming up here. To verify whether the data was inserted, you can quickly call index.describe_index_stats(); that is one way to verify, and here you can see that eight records are present. Next, we need to take a query from the user and generate the embedding for that particular query; then we take the embedding of the user's query and match it against the database, and if it matches, it pulls up all the relevant documents for us. Let me quickly write down the steps: first, take the query and generate an embedding for it; second, use the output of that first step to query the database, which gives us a list of docs, or texts, whatever you want to call them; and third, use the output of step two to create a completion prompt. So it is a multi-step process. One note here: although only four documents were in the store, you see eight, because four records were already there in my database; I am using the same index I used earlier, and that is the reason I am seeing eight documents.

Let me paste some code here, which I grabbed from the OpenAI Cookbook itself. What we are doing is constructing the prompt. First we take the embeddings object (the one for text-embedding-ada-002) and, for whatever the user asks, say "What is the reason for homelessness?", we generate the embedding for that query. Then we query the index with that embedded query and ask for the top three matching documents. It returns the top three matches, we collect their content, and we set that as the context: this is the context we will pass to ChatGPT for getting our answers. So rather than passing all eight documents, we now say: give me only the top three matching ones and pass those as context. Now that we have the context, we have the relevant information we want to use for answering the user's question, and the next thing is to construct the prompt itself. Let me show you in a notepad how exactly it looks. We say to ChatGPT: "Answer the question based on the context below", and the context is the text we received from Pinecone; then we enter some new lines, put "Question:" followed by whatever the user asks, and then simply "Answer:". That is the prompt we pass to ChatGPT. To quickly reiterate: we are telling ChatGPT to answer the question based on the context below, this is the document or the information from which it should answer our queries, then we supply the question and leave the answer open so that GPT will type its answer there. A sketch of this query-and-prompt step follows.
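A minimal sketch of the retrieval and prompt construction, assuming the same objects defined above; the function name construct_prompt is my own, and the character-based limit check is a simplification of the Cookbook snippet:

```python
# Quick check that the upsert landed (should report 8 vectors here).
print(index.describe_index_stats())

limit = 3750  # rough budget for how much context to stuff into the prompt

def construct_prompt(query: str) -> str:
    # Step 1: embed the user's query with the same model used for the files.
    xq = embeddings.embed_query(query)

    # Step 2: pull the top three matching documents from Pinecone.
    res = index.query(vector=xq, top_k=3, include_metadata=True)
    contexts = [m["metadata"]["file_content"] for m in res["matches"]]

    # Step 3: stitch the contexts into a completion prompt, stopping once
    # adding another document would push us past the limit.
    prompt_start = "Answer the question based on the context below.\n\nContext:\n"
    prompt_end = f"\n\nQuestion: {query}\nAnswer:"
    context_text = ""
    for c in contexts:
        if len(context_text) + len(c) > limit:
            break
        context_text += c + "\n\n---\n\n"
    return prompt_start + context_text + prompt_end

prompt_with_context = construct_prompt("What are the effects of homelessness?")
print(prompt_with_context)
```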
One thing I still need to define is the limit; limit is a variable I created here, and we can set it to 3750. This is the threshold beyond which I am saying: do not consider any more text; if we are going beyond this limit, let's not include it. That is the reason for the check: if the prompt is going beyond the limit, just stop building it and leave the rest out. How to handle that situation more gracefully we will probably see in my next video, but for now let's stick with this and go ahead. It ran successfully, which means we were able to construct our prompt. Next we create the query: we make a call to the function we just created, passing in the question the user is asking, and we get back the prompt with context, which is the prompt we will be sending to OpenAI to answer our question. I can quickly show you how it is constructed: "Answer the question based on the context below", then the text retrieved from my text files, and at the end the user's question and "Answer:".

Once this is done, the only thing remaining is to call the completion endpoint and execute this particular query. We can do that using openai.Completion.create, and inside it we first pass the engine (you can use any; let's go with text-davinci-003) and a few parameters as usual: the prompt, which is nothing but the one we just constructed above; temperature equal to 0; max_tokens equal to 350 for now; and top_p equal to 1. Let's run it. This is the response we received: the question we asked is "What are the effects of homelessness?", and this is the answer. Now, if you want to extract the exact text, we need to parse the response properly: let me store the output in res, and the text lives under choices, so we say res["choices"][0]["text"]. Let's run it, and here is a perfect answer. The question is "What are the effects of homelessness?", and the effects include death, reduced life expectancy, and increased risk of serious health conditions; all of this is coming directly from those files. So this is how you can utilize Pinecone to retrieve your vectors based on a search query, and the best benefit is that you need not pass your entire data to OpenAI to get your answers; you can pass only the short, relevant context, saving cost by reducing your token size. I hope you enjoyed watching this video, and thanks for watching!
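For completeness, a sketch of the final completion call with the pre-1.0 openai client used in the video:

```python
res = openai.Completion.create(
    engine="text-davinci-003",
    prompt=prompt_with_context,
    temperature=0,
    max_tokens=350,
    top_p=1,
)

# The generated answer sits under choices[0].text.
print(res["choices"][0]["text"].strip())
```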
Info
Channel: Shweta Lodha
Views: 9,682
Keywords: Artificial Intelligence, Programming tutorial, Programming, OpenAI, Machine learning, Shweta lodha, ChatGPT, What is embedding in OpenAI, What is langchain?, Integrate OpenAI with Langchain, OpenAI token error, How to Save Embeddings permanently, How to save embeddings in vector database, Saving embeddings in Pinecone, How to use langchain with pinecone and openai, How to query pinecone database, How to query vector database, OpenAI embeddings, How to save openai model
Id: biJScl-csmA
Length: 25min 36sec (1536 seconds)
Published: Tue Mar 28 2023