OpenAI's New GPT 3.5 Embedding Model for Semantic Search

Video Statistics and Information

Captions
Today we're going to have a look at how we can use OpenAI's new text embedding model, creatively named text-embedding-ada-002, to essentially search through loads of documents, and do it in a super easy way. We really don't need to know that much about what is going on behind the scenes here; we can just get going with it and get really impressive results super quickly.

To start, let's have a quick look at how all of this is going to work. If you follow any of these videos, it's a very similar architecture to what we would normally use. We start with our data source, which is going to be over here, and we're going to use the new ada-002 model to embed it. So what we have in here are sentences; some text goes through like this, and what we're doing is creating meaningful embeddings. For example, if two sentences have a very similar meaning, then within the vector space (because that's what we're converting them into, vectors) they will be located very closely together. And of course we know that when OpenAI do something, they do it pretty well, so the expectation here is that the ada-002 model is going to be pretty good at creating these dense vector representations.

From that we get our embeddings; I'm going to just have them in this little square here. What we're going to do with those is take them over into Pinecone, which is going to be our vector database, where this vector space will essentially live. So we have our database here, and they go into there like that. This process is what we refer to as indexing: we're taking all of our data and indexing it within Pinecone using the ada-002 model.

Now there's another step to this whole pipeline that we haven't spoken about, and that is querying. Querying is literally when we do a search. Let's say some random person comes along and they're like, "I want to know about this". We don't know what they're asking about, it's a mystery, but they have this query and they've passed it to us. What we do with that query is take it into ada-002 and embed it to create a query vector; it's going to be a smaller box called xq. We take that over to Pinecone and we say, "Pinecone, return the top_k most relevant vectors that we have already indexed", where top_k is just a number, let's say three or five. Say we use five: we get those five vectors back, one, two, three, four, five, and we return them to the user. But when we return them to the user, we're actually not going to return the vectors, because they're just numbers and won't make any sense; we're going to return the text that those vectors were embedded from. And that is how we will build our system.

Now, it's actually super simple; this chart probably makes it look way more complicated than it actually is. So let's take a look at the code. We're going to be working from the example at docs.pinecone.io/docs/openai; we'll open it in Colab and just work through it. We get started by installing any prerequisites that we have, so we want to install the Pinecone client, OpenAI, and Hugging Face datasets. Go ahead and run that; it will take a moment.
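For reference, the install cell might look something like this (package names as they were at the time of the video; both the pinecone-client and openai libraries have changed their APIs in later releases):

    # install the Pinecone client, the OpenAI client, and Hugging Face Datasets
    !pip install pinecone-client openai datasets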
Okay, great. So come down here, and the first thing we're going to need to do is create our embeddings. To do that we need to initialize our connection to OpenAI, and for that we need two keys: an organization key and our secret API key. To get those, we head over to beta.openai.com, and you'll need to log in at the top right. I've already logged in, so I can go over, click on my profile, and click "View API keys". The first page you come to is the secret keys page. You can't copy a key that's already been created, so what you need to do is create a new secret key and copy it from there. That secret key then gets pasted into the notebook; I have mine stored in a variable called api_key. Then we return to the OpenAI page, go over to Settings, and in there we'll also find our organization ID. We copy that and it goes into the notebook as well; I have mine stored in another variable called org_key.

Now I can run this, and as long as we've authenticated correctly we'll get a list of all the models that are available, which we retrieve with the OpenAI engine list call. You can see we have this big list; I don't know if ada is at the bottom or not, so I'm not going to search through it, but we'll see which model we're using here. This is a new model from OpenAI, it's much cheaper to use, and the performance is supposedly much greater, so we'll go ahead and try this one out: text-embedding-ada-002.

Just as an example, this is how we create embeddings: openai.Embedding.create, and we can pass multiple things to embed here. We have two sentences, which means we will end up outputting two vector embeddings, and in the engine parameter we just pass the model that we'd like to use, so this one. We run that, and if it worked correctly you should see that we have these vectors in here, plus some little bits of information, so it's pretty cool.

Now, one thing that I would like to check here is: do these vectors have the same dimensionality, and what is that dimensionality? They were produced by the same model, so we would expect them to match. We're just checking the response: we have data[0] and its embedding, so essentially, if I scroll up a bit you'll be able to see, we have "data", we're going for the first item in the list, and we're looking at its "embedding". Print those out and we should see that we get 1536, which is the embedding dimensionality of the new ada model. Now what I want to do is extract those into a list, just like what we're going to be doing later, and we can see that we do in fact have two of them, and we can check the dimensionality there as well.
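Put together, the authentication and embedding steps above might look something like this minimal sketch (this uses the openai 0.x-era API shown in the video; the key values and the two example sentences are placeholders):

    import openai

    # credentials from beta.openai.com (placeholders here)
    api_key = "sk-..."    # your secret API key
    org_key = "org-..."   # your organization ID

    openai.api_key = api_key
    openai.organization = org_key

    # list available models to confirm we've authenticated correctly
    print(openai.Engine.list())

    MODEL = "text-embedding-ada-002"

    # embed two sentences in one call; one vector comes back per input
    res = openai.Embedding.create(
        input=[
            "Sample document text goes here",
            "there will be several phrases in each batch",
        ],
        engine=MODEL,
    )

    # extract the embeddings into a plain list of lists
    embeds = [record["embedding"] for record in res["data"]]
    print(len(embeds))     # 2 inputs -> 2 vectors
    print(len(embeds[0]))  # 1536 dimensions each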
So now what we need to do is initialize a Pinecone instance, and this is where we're going to store all of our vectors. For that we need to head over to app.pinecone.io. You will need to sign up if this is your first time, and you should come through to a page that looks kind of like this. I have "James's default project" up here; you will have your name followed by "default project". We're not going to create our first index here, we'll be doing that in Python; what we do need is the API key, so I'm going to take my default API key, copy it, and paste it into the notebook. I've stored mine in a variable called pinecone_key, so I can run that.

What this will do is initialize our connection to Pinecone and check if there is an index called "openai" within our project. Within this space here we don't have any, so it doesn't exist; if it doesn't exist, it will be created, and it will use this dimension here, which is the 1536 that we saw earlier. Then we'll connect to that index. So let's run that, and if we navigate back to app.pinecone.io, we can refresh and we should see that we have an index here; it was initializing, and now it's ready. We can see all the details: the dimensionality, the pod type we're using, the metric, and so on. These are just default values, but yes, we do want to be using cosine, and you can change the pod type depending on what you're wanting to do.

So back in our code, let's go ahead and begin populating that index. To populate the index we obviously need some data; we're just going to use a very small dataset, 1,000 questions from the TREC dataset. We're getting this from Hugging Face Datasets, so if we go over to huggingface.co/datasets/trec we'll see the dataset that we're downloading, which is this one here. I think in total there are maybe 5,000-ish examples in there; we're just going to use the first 1,000 to make things really fast while walking through this example. We can see we have text, coarse_label, and fine_label; all we really care about here is the text. We can have a look at the first one, "How did serfdom develop in and then leave Russia", and we can compare that over here and see that it's exactly the same. Okay, cool.

So now what we're going to do is create a vector embedding for each one of these samples. Let me more or less walk through the logic of doing that. We're going to be doing it in a loop, in batches of 32. We extract the start position of the batch, which is i, and the end position of the batch, which should actually be named i_end. We get all of the text within that batch, and we get all the IDs, which are just a count; you can use actual IDs if you want, but for this example it's not really needed. Then we create our embeddings using the OpenAI endpoint that we used before: we have our input, which is our batch of text, and the engine, which is the ada-002 model. Then we just reformat those embeddings into a format that we can take and put into Pinecone.

Also, later on when we're querying, we don't want to see the vectors, because they don't make sense to us; we want to see the original text. So to make that easy, what we're going to do is prepare our metadata, and the metadata is literally just the text that we want to see. It will basically just be a little dictionary attached to each one of our vectors, and it means that when we're querying we can just return that and read the actual text rather than looking at the vectors. So we zip all of those together; each record is going to be a unique ID, the vector embedding, and the attached metadata, and then we upsert all of that into Pinecone. We can run that; it should be pretty quick. Okay, 14 seconds total, really super fast for a thousand items; that's pretty insane.

With that done, the indexing portion of our app is complete, so all of this in green on the diagram we can now cross off.
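The whole indexing flow described above might be sketched like this (continuing from the previous cell; it uses the pre-v3 pinecone-client API from the time of the video, and the environment string is an assumption you'd swap for your own project's region):

    import pinecone
    from datasets import load_dataset

    pinecone_key = "<your-pinecone-api-key>"  # placeholder

    # initialize the connection to Pinecone (old pre-v3 client)
    pinecone.init(api_key=pinecone_key, environment="us-east1-gcp")

    index_name = "openai"

    # create the index if it doesn't already exist, matching ada-002's 1536 dims
    if index_name not in pinecone.list_indexes():
        pinecone.create_index(index_name, dimension=1536, metric="cosine")

    index = pinecone.Index(index_name)

    # load the first 1,000 questions of the TREC dataset from Hugging Face
    trec = load_dataset("trec", split="train[:1000]")

    batch_size = 32
    for i in range(0, len(trec["text"]), batch_size):
        i_end = min(i + batch_size, len(trec["text"]))  # end of this batch
        lines_batch = trec["text"][i:i_end]             # text in the batch
        ids_batch = [str(n) for n in range(i, i_end)]   # simple counter IDs
        # embed the whole batch in a single API call
        res = openai.Embedding.create(input=lines_batch, engine=MODEL)
        embeds = [record["embedding"] for record in res["data"]]
        # attach the original text as metadata so queries can return it
        meta = [{"text": line} for line in lines_batch]
        # upsert (id, vector, metadata) records into Pinecone
        index.upsert(vectors=list(zip(ids_batch, embeds, meta)))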
Now what we need to focus on is querying. How do we do querying? It's actually really easy. We have a query; I'm going to say "What caused the 1929 Great Depression?". We're kind of limited in the number of questions we can ask here, because we only have 1,000 examples indexed; realistically you'd probably have millions or more, so we're going to be limited in what we can actually ask, but this is still pretty good for demonstrating the workflow.

So let's run this. Basically we're doing the exact same thing with the query that we did with the lines of the TREC dataset before: we're just embedding it using the ada-002 model; in this case we just have one string input. Then, in the response, we retrieve the first item from data (there's just one item in there anyway) and take the embedding from it. If I take a look at this, it's a 1536-dimensional vector. Then we pass that to index.query, like so (we can remove those square brackets there), with top_k equals five and include_metadata set to true; we do want to include the metadata, because that is what returns the original text back to us.

So let's see: are we returning questions similar to the question we asked? "Why did the world enter a global depression in 1929?", "When was the Great Depression?" (I don't know what is with the weird formatting here), and then it's talking about some other things that are maybe somewhat related, I'm not really sure, or just things from around that sort of time era. But you can see from the score here that the similarity drops really quickly when we come down to these, because they're actually not that relevant; they're just kind of within the same context, I suppose. So that's pretty cool; it's clearly returning the correct question that we would expect it to based on the question we asked. We can also format that a little bit nicer, so we'll just run that, and it's a little bit easier to read than the raw response format that we had up here.

Now let's make it a little bit harder. We're going to replace the correct term, "depression", with the incorrect term, "recession", and see if it still understands our query. This is where a lexical search, where you're searching by keywords, would fail; in this case we should hopefully see that ours does not. Replicating the same logic again, we can see that yes, the similarity is slightly lower, because we're using a different word, but it's still returning the relevant question as our first result there.

Now let's make it even harder: "Why was there a long-term economic downturn in the early 20th century?". Is it going to figure out that we're talking about the global depression of 1929? Yes, it does, and the similarity is actually pretty good there. So despite not really sharing any of the same words, it manages to identify that this is talking about the same thing, which is pretty impressive.
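The query side might look like this minimal sketch (again continuing from the earlier cells; the query string is the one from the video):

    query = "What caused the 1929 Great Depression?"

    # embed the query with the same model used at indexing time
    xq = openai.Embedding.create(input=query, engine=MODEL)["data"][0]["embedding"]

    # retrieve the five most similar vectors along with their text metadata
    res = index.query([xq], top_k=5, include_metadata=True)

    # print similarity score and the original question text for each match
    for match in res["matches"]:
        print(f"{match['score']:.2f}: {match['metadata']['text']}")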
Now with that done, we can finish with this example. One thing you might need to do here is head over to the Pinecone console and just go ahead and delete the index, or you can do it in code (there's a minimal sketch at the end of these captions), completely up to you. Great, so that's it for this walkthrough and example. I hope this has been useful. It's really cool to see OpenAI's new embedding model, and from what I've heard the performance, although not that clear from this small example, is really good. And as you have seen, it's super easy to use: a few lines of code and we have this really cool, high-performance semantic search app with OpenAI and Pinecone, and we don't really need to worry about anything; it's just super easy to do. So I hope this has all been interesting and useful. Thank you very much for watching, and I'll see you again in the next one. Bye!
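As mentioned above, the index can also be deleted in code once you're finished with it; a one-liner with the same pre-v3 client (assuming the index name "openai" used earlier):

    # delete the index and free its resources when you're done
    pinecone.delete_index("openai")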
Info
Channel: James Briggs
Views: 48,251
Keywords: python, machine learning, artificial intelligence, natural language processing, nlp, Huggingface, semantic search, similarity search, vector similarity search, vector search, gpt-3, gpt3, gpt-4, gpt4, text-embedding-ada-002, openai gpt 3, openai gpt 3 tutorial, openai gpt 3.5, gpt 3.5, gpt 3.5 explained, openai embeddings, openai api, openai api key python, openai ada model, new openai model, new gpt4 model, new gpt 3, new gpt 3 model, openai semantic search, gpt 3 semantic search
Id: ocxq84ocYi0
Length: 16min 14sec (974 seconds)
Published: Wed Dec 28 2022