5. OpenAI Embeddings API - Searching Financial Documents

Captions
Word embeddings are a way of representing words and phrases as vectors. In the video on OpenAI Whisper, I mentioned how you could take text and convert it to a vector. What's fascinating about this concept is that when you take words and phrases and convert them to a numerical representation, words that are numerically similar are also similar in meaning, and this allows us to do things like build a search engine. In this tutorial I'm going to show you how to build a semantic search engine using word embeddings, so you'll be able to build this application right here. I've generated the front end for this application using, of course, ChatGPT to generate the Python Flask code, and DALL·E 2 to generate this little logo.

What I can do here is search through a bunch of different earnings transcripts. I'm going to type "demand for artificial intelligence is increasing," and this is going to search an entire corpus of documents. Rather than trying to match an exact string or exact phrase, it's going to search for meaning in these earnings call transcripts. When I click search, it searches the Microsoft earnings transcripts and returns results ordered by relevancy. So when I search for "demand for artificial intelligence is increasing," you'll see in this earnings call where Satya is talking about going further with new AI-powered capabilities, automating natural language into advanced workflows. The second result talks about AI and powering DALL·E with the Azure OpenAI Service, the third result talks about cloud sustainability, and so on and so forth.

So how does this search engine work? How do we take the words and phrases we see on a sheet of paper and convert them into numbers, so that we can see the matrix behind everything, do math on it, and perform classification, anomaly detection, clustering, and all these cool natural language tasks? That's what I'm going to show you today. I've prepared this word embeddings notebook just for you, and we're going to walk through how all of this works: the math behind it, how to use OpenAI to generate word embeddings, and how to create a semantic search back end, step by step, in a way that you can understand and apply. We're going to build several projects on top of this, so this is highly important to understand; I'm going to do three or four videos using this exact same concept to show you different use cases. I think this is one of the most fascinating concepts I've covered on this channel so far, so I hope you're as excited about it as I am.

Let's go ahead and get started. Click the link down below, open up your Google Colab notebook to page one, and follow along with me; we'll walk through this line by line. The first thing we need to do is install the OpenAI Python package, as we did in video three of this series, so we just run the cell that does pip install openai; it installs the package with the Python package manager. Next we import the Python packages we need: the openai package, pandas for DataFrames, and numpy for working with NumPy arrays. We're also going to configure our OpenAI API key. If you don't know how to get an API key, I covered all of this in video number three; this series builds on itself, so I'm not going to go over it again. Get an API key, put it in this little prompt here, and it will be stored and OpenAI will be configured.
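For reference, that setup cell looks roughly like this. This is a sketch assuming the pre-1.0 openai package that was current when this video was published; the getpass prompt stands in for the notebook's key prompt:

```python
# Setup sketch, assuming the pre-1.0 openai package (current as of late 2022).
# First run: pip install openai pandas numpy

import getpass

import numpy as np
import openai
import pandas as pd

# Prompt for the key rather than hard-coding it in the notebook.
openai.api_key = getpass.getpass("Enter your OpenAI API key: ")
```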
Configuring the key also allows OpenAI to track the number of API requests you make.

To make sure we have a very clear understanding of this concept and how it works, I've started with a very simple list of words, included right here. This is words.csv; you see it has the words red, potato, soda, hamburger, cheesecake, mocha, fizzy, carbon, banana, things like that. If I click Raw here, I get a raw CSV file, and I can do a Save As in my browser and save it as words.csv, downloading it locally. So I want you to download this words.csv file and then upload it to Google Colab. If you go to the file browser here, I've already uploaded these, but I'm going to upload it again just so you see it, and you should have a words.csv there. Once you have that file, you should be able to read it into a pandas DataFrame, and now we have this variable called df, a pandas DataFrame that contains all of these words.

The next thing we're going to do is calculate the word embeddings. What is this? This is what I said about converting these words into vectors, and we're going to use OpenAI to do it. If you look in the OpenAI documentation, there's a section on embeddings that tells you what they're for. Under the hood it makes an HTTP request: you give it some words, and it returns this vector of numbers; the embedding you see here is just a big series of numbers returned in a response. Rather than constructing that API request by hand, we can use the OpenAI Python package to import a function called get_embedding. The way this function works, you pass it a string of text and the name of an engine; we described the engines in the previous video: Davinci, Ada, Curie, a bunch of different models. For these text embeddings we're going to use the text-embedding-ada-002 engine. These API requests are really cheap, fractions of fractions of a cent each, so if you have a free tier you should be able to make tons of them without worrying about it.

Now, this right here is kind of a complex expression; a lot of people understand it, but maybe not everyone understands this lambda expression. So first I'm going to run the get_embedding function by itself: instead of x, I'll type an actual string, say "the fox crossed the road." If I do this, it passes the string in and converts that sentence to a vector. Look at all the numbers that came back. They don't mean much to you, but they mean a lot to OpenAI; these vectors represent all these different words and phrases. Rather than running that once, I want to run it on every single word in my corpus. I have 26 words here, and I want to convert all of them at once and create a new column with the word vectors right next to them. We'll do this in batch: I take my DataFrame, create a new column called embedding, and store in it the result of applying the get_embedding function to every single row of the text column.
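That read-and-embed step, sketched below; it assumes words.csv has a header column named text, as in the video, and uses the get_embedding helper shipped with the same openai package:

```python
# Sketch: embed every word in words.csv in one pass.
from openai.embeddings_utils import get_embedding

df = pd.read_csv("words.csv")

# One API call per row; text-embedding-ada-002 returns a 1536-dimensional vector.
df["embedding"] = df["text"].apply(
    lambda x: get_embedding(x, engine="text-embedding-ada-002")
)
```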
The way these lambda expressions work, a lambda is just a function without a name. Every single item in that column comes in as x, get_embedding is called for it, and the result is stored in the new column: it gets the embedding for red, gets the embedding for potato, and stores each result right next to its word. That's how that function works.

After we have this new column called embedding, we store the results in a new file called word_embeddings.csv. Why store it in a new file? So that we have it cached. We can keep these word embeddings, these vectors, in a database or a flat file so we don't have to call OpenAI over and over again; storing them locally saves money and saves time. Anything that isn't changing all the time, you want to cache locally.

Now that I have this word embeddings file locally, I'm going to open it up and look at what's inside. Look at that: not only is the text in the CSV file, there's a huge column of numbers next to it. Just for the word red, there are hundreds of numbers, and just for the word potato, tons and tons more. Now I take this word_embeddings.csv that I just saved and read it right back into a DataFrame, and you'll see we have all of our text with the word embedding right next to it; every single word in this DataFrame has an associated big list of numbers.

"Wow, Larry, we have all these numbers, what do I do with them?" Well, this is the part where it gets interesting. Let's enter a search term. I run this cell, it presents an input prompt saying "enter a search term," and I type "hot dog." Hot dog is not in this list of words at all, but I'm going to get an embedding for the search term, the numerical vector that corresponds to hot dog. I make that call again and get back the search term vector; this, in the matrix, is what a hot dog looks like: a big list of numbers. Now I'm going to take this numerical representation of hot dog, search through the numerical representations of all the other words I have cached in this DataFrame, and find which vectors are closest to the vector for hot dog: search by similarity.
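A sketch of the cache-and-reload step plus the search prompt; the eval/np.array conversion is needed because CSV stores each vector as a string, a detail the video returns to explicitly in the Fed example:

```python
# Cache the embeddings locally so we don't pay for the same API calls twice.
df.to_csv("word_embeddings.csv", index=False)

# Read the cache back in. CSV stores each vector as a string like "[0.01, ...]",
# so eval parses it into a list and np.array turns it into a NumPy array.
df = pd.read_csv("word_embeddings.csv")
df["embedding"] = df["embedding"].apply(eval).apply(np.array)

# Embed the search term the same way the words were embedded.
search_term = input("Enter a search term: ")  # e.g. "hot dog"
search_term_vector = get_embedding(search_term, engine="text-embedding-ada-002")
```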
The way I'm going to do that is to import a function called cosine_similarity. If you look up cosine similarity on Wikipedia, or ask ChatGPT, you'll see that cosine similarity is just a measure of similarity between two sequences of numbers, described by the equation cos(θ) = (A · B) / (‖A‖ ‖B‖). This notation might look very complicated or very simple depending on your level of math, but I assure you it's actually quite simple; it's really just multiplication. The top term is a dot product, and all that is, is multiplying the corresponding terms of the two vectors together and summing them.

For those who want to perform this calculation manually, I've included it at the bottom here. Say the first vector V1 is just the numbers (1, 2, 3) and the second vector V2 is (4, 5, 6). All you're doing is a1·b1 + a2·b2 + a3·b3, which is 1×4 + 2×5 + 3×6, and that's the dot product: 32. You could use NumPy to compute it, or do it by hand. Then, for the bottom, take the vector (1, 2, 3): it's 1² + 2² + 3² under a square root, so √(1 + 4 + 9) = √14 ≈ 3.74. It's just a summation of a bunch of multiplications.

You can even take these numbers and put them into a 3D vector plotter. If I enter (1, 2, 3) as one vector and (4, 5, 6) as the other and click draw (this is just some vector plotter I found online), you can see the two vectors are fairly close together. What we're actually doing is taking a vector representation of all these different words and seeing how close they are in space. This is a very simple vector of just three numbers, though; if you look in our word embeddings file, those vectors are far longer, which is hard to visualize on a graph and hard to calculate by hand. Fortunately for us, there's a built-in function called cosine_similarity that runs through all these numbers, multiplies the vectors together, and performs the calculation for us; then we can sort all these vectors in space and find which one is closest to hot dog. Let's do that right now.

Back to the cosine similarity of hot dog. We have the search term vector, the vector representation of hot dog, and we apply the cosine_similarity function to every vector in our DataFrame, checking the distance between each one and the vector for hot dog, and store the result in a new column called similarities. When I run this, it calculates all of those similarities and we see our complete DataFrame. Now, to find the words most similar to hot dog, all we need to do is sort the values in the similarities column and take the top 20. When I sort by similarities, you'll see the most similar words to hot dog are hamburger, cheeseburger, french fries, and cheese, which makes a lot of sense: it's not similar to water, it's not similar to milk, it's not similar to latte. It correctly identified hamburger as the most semantically similar word to hot dog, just using these numbers, and it calculates that cheese is fairly close as well, a little further down. You can even see the color red is probably a little higher because maybe you could say a hot dog is kind of red, and yellow is in the top 20 too.
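Here is that hand calculation verified with NumPy, followed by the similarity search itself; cosine_similarity is the helper from the same embeddings_utils module as get_embedding:

```python
from openai.embeddings_utils import cosine_similarity

# The worked example from above: dot product and norm by hand.
v1, v2 = np.array([1, 2, 3]), np.array([4, 5, 6])
print(np.dot(v1, v2))        # 1*4 + 2*5 + 3*6 = 32
print(np.linalg.norm(v1))    # sqrt(1 + 4 + 9) = sqrt(14) ≈ 3.74
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # ≈ 0.9746

# Score every cached word vector against the hot dog vector,
# then sort to see the 20 most semantically similar words.
df["similarities"] = df["embedding"].apply(
    lambda x: cosine_similarity(x, search_term_vector)
)
print(df.sort_values("similarities", ascending=False).head(20))
```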
Maybe you could say the bun is kind of yellow, or that there's red ketchup and yellow mustard on the hot dog. It's really neat how you can use these vector calculations to find words that are semantically similar.

What's even more interesting about these numbers is that you can add them together; we can add two different vectors. What happens when we take a copy of this DataFrame, which I'll call the food DataFrame, and grab the milk vector, which is at index 10 here, and the espresso vector at index 19? I take the embedding for milk and the embedding for espresso, add them together as numbers, and store the result in a vector called milk_espresso_vector. I've added the vectors for two words, and now I have a new vector, and I can apply cosine similarity to the milk espresso vector against all the words we already have. So what's closest to milk plus espresso? Well, espresso and milk, because they're part of it, but the closest other thing is a latte, which is literally a combination of milk and espresso. When we sort this we get espresso, milk, latte, mocha, and coffee. I for one find that very fascinating if you've never seen it demonstrated before.
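A sketch of that vector-arithmetic experiment; the row positions 10 and 19 for milk and espresso are the ones mentioned in the video, and they depend on the order of words.csv:

```python
# Copy the word DataFrame and add two word vectors together.
food_df = df.copy()
milk_vector = food_df["embedding"].iloc[10]      # position of "milk" per the video
espresso_vector = food_df["embedding"].iloc[19]  # position of "espresso" per the video

# Elementwise addition; this relies on the embeddings having been converted
# to NumPy arrays earlier (plain Python lists would concatenate instead).
milk_espresso_vector = milk_vector + espresso_vector

# Score every word against the combined vector; "latte" should rank near the top.
food_df["similarities"] = food_df["embedding"].apply(
    lambda x: cosine_similarity(x, milk_espresso_vector)
)
print(food_df.sort_values("similarities", ascending=False).head())
```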
Tying this all back into finance: I wanted to walk through all of these calculations very slowly, with words we're already familiar with like colors, drinks, and food, to make sure we have a clear understanding; I think that's a pretty clear description of what we've done here. Now let's figure out how to make our financial search engine by applying the same concept not just to words but to entire sentences. What I have here is a Microsoft earnings call transcript in which Satya, the CEO of Microsoft, is talking about the latest Microsoft earnings, with all these different paragraphs. Instead of applying embeddings to individual words, we can convert an entire sentence to one giant vector and apply the same functionality: I type in some kind of phrase, calculate all the cosine similarities, and find the sentences closest to what I typed. That's how you can search for a phrase or a term: even though there's no exact string match, we can find the sentence in here that's closest in meaning.

Now that we understand all the concepts, I'll go through this quickly. We download the Microsoft earnings call, upload the CSV file, and read it into a DataFrame called earnings_df; there you see all the earnings sentences in a DataFrame. Next we calculate the embeddings for all those sentences, calling get_embedding on each one, and store the new DataFrame in a file called earnings_embeddings.csv. If I open it here, you'll see it's too large to display, but you can open it locally if you download it. Then I search for a sentence; earlier I typed "artificial intelligence demand cloud products." That phrase won't appear exactly like that in my DataFrame, but we convert it to a search vector, calculate the cosine similarities against that earnings search vector, and sort the values. When we sort and open the DataFrame up a bit, the top result talks about Cosmos DB, DALL·E, and the Azure OpenAI Service; the second result mentions the cloud along with ambient intelligence solutions for automatically documenting patient encounters at the point of care; and the third result turns to data and AI and talks about the Microsoft Intelligent Data Platform. So the exact same concept applies here, and we get search results for a particular earnings transcript.

Now what you could do is build an entire search engine. You can take tons of text, not just finance, whatever text you want, load it into some giant database or a big CSV or flat file, calculate the embeddings, and make a search engine for your documents. Maybe your workplace has a bunch of documents, support documents for instance; someone mentioned maybe being able to automate customer support. Someone could type a question to query your documentation, and if you have it indexed and vectorized, you can build a search for that and return the answers they want. You can see how this could disrupt entire industries. Maybe you don't need a customer support person, or maybe you don't need a customer support forum: I go on a forum, I type and ask for something, and no one even replies to me. What if you could automate at least some type of reply, even if it's not the perfect one?

As a final example, I'll go back to our old friend Jerome Powell. I've linked a CSV file containing the Fed speech that we discussed in the OpenAI Whisper video, where we talked about how these sentences impacted price. Let's say we have some new sentences and we want to see how similar they are to ones the Fed has spoken in the past; maybe there are new sentences coming in on a live stream, for instance. So I read in this Fed speech that I've uploaded here, get the embeddings for it, and store them in a CSV file called fed_embeddings.csv. Then I input a sentence, "the inflation is too damn high," and search for everything Jerome Powell has said that is similar to it; what is the closest thing he has said to that? I get the embedding for "the inflation is too damn high" and then read my fed embeddings CSV file back into a DataFrame. One thing I didn't mention earlier is this part: apply(eval) and then apply(np.array). Inside the CSV file, each embedding is just a string of numbers, but we want a NumPy array, so all that's doing is taking the string, running eval on it, then np.array, converting it back to a Python data structure. So now I have the Fed DataFrame; let's calculate the cosine similarities, go down here, sort, and see the things he said that are most similar to "the inflation is too damn high."
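The Fed example end to end, sketched; the input file name fed_speech.csv and its text column are assumptions for illustration, while fed_embeddings.csv follows the video:

```python
# Embed each sentence of the Fed speech and cache the result.
# fed_speech.csv is a hypothetical name; one sentence per row in a "text" column.
fed_df = pd.read_csv("fed_speech.csv")
fed_df["embedding"] = fed_df["text"].apply(
    lambda x: get_embedding(x, engine="text-embedding-ada-002")
)
fed_df.to_csv("fed_embeddings.csv", index=False)

# Reload the cache, converting each stringified vector back to a NumPy array.
fed_df = pd.read_csv("fed_embeddings.csv")
fed_df["embedding"] = fed_df["embedding"].apply(eval).apply(np.array)

# Embed the query sentence and rank every Fed sentence against it.
query = "the inflation is too damn high"
query_vector = get_embedding(query, engine="text-embedding-ada-002")
fed_df["similarities"] = fed_df["embedding"].apply(
    lambda x: cosine_similarity(x, query_vector)
)
print(fed_df.sort_values("similarities", ascending=False).head())
```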
So if I open this puppy up right here, how does Jerome Powell say it? He doesn't say "the inflation is too damn high." He says "recent inflation data again have come in higher than expected," so higher-than-expected inflation; "inflation remains well above our longer-run goal of two percent"; "my colleagues and I are acutely aware that high inflation imposes significant hardship"; "the longer the current bout of high inflation continues"; "we are highly attentive to the risks of high inflation." All these other ways of saying the inflation is too damn high, and we're able to search a speech for that meaning using these vectors of numbers.

So, pretty cool: in about 20 minutes we learned how to make a simple search engine, the math behind it, and how to use OpenAI to calculate these embeddings and find similar words and phrases. In the next video, we're going to take advantage of ChatGPT and DALL·E to make a simple front end and a simple logo, in case we want to productize this. If we want to make an AI product, we can take a front end, put it on this back end, slap it together, deploy it to a server, and we've got a cool little search engine; we could load tons of text into it, make the next CNBC or Seeking Alpha or whatever you want to make, and sell it for millions of dollars. That's it for now. Thanks for watching; see you in the next one.
Info
Channel: Part Time Larry
Views: 43,909
Keywords: openai, word embeddings, gpt3, semantic search, finance, trading, cosine similarity, nlp, natural language processing, word vectors, machine learning, ai, artificial intelligence, python
Id: xzHhZh7F25I
Length: 20min 30sec (1230 seconds)
Published: Thu Dec 29 2022