8. OpenAI Question Answering - Financial Advisor Embeddings

Video Statistics and Information

Captions
Nicole's in the chat, and Sean is here too; sounds like a party, that guy is a riot in the chat. Hey everyone, welcome back. In today's video I'm going to show you how to build a financial advisor using Python and OpenAI embeddings. Why embeddings, why not just use ChatGPT? A few reasons. Number one, there's no need for another ChatGPT tutorial; there are plenty already, and if you want to ask it a question, just use it. Secondly, ChatGPT has some limitations, as you've probably found out by now. If you ask it about recent information, such as something that happened in 2023, it'll tell you it wasn't trained on that; its training data only goes up to a couple of years back. So we need some way to feed current information into our system so it can use present-day context to answer our questions. Another limitation is that ChatGPT often dodges certain types of questions. If you ask it for advice about your finances, it plays dumb: "I am but a simple AI, I'm not capable." It dodges those questions because OpenAI doesn't want the liability of giving financial advice, so it will often just say "talk to a financial advisor." What we want instead is to get GPT-3 to answer a question based only on text that we provide, so we can supply good context and say: answer this question based on this text. One more limitation of systems like ChatGPT is that they're trained only on publicly available information. They crawled the web and learned from that, but what if we want to build a system on internal information? Anywhere you've worked, there's tons of private information:
internal training documents, PDFs, guidebooks, manuals, that kind of thing. So what if you want to build a question answering system on private information? You can build a custom solution using Python and OpenAI embeddings with just a little bit of code. As an example, I have this copy of a Series 7 exam manual here, and it has a lot of information about stocks, securities, bond characteristics, debt strategies, all kinds of stuff. Maybe ChatGPT would answer questions based on it, I'm not sure, but imagine having tons of PDFs internally with answers to all types of questions. You don't want someone to have to read through the whole thing and scan for the information; they just want to ask how to handle a situation, and we can have our program look up the appropriate information and generate a nice response automatically. Another example: let's say you're this guy, Ben Carlson. He's already written a blog with tons of posts, so there's a big corpus of text, and he's also written some books that are not online. A Wealth of Common Sense is on Amazon; you can buy it. I've seen a site called Ask My Book where you're able to ask a book a question, so maybe he wants a marketing site where people can ask his book a question and get a link to buy the book as well. There are a lot of different ways you might want to use question answering on private information that isn't on the public internet. All right, that all sounds good, but what are we actually going to build? We're going to build a little web interface where we can ask a question to a financial advisor. What is this financial advisor trained on? I took the information from a whole series of podcasts.
So Ben Carlson, who I just showed, runs a podcast called Animal Spirits, and he's also part of a collective called The Compound and Friends; the clip at the beginning of the show is from their channel. I'm a fan of that channel, I go in the livestream and so forth. They've talked about a lot of different financial topics, so what if I want to train a bot that answers questions the same way they would? What I did was use OpenAI Whisper, which we've talked about, to transcribe all of the podcasts they've ever produced, and then I vectorized the transcripts using OpenAI embeddings, so I have a numerical representation of all the text they have spoken. Then I can ask a question here and use prompt engineering, which we also discussed, to say: answer the question that I type in based on this context that I provide. So I type in a question, convert it to a vector, and since I have this corpus of vectorized text, I can use the cosine similarity we discussed to find which context in my big corpus is closest to the question I asked. Then I pull in that context and call the OpenAI API with GPT-3: write a response to this question based on this relevant text. We find the relevant text from the podcast and generate an answer based on that context, and with prompt engineering we can even say "answer this question in the style of Ben Carlson," since it's trained on words he has spoken. Just as an example, let's show how this works, and then I'll show you how to build it in Python. I'm going to type: "Not to brag" (this is something they often say on the show) "but I have enough cash to buy a house outright. Should I use all of my cash to buy the house, or should I take out a loan anyway and invest the rest of the cash in the market?" Then I'll ask the
question by clicking "Ask a Financial Advisor," and it's going to search this entire podcast corpus, find the appropriate text, and craft some type of response. "First of all, congrats on your financial success." Ben's a nice guy; he tends to say congrats, and I'll often see this answer open with "kudos" as well, since he says kudos a lot. It says it's great you have saved enough to purchase the house, then lists important considerations and individual preferences, like whether you're comfortable taking on debt, and so on. So it answers the question in a similar way to how it was answered on the podcast. All right, I've explained the why and the what, and given a brief demonstration of the project; now let's get to the coding part. I'm going to walk you through how to code something like this, and I think it will take two videos. In this first video I'll walk you through a Google Colab notebook that explains the logic of how we extract information from a single podcast, generate the embeddings, and ask a question of that podcast. Once we understand how to do one, we just apply the same concept to a batch of podcasts, so we'll build a pipeline where we can send in all the videos for a YouTube channel, for instance, transcribe and convert all of them, and get all the data prepared and cleaned up. I'll also show you how to build the front end, where you can type the question into a little text area, deploy it on the web, and have an actual site, askfinancialadvisor.com or whatever, that queries this big corpus of data we've already prepared. Let's first focus on the first part. If you watched video five of this series, on OpenAI embeddings, you're already way ahead of the game, and that video is now starting to
take off; it was the least popular, and now people are starting to get why this is such a powerful concept. There are going to be multi-million, maybe billion dollar businesses built on exactly the concept I'm showing you here. I encourage you to take all this code, which I'll link below, build stuff on top of it, and hopefully come up with your own unique idea and create tremendous value. Everyone wants to talk about AI for short-term trading and whatnot, but I think you can build something way bigger and generate real wealth if you find a great use case for this, so definitely be thinking about that. So here we are in the Colab notebook; follow along with the link below. The first thing I'm going to do is execute this first cell, which shows what kind of GPU you have available. With a GPU, transcription is really fast; you can still use OpenAI Whisper with a CPU, it'll just take a lot longer. Running that, you can see I have a Tesla T4. Next, we install the openai Python package with pip, and we also install pytube, the Python package for interacting with YouTube videos. What we're going to do is access a single podcast, this YouTube video I have linked, so let me open it real quick so you can see what we're dealing with. This is an episode of the Portfolio Rescue podcast. What they do is take questions from the audience, and Ben answers them, and the nice thing is the questions are all timestamped in the description. You'll see there's a question about inflation, a guy who owned Twitter stock, and so on. So what I did was take
advantage of the fact that those timestamps are there. I can transcribe this podcast, find the text transcriptions that correspond to those timestamps, and then train this question answering bot, so that if someone asks a question similar to one of these timestamped questions, I can generate a good answer based on what Ben said. That does raise the question of whether you should train models on what a person says, but to pay it forward I'm recommending their podcast, I'm going to deploy the site with a "buy his book" link, and I've already bought a bunch of stuff from these guys, so I'm supporting them; I'm not making any money from this tutorial, so I feel like I'm actually helping. The ethics discussion is a whole other conversation that is ongoing as we speak, but regardless, I am taking the text here and training this thing on it. So I'm installing openai and pytube; pytube can download a YouTube video given its URL, and that's what we'll use it for. We're also installing OpenAI Whisper; we did a tutorial on that already in video two. Whisper takes audio and transcribes it to text. Once those are installed, I just need to import the libraries: I import openai, I import whisper, we'll be using pandas DataFrames so I import pandas, I import YouTube from pytube (the class for downloading YouTube videos), and I also import getpass. getpass prompts me for my OpenAI API key, which I need in order to call the OpenAI API; we covered that in video three, on earnings call
summarization, where we talked about how to use the OpenAI API. All of these concepts I'm discussing have their own videos, and now we're putting them together into larger and larger projects. So I drop my OpenAI API key in here and press enter, and now it's stored in the API key variable. I'm also specifying a couple of models. We've discussed the Davinci model; that's the one used to generate responses. It's the fancier language model, more expensive than the others because it can generate more sophisticated answers. Then we have the embedding model, text-embedding-ada-002; if you watched video five, on OpenAI embeddings, you know it takes text and converts it to a vector representation, and it's really cheap to call, something like $0.0004 per thousand tokens, fractions of a cent. Next I use Whisper to load the base model, which is good enough to transcribe a podcast; you can see it downloads the base model, about 139 MB. I could use a larger model, but I don't think it's necessary here; base will be accurate enough. Once I've loaded that model, I want to download the YouTube video, so I provide a URL and instantiate the YouTube class. Once I've done that, I can get a list of the streams associated with this YouTube object via youtube_video.streams.filter. I've demonstrated this in more detail before, but if you really want to see what's available you can run dir() on the object and see the various attributes. The streams attribute is a list of streams at different qualities and bit rates; there's an HD video stream and so forth, but what I want is to filter down and only pull the audio stream,
and then I'm using .first() to get the first one, since there can be multiple audio streams in the list, and download it, saving it as financial_advisor.mp4. That saves it to the Colab file system, and if I refresh with that button you'll see financial_advisor.mp4 right there; I could even download and play it and hear the audio of this YouTube video. And since I've loaded the video with pytube, I can also see the description, which, as I showed, contains those timestamps as well as a description of the show, so I have all this metadata: timestamps, the questions associated with them, and the audio. Now I can call OpenAI Whisper, pass it the file name, and run it, and it will transcribe that audio file to text. I'm running the transcription now; with a fast GPU it should finish in about a minute, but if you don't get a free GPU it might take more like six minutes, so just know it can take a little while. It finished in a little over a minute, and if I print the output you'll see a lot of text with different points in time, for example "but yeah, that had to be tough, I've been in the office for six years." I can also access the text attribute to see the structure of the output; if I just print the text, you'll see "welcome back to Portfolio Rescue, this is a show where we take questions from you" and so forth, which is the beginning of the show. The next thing I did was create a list of all the episode URLs and put it in a Google sheet, and this is all
the 60-something episodes of the show, and I went ahead and extracted the descriptions already so we can process them in batch. For this particular episode, I believe it's episode 21... let me find it... it's actually episode 22; that's the URL for it, and I have the timestamps for all the questions, taken from the description. I already have a sheet of all these questions, so I exported it as a CSV, which downloaded to my desktop, and now I click the upload button here and upload it; the file just contains those questions and timestamps. Then I do a little bit of data manipulation: I load the spreadsheet into a pandas DataFrame and convert each timestamp to seconds, so two minutes and 16 seconds becomes 136 seconds as the start of that question, and then I calculate the end time of each question, creating start and end timestamps for every row. I output them in a pandas DataFrame and save it to a cleaner questions.csv file. Depending on what data you're working with, you may be able to jump straight to this step. Let me open that questions CSV and show what it looks like now. Basically, I transformed the Google sheet I just showed you into a flat format: episode, the YouTube URL, the start timestamp in its own column, and the start time in seconds. That way, if we build an interface for this, we could link directly to the timestamp in the YouTube video and play it starting at, say, three minutes seven seconds, from our own interface. You can see right here where I'm processing the Google sheet: I'm looking at the
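The timestamp cleanup described above can be sketched in pandas. This is a minimal reconstruction: the column names and the `to_seconds` helper are my own stand-ins, not necessarily what the notebook uses.

```python
import pandas as pd

def to_seconds(ts: str) -> int:
    """Convert a 'M:SS' or 'H:MM:SS' timestamp string to total seconds."""
    seconds = 0
    for part in ts.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

# hypothetical mini-version of the questions sheet exported from Google Sheets
df = pd.DataFrame({
    "episode": ["ep22", "ep22", "ep22"],
    "timestamp": ["2:16", "3:30", "5:10"],
})
df["start"] = df["timestamp"].apply(to_seconds)
# each question ends where the next one begins; the last gets 0 (end of episode)
df["end"] = df["start"].shift(-1).fillna(0).astype(int)
df.to_csv("questions.csv", index=False)  # the cleaned-up flat file
```

With this toy data, "2:16" becomes a start of 136 seconds and its end is the next question's start, 210 seconds.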
next row, getting its start time, and assigning that as the end time of the previous row, and so forth. For any row that doesn't have an end time, meaning the question runs to the end of the episode, I just set it to zero. So my nice normalized, cleaned-up sheet looks like this: a list of questions with start and end timestamps, all in a flat file, which is great. Now I can read this CSV back in, and I have a clean DataFrame of questions with start and end times. I'm going to make a copy and operate on just one episode for this video, so I filter for the rows where the episode column equals episode 22. That gives us a small sample so we can see exactly what's going on: a simple pandas DataFrame of just the questions asked in episode 22, one per row, with start and end timestamps in seconds. Now, this next chunk of code is where it gets a little tricky, but you should be able to follow it; let me go through it line by line. What I want to work with is the transcription's output segments. If you look in here, these are the output segments: all the sentences that were said, each with a start and end time provided by OpenAI Whisper. So "I've been in the office for six years" started at 1454 seconds and ended at 1460 seconds. We have these little chunks of text in our Whisper transcription, and we know the start and end time in seconds of each one. So what I want to do is take my transcript and find all
the words spoken between a question's start timestamp and its end timestamp; that's the text that answers that particular question. So I create a new column called context. Right now my DataFrame just has the questions, so I add the context column and set it by calling a function named get_question_context on each row. For each row, get_question_context filters the output segments, that big list from the transcript, down to only the ones that are part of that particular question. For question one, that means: get all the text belonging to the question that starts at 118 seconds and ends at 210 seconds. The helper it applies takes a start and end timestamp and a segment, and says: if the segment's start time is at or after the start I provided, and the segment ends at or before my end timestamp, then it's part of the question, so return True. Then we take all the matching segments and join them together. When I run that, you can see I was successfully able to get, for each question, all the parts of the transcript that fall between its timestamps: 118 seconds to 210 seconds, then 210 seconds to
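Here's a minimal, runnable sketch of that filtering step. The segment dicts mimic the shape of Whisper's output, and the function names follow the video's description but are my own reconstruction.

```python
def part_of_question(segment: dict, start: float, end: float) -> bool:
    """True if a Whisper segment falls entirely inside a question's time window."""
    return segment["start"] >= start and segment["end"] <= end

def get_question_context(segments: list[dict], start: float, end: float) -> str:
    """Join the text of every segment that belongs to the [start, end] window."""
    return " ".join(s["text"] for s in segments if part_of_question(s, start, end))

# toy segments standing in for result["segments"] from Whisper
segments = [
    {"start": 118.0, "end": 125.0, "text": "Buying a house outright"},
    {"start": 125.0, "end": 210.0, "text": "versus taking a mortgage."},
    {"start": 211.0, "end": 300.0, "text": "Next question about bonds."},
]

context = get_question_context(segments, 118, 210)
```

Applied row by row with `DataFrame.apply`, this is what fills the context column.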
316 seconds, and so forth. These are the pieces of the transcript where Ben is addressing that particular question, so now I have each question together with an answer that includes the question itself. Now that I have this pandas DataFrame, I can start calculating my embeddings. If you watched video five of this series, you know an embedding is simply a numerical representation of a piece of text: we have a big piece of text, and we want to calculate a series of numbers that represents it in a vector space. I import a function called get_embedding; I just give it some text and it returns an embedding. Let's see what one looks like: I run get_embedding on the context of the very first row, the zeroth index, and you should see a series of numbers. This calls the OpenAI API and returns a huge list of numbers; that's what get_embedding does. Now I create a new column on the DataFrame, which is called episode: a column named embedding, calculated for every row. I take each row and apply get_embedding to it, using the text-embedding-ada-002 engine, and store the result in the embedding column. So for each row I take the context, get an embedding for it, and assign it back; now I have an embedding column that's just full of numbers. I print out what the episode DataFrame looks like with the embeddings: all the questions, all the contexts extracted from the transcript, and I have
an embedding for each one of these contexts, a numerical representation for every row. If I run this again, it recalculates everything; it only costs a fraction of a penny, but I don't need to pay for it over and over, so I cache it by saving to a CSV file. I run that, and now I have a nice file called question_embeddings.csv; if I open it, you'll see a flat file containing the questions, the context extracted from our transcript, and the embedding we calculated with OpenAI. With this flat file of contexts and precomputed embeddings, we've done the hard part. Now I can type in a question. I'll store it as a string: "should I buy a house with cash or get a loan and invest the extra cash in the market". I get an embedding for that question, converting it to a long series of numbers, and now that I have a vector representation of my question, I can compare it against every vector in the file and find which block of text is closest to my question, then use that to generate an answer. To do that I import the cosine_similarity function, which measures how similar two vectors are. I create a new column called similarities: for each row, take the embedding column and compute the cosine similarity between that vector and my question's vector, then sort by the most similar. When I run this, you should see
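The scoring step can be sketched end to end with toy vectors. The `cosine_similarity` here is a hand-rolled version of the helper the openai package ships, and the three-dimensional vectors are stand-ins for the 1536-dimensional ada-002 embeddings.

```python
import numpy as np
import pandas as pd

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: a.b / (|a| * |b|)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy "episode" DataFrame with made-up contexts and tiny fake embeddings
episode = pd.DataFrame({
    "context": ["buy a house with cash", "pay off the mortgage", "pick growth stocks"],
    "embedding": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 0.1, 1.0]],
})
question_embedding = [1.0, 0.0, 0.0]  # would come from get_embedding(question)

episode["similarities"] = episode["embedding"].apply(
    lambda v: cosine_similarity(v, question_embedding)
)
episode = episode.sort_values("similarities", ascending=False)
```

After sorting, the house-buying context lands on top, which mirrors the 0.87 / 0.79 / 0.76 ordering shown in the video.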
this DataFrame is now sorted by the answers most similar to my question: 0.87, 0.79, 0.76, in descending order. The top one should be related to my question about buying a house with cash, and indeed it is; if I expand it, you'll see it contains "with interest rates so low, should I instead take out a loan to finance the house and invest the capital in the stock market?" So the top result is the most relevant to my question, and the second result might be related as well; another one is about paying off a mortgage. So you can see how we're able to find the relevant text within the podcast that applies to our question. The other thing you'll see is that I call .head(5), which returns just the top five most relevant results: I order the whole DataFrame by similarity, then keep only the top five, stored in the DataFrame called episode. Now I take those top five and use them as my entire context for answering the question; this is the knowledge base that applies to my question, and I'll use it to call GPT-3. I build a list called context, take all five of those context blocks, and join them together into one big blob of text with lots of information about house buying and low mortgage rates: the text most similar to my question, extracted from this podcast, all in one place. That's my larger context, and now all that's left to finish this up is to call GPT-3. I can use the OpenAI API that we discussed in video three of the series: I call openai.Completion.create,
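That head-and-join step might look like the following; the column names and the blank-line separator are assumptions on my part.

```python
import pandas as pd

# stand-in for the episode DataFrame after sorting by cosine similarity
episode = pd.DataFrame({
    "context": ["cash vs. mortgage talk", "rates are low right now",
                "paying off the house early", "building a bond ladder",
                "index funds vs. stock picking", "a crypto tangent"],
    "similarities": [0.87, 0.79, 0.76, 0.60, 0.55, 0.30],
})

top5 = episode.head(5)  # keep only the five most relevant chunks
# join them into one blob to paste into the GPT-3 prompt as context
context = "\n\n".join(top5["context"])
```

The least relevant chunk never makes it into the prompt, which keeps the context focused and within the model's token budget.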
and I need to give it a prompt. What prompt? I say: "Answer the following question using only the context below. Answer in the style of Ben Carlson, a financial advisor and podcaster." I can also add something like "If you don't know the answer for certain, say 'I don't know'," which helps prevent it from making things up. Then I add the context I extracted above, then the question stored in my variable, all as part of the prompt, and I tell it to put the answer right at the end. If I run that, it gives me an answer: it's hard to take it back if you buy the house in all cash; if you have low payments and a tax advantage, you have to think about it in those terms; and it provides a lot of things to think about. So there you go, mission accomplished: we built a question answering bot based on text contained within a podcast. We used OpenAI Whisper to extract all the text from a podcast along with the timestamps and construct a nice flat file; we built a DataFrame with start and end times for each question and extracted the context, the answer to each question, from the transcript; we calculated embeddings using the OpenAI embeddings API; and then we calculated an embedding for a question, found the most relevant text, and used OpenAI's API to generate a response to a question that was discussed in our corpus of text. You can see how you could apply this concept to your own set of documents, your own audio or video, or your own private internal documents, and maybe build some cool value out of it: a question answering bot for your ticketing system at work, a support chatbot, all kinds of cool stuff, using this very
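Putting the prompt together might look like this. The exact wording and the `Completion.create` parameters are my reconstruction from the video, using the legacy completions API that was current at the time.

```python
def build_prompt(context: str, question: str) -> str:
    """Assemble the restricted-context, styled prompt described in the video."""
    return (
        "Answer the following question using only the context below. "
        "Answer in the style of Ben Carlson, a financial advisor and podcaster. "
        "If you don't know the answer for certain, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Q: {question}\nA:"
    )

prompt = build_prompt(
    context="With rates so low, should I take a loan and invest instead?",
    question="Should I buy a house with cash or get a loan and invest the rest?",
)

# the actual call needs an API key and the openai package, so it's shown but not run:
# import openai
# response = openai.Completion.create(
#     model="text-davinci-003", prompt=prompt, max_tokens=256, temperature=0.7
# )
# answer = response["choices"][0]["text"].strip()
```

Ending the prompt with "A:" nudges the model to continue with the answer rather than restating the question.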
simple concept of embeddings, which has been provided via a nice, easy-to-use API. That's it for now. In the next video I'll show you how to build a pipeline to batch all of this up, transcribe the episodes in batch, and build a user interface so you don't have to run it from a notebook: something you can deploy on the web so end users can type a question and get an answer without knowing anything about embeddings. So let's wrap this up in the next video. Thanks a lot for watching; see you in the next one.
Info
Channel: Part Time Larry
Views: 17,817
Keywords: openai, word embeddings, question answering, q and a, finance, trading, financial advisor, podcast semantic search, youtube search, vector search
Id: hR8xhJgKcJ0
Length: 28min 52sec (1732 seconds)
Published: Sat Feb 04 2023