Building a Q&A Chatbot using GPT and embeddings

Captions
Okay, I'm now officially recording this session. I essentially started by explaining what Buster is: we made a request to Buster on Hugging Face and got an answer with all the different sources, so I was quickly showing how it works. Now I'm going to explain in detail how Buster actually works through this live stream.

First, a high-level overview. There are pretty much three steps to make this work (and I hope the font size is readable for everyone):

1. Collect the source of documentation, parse it into some kind of readable format, and embed it using a large language model. The documentation we went for was Hugging Face Transformers, for a few reasons. First, it's a really popular library, so we had a feeling a lot of people could benefit from it. But also, their documentation is really good: it's well written and clear, and if you prompt a completion engine with a lot of clear documentation, there's a good chance it will find the relevant bits.

2. Build a document retrieval system. We take the Hugging Face documentation, scrape it, parse it, and embed it using a large language model. Then we take a user's question, embed it with the same model, and compare it against those embeddings. The comparison gives us a score — the score we'll be talking about is a cosine similarity, which I'll go through in detail later — and that lets us retrieve the top documents we're interested in.

3. Using those top documents, craft a prompt: we take the raw text, append it to an engineered prompt for GPT, and add the user's question. We explain to the chatbot what it's supposed to do — "you're a chatbot, you're supposed to answer questions, now answer the following question" — followed by the user's question. In practice this prompt is a lot more "sophisticated", so I'll also show you some of the hacks we put into it to make it more robust.

That's the high-level overview of how the whole thing works, and it's all wrapped neatly into a Python API, which we'll also go through so you can use it on your own projects.

Okay, now a little background for those of you without much experience in natural language processing. I'm not going to go super deep into the details — there are many really good tutorials out there explaining what word embeddings are — this is just to give you some intuition. Most modern large language models, GPT included, all work on embeddings. Someone's asking to show point two again — sure, a quick summary: point one, we scrape the documents; point two, we compare the question to the embeddings and
retrieve the top documents based on that — and I'll go through all of these points in much more detail later. Then the third point: once we have the retrieved documents, we formulate the prompt.

Okay, so now, word embeddings. Most of these large language models, GPT included — essentially most Transformers, and I'd say almost all modern Transformers these days — operate on embeddings. They're not necessarily word embeddings per se; usually they're token-based, but the idea is the same: you take a word and embed it into some kind of vector representation. A vector is just a point in a continuous space, a 1 × n array, and the machine learns automatically how to organize all these different words and concepts in vector space. We're not going to go into how that works under the hood — this is where you have things like masked language modeling, next-word prediction, all the different training tasks that were come up with over the years — and there's a really good summary linked here that I recommend if you're interested in how words get turned into vectors.

What's really useful is that once these words are in vector space, we can measure things between them: how far are they from one another, how similar are they? That's what we'll be using in this tutorial. But for now, let's focus on the embeddings themselves.

Here I have a bit of code that shows how we can play with some embeddings. For now we're not using the OpenAI API, simply because, as great as it is, you pay every time you use it. I'll be sharing this notebook after the stream and I want most people to be able to run most of these cells without paying OpenAI, so I'll be using a lot of open-source tools throughout the live stream. At the end, if you do want to use Buster, you will need your own OpenAI key.

Here we're using the sentence-transformers library, honestly because its API is really simple. I already installed it, but if you haven't, just run pip install sentence-transformers. We're going to load one of their models — all-MiniLM-L6-v2; you can check their documentation for what it is exactly. We'll pretend we have a document: "Welcome to this tutorial by jerpint" — jerpint, that's me, that's my handle on GitHub, Twitter, and everywhere else. What we can do with this sentence is use the model to encode it into a tensor — in this case just a vector. I print the shape and the actual embedding: the shape shows it's a tensor of shape 384, meaning a 1 × 384 vector. If we print the contents of the embedding, to most normal humans it means absolutely nothing, but to the model it says: this is where this sentence should be placed in vector space, based on every other sentence the model has ever seen. Look at the values themselves — say −2.9 × 10⁻², or whatever a given value might be. If I change the text to "welcome to this cooking tutorial", the values change, just because of what I put in the text. Based on the text and its semantics, these models have learned where to place things in vector space, and we're going to use what the model has learned to our advantage.
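Here's a minimal sketch of that notebook cell (same library and model as in the stream; the example sentence is the one from the notebook):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

document = "Welcome to this tutorial by jerpint"
embedding = model.encode(document, convert_to_tensor=True)

print(embedding.shape)  # torch.Size([384])
print(embedding[:5])    # first few values of the 384-dimensional vector
```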
So that's the high-level overview of what an embedding is. When we collect the data, we take all of the documentation, and for every little bit of documentation we can find, we create an embedding just like this one. We'll come back later to how we use those embeddings, but for now I'll show you what it looks like to actually collect the data.

The first thing we need to do is collect all the documents and embed them. We're using the Hugging Face Transformers documentation — here's the link — and you can see all these different topics: natural language processing, computer vision, and so on. A reason I really like this documentation for the project, like I said, is that a lot of it is in plain English, and GPT is really good at understanding plain English and simple terms. One of the harder sources would probably be the Python documentation — for those of you who've read it, it's very terse and can be hard to make sense of. This one is very clear, emojis everywhere, really meant to be a friendly source of documentation.

What we do next is essentially shove it all into a pandas DataFrame. For those not familiar with pandas, it's a really neat library for working with CSV-like tables you can sort and manipulate. I should mention first that the Hugging Face documentation, like most of their code, lives online — it's an open-source library. For example, if you go to the Hugging Face Transformers GitHub, under the docs source (English), there's a page on BERTology. I think "BERTology" is probably a term Hugging Face came up with — I don't know if it's a real academic term — but it's described as a growing field of study concerned with investigating large-scale Transformers like BERT. Anyway, they explain what BERTology is, and the page is just a markdown file — you can see the .mdx extension, some kind of markdown format. If we click Raw on GitHub, we see the source for the page, and it's just standard markdown. So you could take this text, copy it, and paste it
into a text block in the notebook, which is what I did here. Now we have a string representing this documentation. We also want a name for the section — "bertology" — and the URL associated with it; the URL can be tied to a version, but you can also point it directly at main. The idea is to build a DataFrame with just the name, the URL, and the text, and you'd do this over and over for every single page. Let's do it for this one page: if I run documents_df, you see we have one entry with the section name, the URL, and the plain text. If I print documents_df.text in full, you can see it's really just the raw plain text of what we found — less readable for us, but much easier to handle for a machine.

Now the idea is that you'd do this for the entire documentation. In theory you could do exactly what I did here, one page at a time, but you'll find the documentation is very long — for Hugging Face and for most projects. So in this repo we implemented a bunch of scrapers. We support two kinds at this point: one for the Hugging Face documentation, which was a bit custom, and scrapers for any documentation in the Read the Docs format, i.e. documentation built with Sphinx. We don't have much information yet on how to use these scripts, but we'll add it to the repo soon, and it should be relatively straightforward to follow along.

So we could have done this one page at a time, but that's time-consuming, and the scrapers do it automatically. Once we have all of this, the idea is that for each of our documents we compute an embedding. Here I use the exact same sentence-transformers model from earlier, but in the actual project we call OpenAI instead — on their website they recommend text-embedding-ada-002 for sentence embeddings, so that's what we use. We call it on the entire DataFrame: we create a new embedding column and apply model.encode — exactly the same function call we had before on a single string; apply is just a fancy way of using pandas on an entire column. If we run this, you can see the embedding is now stored in the embedding column. So once you've scraped the entire documentation, each section also has an associated embedding — as in the sketch below.
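A minimal sketch of those two cells, reusing the model object from earlier (in Buster itself, the encode call is swapped for OpenAI's text-embedding-ada-002; the URL and the bertology_text variable are illustrative):

```python
import pandas as pd

# One row per documentation section. The URL shown is illustrative —
# check the actual docs site for the exact path/version.
documents_df = pd.DataFrame(
    [
        {
            "name": "bertology",
            "url": "https://huggingface.co/docs/transformers/main/en/bertology",
            "text": bertology_text,  # the raw markdown pasted above
        }
    ]
)

# Embed every section; in Buster this call goes to OpenAI instead.
documents_df["embedding"] = documents_df.text.apply(
    lambda text: model.encode(text, convert_to_tensor=True)
)

print(documents_df.head())
```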
I'm not going to scrape the whole documentation live right now, for a few reasons. There are a bunch of steps involved: you need to clone the repo locally, and you need to actually build the documentation into HTML, because we use Beautiful Soup to do the parsing — it's not enough to just point at the raw markdown files. You could technically point at the raw markdown files hosted on GitHub, but because of how they build their documentation, a lot of references only get generated when you build it locally. So we wrote a whole script with Beautiful Soup to do the parsing; it's all in Buster, and there's more info at the link here, where you can see the parser scripts. This is where we did most of the parsing — Adrien was really quick at getting the parsing working — and the Hugging Face one in particular was a bit harder because they're not using Sphinx; they have their own documentation builder, and it shows, because their documentation is really nice.

The nice thing, though, is that we host the result: the fully parsed documentation for Hugging Face is hosted on our Space, so you can just download it. Right here you can call wget and you'll get a .tar.gz file, which is the format we used to pickle the DataFrame — if we saved it as a plain CSV it was huge, really huge, because the embeddings are big. You can just run this cell to download the file; then I remove the archive so we don't pollute our output, and print the head so you can see the result of having done this before. There's one extra column here that we add in our code — not strictly necessary, but nice to keep track of — which is how many tokens each text represents when you call the OpenAI API. This is useful because there's a maximum number of tokens you can send to OpenAI — about 4,000, and that includes everything in your prompt — so you want to monitor how many tokens you're passing. If you dumped the whole documentation into one giant block of text, it would not go through; you'd get an error from OpenAI saying you have far too many tokens in your request. So here are the first few rows of the file: the page names (Community Resources, Community Notebooks, ...), the associated URLs, the plain text, and the embeddings — these embeddings were computed with OpenAI. And that's the first big step done.
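As an aside, a sketch of how you might compute such a token count yourself with OpenAI's tiktoken library (the stream doesn't show this code, so treat it as illustrative; cl100k_base is the encoding used by text-embedding-ada-002):

```python
import tiktoken

# cl100k_base is the encoding used by text-embedding-ada-002.
enc = tiktoken.get_encoding("cl100k_base")

def n_tokens(text: str) -> int:
    """Number of tokens `text` consumes in an OpenAI request."""
    return len(enc.encode(text))

documents_df["n_tokens"] = documents_df.text.apply(n_tokens)
```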
Okay, so now that we have all of this done — let's take that step as accomplished — the next thing we want is document retrieval. This is step two: how do we measure the similarity between two documents? We have all the documents and all their embeddings precomputed, and now a user asks a question. Suppose the user's question is embedded as a vector d1, and take one source of documentation embedded as d2: what's the distance between them? A really popular and easy way to compute the distance between two embeddings is cosine similarity, where you essentially compute the angle between the two vectors. I'm linking some really good pages on cosine similarity — it's something you'd typically see in a linear algebra course — and it's really just a measure of whether two vectors are close together in vector space or far apart. The value ranges between 1 and −1: 1 means they point in the exact same direction, and −1 means they point in almost opposite directions.

Once we have these embeddings, the idea is that we take a user's question, embed it exactly the same way we embedded the documents, and compare them one by one: for each document embedding, we measure its cosine similarity to the question, sort from highest to lowest, and retrieve however many documents we want. In our code this is something you can play with; for this demo we retrieve three documents — it's a setting in Buster's configuration — so we ask for the top three embeddings that best match the query.

Someone's asking if I'm going to post this tutorial — yes, I will. We'll share it, probably on the Discord, the whole stream will be posted on YouTube as well (all of this is being recorded), and the notebook will be posted, so you'll have access to all of this.

Now I'll show you what these similarity scores look like in practice. We'll use the similarity scoring utilities from the sentence-transformers library. I have a quick function that, given two input strings, prints the similarity between them: first it prints the documents in plain text, then it embeds each one with our same language model, computes the cosine similarity with one of the utility functions, and prints it out. The function, along with the four example documents we'll compare, looks roughly like the sketch below.
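A minimal reconstruction of that cell (the helper name is mine; sentence-transformers provides util.cos_sim for the cosine similarity):

```python
from sentence_transformers import util

def print_similarity(doc_1: str, doc_2: str):
    """Embed two strings and print their cosine similarity."""
    print(f"doc 1: {doc_1}")
    print(f"doc 2: {doc_2}")
    emb_1 = model.encode(doc_1, convert_to_tensor=True)
    emb_2 = model.encode(doc_2, convert_to_tensor=True)
    print(f"cosine similarity: {util.cos_sim(emb_1, emb_2).item():.2f}")

documents = [
    "Dogs are a furry type of animal",         # 0
    "Cats are also very cute and fuzzy",       # 1
    "The current economic situation is dire",  # 2
    "I was never really good at saving money", # 3
]

print_similarity(documents[0], documents[1])
```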
So here I have four different documents: "Dogs are a furry type of animal", "Cats are also very cute and fuzzy", "The current economic situation is dire", and "I was never really good at saving money". You already have some intuition about what should be related to what, so let's start printing things out. Comparing the zeroth document to the first — dogs versus cats — we get a similarity score of about 0.5 with this specific model. Remember, the range is between 1 and −1, so crudely speaking, 0.5 can be considered high on this scale. Now, before comparing the first sentence to the last one, let's change "cats are also very cute and fuzzy" to "dogs are also very cute and fuzzy". This should be more similar — we had 0.5 — and indeed, going from dogs back to dogs, the score goes even higher. Another interesting thing: the maximum score, in theory, is when it's the exact same vector. If I compare "Dogs are a furry type of animal" to itself, we get what we'd expect: a similarity score of exactly 1.

Now let's compare the first sentence to the last one: "Dogs are a furry type of animal" versus "I was never really good at saving money". These two sentences, as you'd expect, aren't really related at all, and we score into the negatives. But what if I compare something about the economy with the sentence about me saving money? Taking "The current economic situation is dire" against the last document: that's already a bit more related — not super related, but more so than it was to the dog and cat sentences. So you can play around with all of these and get a good feel for how it works.

Someone's asking if he's the only one having audio trouble — can everyone else hear me? Loud and clear? Great, I'll continue. By the way, I really want to insist: if you have questions about any of this, or you're lost, please ask. It's always better to have a dialogue than for me to just talk to myself, so don't hesitate — I'm more than happy to answer. Someone's asking about the requirements — we'll go through them a bit later for people who want to try this at home, but there aren't many: basically openai, pandas, and a few other libraries. And if you're using the OpenAI models, the really nice thing is you don't need a GPU, because you'll be using OpenAI's GPUs.

Let's continue. Going back to what we want to do with Buster, it looks more like what we have in this cell. You have a question that comes from your user — say, "What is the best model to deploy on text?" — which could be a very legitimate question to ask Buster. And you have a series of documents scraped from the Hugging Face library: one document might contain "BERT is a large language model trained on text", another "Vision transformers work really well on images", another "Multi-layer perceptrons are a type of neural network". If we compare the question to every single document, which one would you expect to score highest? There are already a few hints just from the words they contain — "text" appears in both the question and the first document — but even semantically, "best model to deploy on text" and "BERT is a model trained on text" are highly similar. So let's run this: we essentially loop through all the documents and compare the question to each one, as in the sketch below.
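A sketch of that loop (reusing model and util.cos_sim from the previous snippet):

```python
question = "What is the best model to deploy on text?"
candidate_documents = [
    "BERT is a large language model trained on text",
    "Vision transformers work really well on images",
    "Multi-layer perceptrons are a type of neural network",
]

question_emb = model.encode(question, convert_to_tensor=True)
for doc in candidate_documents:
    doc_emb = model.encode(doc, convert_to_tensor=True)
    score = util.cos_sim(question_emb, doc_emb).item()
    print(f"{score:.3f}  {doc}")
```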
You can see the first sentence gets a relatively high similarity score compared to all the others. How meaningful these scores are really depends on the quality of your model. With the OpenAI embedding model we tend to see similarity scores around 0.9 when concepts are very related, and maybe 0.6–0.7 when they're less related. It really depends on how the models were trained — not all models output similar cosine similarities for similar sentences. It's an artifact of training: whatever the model optimized, whatever its final weights ended up being, determines these similarity scores and how good they are.

Another thing I should point out: a lot of these models weren't trained to score well on embedding similarity — they were just trained to predict the next word. It's an interesting application that we can reuse their embeddings, but it's not what they directly optimize, so keep in mind it's a nice artifact rather than the design goal. In BERT you could argue it was partly by design, when fine-tuning on different tasks with the CLS token — the CLS token aggregates the embeddings and essentially a regression is fit on those values — but in pretraining these models are really just trained to predict a missing word or the next word.

Someone's asking: could this work with Flan-T5? Yes, it can, and we're going to use Flan-T5 in the very next example. This approach is very agnostic to the kind of model you use, and Flan-T5 comes in all sorts of sizes.

So we've been through the first two parts, and the next big part is actually generating answers. This is where generative models like GPT come in: they complete text. In Buster we use OpenAI's GPT-3.5 completion model — the text-davinci-003 model currently available through the API. But in this example I'll use Flan-T5 to generate responses, again because it's open source, so when I share this notebook you can run it without giving OpenAI an API key. We take the model straight from the Hugging Face Transformers library — Google's flan-t5-small. As the name suggests, this is a very small version of T5; I believe Flan-T5 goes up to XXL, and obviously the bigger it gets, the more weights you have to load, the more time it takes to run, and the beefier your computer or GPU needs to be to run it in real time. So we load the Flan model and its tokenizer — this is standard Hugging Face stuff — and the idea is that given an input, the model can generate the rest. A reconstruction of that setup is sketched below.
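A minimal sketch of the Flan-T5 demo, combining the loading cell and the generation cell described next (standard transformers API; the generation settings are my assumptions):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
flan = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Prompt = retrieved document + the user's question, as one string.
document = "BERT is the best model to deploy on text."
question = "What is the best model to deploy on text?"
inputs = tokenizer(f"{document} {question}", return_tensors="pt")

outputs = flan.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```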
Here's what we're going to do: we prompt our model with "BERT is the best model to deploy on text" (there was a typo in the original cell), and the user's question is "What is the best model to deploy on text?". We combine the two — the document plus the question — into one string; that's the input we give to Flan, and we ask it to complete the text. This is a really easy thing to answer in theory, but this is flan-t5-small, so it's going to be just okay at it. Let's run the cell — it thinks for a bit while loading everything, including the tokenizer — and you can see it responds "BERT". It's really just spitting out, kind of by design, that BERT is the best model to deploy on text.

Now let's play with it. Instead of "BERT is the best model to deploy on text", let's put "GPT is the best model to deploy on text" and ask the same question. You can see it's really not as good as GPT at generating long, coherent sentences, but if you took Flan-T5 XXL I'm sure you'd get much better results — keep in mind this is one of the smallest models available. And if we set the document to "There is no good model to deploy on text" and ask the question again, it just says "no". Ideally it would have said "there is no best model to deploy on text", but you can see it already has a bit of an understanding that a best model shouldn't exist. So this is what we do with Buster to prompt it: we pass in all the retrieved documents as context and then ask it to formulate the rest.

This is what we call prompt engineering, and I would argue it's a big part of what makes Buster so powerful and useful. Honestly, it was surprising at first how well Buster was answering, but the more we played with the prompt, the better the results got. If you're not familiar with prompt engineering, there's a really good resource I recommend: learnprompting.org — shout-out to Sander, who I think might be in here, the brains behind that whole thing. You can learn everything about prompting there, for any kind of generative model. The idea behind prompt engineering is that you give instructions in plain English to kind of hack GPT into answering the way you want.

In the earlier image I showed you, we had the retrieved documents, some kind of prompt, and the user's question. Now I'll show you the actual prompt we used for the Hugging Face library — it's a much more intricate prompt. The first thing we do is give imperative directions to Buster: we tell it that it's a chatbot assistant. (Here it technically says
not Slack — we did deploy this on Slack as well, but this one is a Hugging Face Space; it doesn't really matter, it works just as well.) So we say: you are a chatbot assistant answering technical questions about Hugging Face Transformers, a library to train transformers in Python; make sure to format your answers in Markdown format. This one is really neat: for those of you who've played enough with Buster, you've probably noticed it gives you code snippets and links that are usually properly formatted — that's all GPT doing it natively, because we tell it to format answers in Markdown, including code blocks and snippets, and GPT understands Markdown well since it's such a popular format online.

This next one is important: GPT tends to hallucinate, a lot — a well-known behavior of GPT and ChatGPT. It's not ideal, but it will very gladly make up links and URLs that look real but don't exist. We already cite the sources — we know which documents we retrieved, so we can cite them — but we saw early on, while playing with it, that GPT was adding its own made-up sources and links on top, which is really not good; you want the cited sources to be real and to exist. So we added the instruction: do not include any links or URLs in your answers.

Another prompt-engineering hack: GPT models in general are very happy to make up an answer to something they have no idea about, or that's completely unrelated to the task at hand. So we also say: if you don't know the answer to a question, or if it's completely irrelevant to library usage, simply reply with "This doesn't seem to be related to the Hugging Face library." A paraphrase of the full prompt is sketched below.
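Putting together the instructions just described, the prompt reads roughly like this — a paraphrased reconstruction from the stream's description, not the verbatim prompt in the repo:

```python
text_before_prompt = (
    "You are a chatbot assistant answering technical questions about "
    "Hugging Face Transformers, a library to train transformers in Python.\n"
    "Make sure to format your answers in Markdown format, "
    "including code blocks and snippets.\n"
    "Do not include any links or URLs in your answers.\n"
    "If you do not know the answer to a question, or if it is completely "
    "irrelevant to library usage, simply reply with: "
    "\"This doesn't seem to be related to the Hugging Face library.\""
)
```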
In fact, for those of you who have played with Buster, let's go try it now and see how well this works. Let's ask: "What's a good recipe for pizza and pasta?" The processing is much slower now — I think fewer people are using it... okay, that didn't actually work that well; this is a failure case. Let's try "What's the closest planet to our sun?" — hopefully we get something sensible... no. I don't know why it's answering this — it's giving me something about parsley — I'll have to check what's happening; usually it's supposed to decline to answer. Let's try something completely irrelevant: "What is larger, an elephant or a mouse?" No. Okay, there's maybe a bug I have to go figure out; it could be because of a new deployment we did recently. Obviously, when you show things live, they tend not to work, but I'm telling you, this worked in the past: it was really good at saying "this doesn't seem to be related to the Hugging Face library." I'll investigate after the stream, but in theory, when this bot works the way we had it, it was generally pretty good at saying "I don't know what you're talking about." Murphy's law — only during the live stream do things not work.

Back to the prompt engineering: another really important piece is that we add the retrieved documents themselves. On top of this well-crafted prompt, we prepend the retrieved documents in plain text — for now it's literally raw plain text that we copy and append before the prompt — so that Buster has context about what we're asking. GPT-3.5's training data stops in 2021, I think, and by now there are maybe one or two more years of Transformers documentation and models it wouldn't be aware of. By adding these documents right up front, we give it the context it needs, so it can answer about more modern models as long as they're in the Hugging Face library. It's a way around the fact that GPT's knowledge stops at 2021: we hand it the most recent knowledge it needs, taken from the retrieved documents. And another neat bit: because we know which documents were retrieved, we have a section in the code that simply appends all the URLs afterwards, with their relevance scores, so people get a good idea of what those scores are.

Okay, so now, putting it all together. We have all the bits and pieces, but how do we make this work as a nice, clean API? When I first started hacking this together, we had a whole bunch of different functions, and at some point the code starts spaghettifying out of control; you want to do some house cleaning and make sure it all goes into an API that's nice and easy to use. This, in my opinion, is a really good use case for ChatGPT, and I actually used it. I asked: I'm building a chatbot in Python and want a clean class interface for my project. The bot has to first compare a question to a series of documents, pick out the best-matching documents, then generate a response based on the retrieved documents, and also format the output once generated. Give me some skeleton code where I can fill in the actual functions, so that my interface is clean and easy to maintain. And ChatGPT gave me a really nice-looking skeleton, roughly like the sketch below.
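A hypothetical reconstruction of that kind of skeleton — not ChatGPT's literal output — using the method names Buster ended up with:

```python
import pandas as pd

class Chatbot:
    def rank_documents(self, question: str) -> pd.DataFrame:
        """Embed the question and rank documents by cosine similarity."""
        ...

    def prepare_prompt(self, question: str, matched_documents: pd.DataFrame) -> str:
        """Prepend the retrieved documents and the engineered instructions."""
        ...

    def generate_response(self, prompt: str) -> str:
        """Call the completion model on the assembled prompt."""
        ...

    def add_sources(self, response: str, matched_documents: pd.DataFrame) -> str:
        """Append the retrieved URLs and their relevance scores."""
        ...

    def format_response(self, response: str) -> str:
        """Apply output formatting (Markdown, separators, footer text)."""
        ...

    def process_input(self, question: str) -> str:
        """The 'main' of the chatbot: retrieve -> prompt -> complete -> format."""
        matched_documents = self.rank_documents(question)
        prompt = self.prepare_prompt(question, matched_documents)
        response = self.generate_response(prompt)
        response = self.add_sources(response, matched_documents)
        return self.format_response(response)
```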
It was kind of cool, too, because I essentially had a lot of these functions already, scattered all over the place, and they were already fairly atomic — but now I had a good picture of how to put them together so it looked clean. I did not ask it to write any code for me — it writes simple functions pretty well, but for this I just wanted inspiration for a nice interface that someone could read and make sense of. So I used it to inspire the structure, and this is what the actual code ended up looking like. It's a bit more involved than what ChatGPT gave me, but it was a really good first step: we have a rank_documents function, a prepare_prompt function, generate_response, add_sources, format_response, and finally process_input, which is kind of the "main" of the chatbot. I have to say this was pretty useful, because at some point you look at your own code too much and start fearing you're over-engineering it, so having something that's seen hundreds of millions of lines of code suggest a skeleton was genuinely helpful.

So now, how do you actually use Buster itself? That was all the bits and pieces; we could go through the code itself if you're interested, but I'll show you how to interact with our API, because we wrapped it all up in a pip-installable package you can import and use directly. First things first: everything before this point didn't need a valid OpenAI API key, but for everything from here on, you do need one. You'll want to pip install the package — it's not yet available on PyPI (maybe we'll put it up at some point), but for now you can install it with the pip command shown here, in a notebook or locally. You can also install the package from source; it's still in beta, development mode, so if you do that you can edit the source code, mess around with it, debug it, adapt it to your needs. Either way, it's relatively easy to install.

Someone did ask what the requirements are, so let's look at Buster's requirements file. Beautiful Soup is for the scraping; numpy and pandas are basic stuff; and almost every other dependency is there because OpenAI requires it. You can pip install openai[embeddings], which would have been a lot easier, but something breaks our CI when we do that, so we list the requirements their library needs explicitly. Some of them really aren't necessary for this project — matplotlib and plotly, we're not doing any plotting — but the OpenAI embeddings dependency imports them, so your code breaks if you don't have them. Anyway, if you pip install the requirements file, or just pip install the project, it takes care of everything you need.

So, assuming you've done all that: set your OpenAI key. I have mine set as an environment variable — typical OpenAI API stuff. Here I'm just checking that the API key is set; if it isn't, it defaults to None and raises an assertion error, and you won't be able to use the rest. Then you download the weights.
This is something you would do yourself if you wanted to use a different source of documentation: make your own set of embeddings — scrape the documentation you're interested in, embed it, and shove it all into a DataFrame. Here we provide them already, hosted on the Hugging Face Space, so the cell downloads it all. Now I'll show you how we used Gradio. Gradio is what we use to deploy the Hugging Face Space, and this is essentially what a Gradio app looks like: here's the entire configuration we use for the chatbot. This was part of the whole refactoring; you have to specify a bunch of things.

First, the file your documents live in. This can be a tar file or a CSV file — as long as pandas can read it, through read_csv or read_pickle, it's supported. Then the "unknown prompt": this is something I haven't really talked about yet. It compares the response given by GPT to this unknown prompt, and if it scores high enough, the bot says "I don't know". I suspect the reason it wasn't working earlier is either that this prompt isn't well configured on the app, or that we're being a bit too strict with the cosine similarity between the unknown prompt and the response — I'll check that offline. Take my word for it, I guess: it definitely works; it's just not working today for some reason.

Then you specify the embedding model — the OpenAI embedding model, text-embedding-ada-002; you can go read its documentation. Then a bunch of settings you can play with: the maximum number of documents to retrieve, set to 3, which we find is usually a good default; a threshold, the minimum cosine similarity score for a document to count as a source — below this number the document is ignored completely; and the maximum number of characters the retrieved documents can span. If we have three documents, each with however many characters, there's a limit to how much you can send to OpenAI. It's really measured in tokens, so this is a rough guesstimate: we want to limit the entire prompt to about 4,000 tokens, and 4,000 tokens can be roughly thought of as about 4,000 characters — not exact, but close enough. We take about 75 percent of that for the documents (otherwise they get truncated), and to that we append the prompt and the text before the prompt.

"Said by all programmers every day" — yeah, I know, it can't always work. At least the Space is up and running; it would have been even funnier if I'd pinged the Space during the live stream and it didn't respond at all, or started giving completely unrelated answers. It is what it is; we'll get it working and I'll post an update when it is. Actually, I have the logs of every single request that gets made, and I've seen it work in the logs, so maybe later I can go fish out an example and say: see, I told you it used to work —
but right now, for the live stream, it has decided not to work.

Okay, the next things we pass are the parameters for GPT's completion engine. We tell it to use the text-davinci-003 model — sorry, not 002; I think 003 is the latest one, and it's what they call GPT-3.5 — and the maximum number of tokens it can output: we want answers to be complete, but at some point we want it to stop, so this is where we tell it to stop. The separator is kind of important depending on how you're serving your app: in Gradio, for some unknown reason, line breaks render differently — it doesn't like "\n", it only likes the <br> tag, an HTML thing — so we use <br> as the separator. But if you're deploying this as any other kind of bot, on Slack or Discord or anything that supports Markdown, you can use "\n" instead. The link format is for the retrieved sources — the only thing we format ourselves. You can have it formatted in Markdown, but Slack has its own formatting for links when you're a bot: when you're not a bot on Slack everything is Markdown, but a bot formatting a link needs a special Slack format, so we support that too. Then there's a generic text to put after your response: it's usually good to remind your users that you're a bot and not always perfect — here, "my answers aren't always perfect". I've seen it myself: if you ask questions that are more technical or challenging, or require more chain-of-thought reasoning — if the answer isn't directly in the source documentation — Buster can make stuff up, so it's really important to remind users that the answers aren't always perfect and Buster is a bot. And then the text before the prompt — that's the prompt engineering we saw earlier. The whole configuration, roughly, is sketched below.

Then we instantiate our bot, and everything else here is Gradio stuff — how you use Gradio to deploy your web app. There's a chat function, a little formatting hack for Gradio (because otherwise it was breaking our Markdown; everything else is formatting taken care of by GPT), and then the actual interface: the examples, the links to GitHub, all the stuff you see on our interface over here. What's really nice about Gradio is that you can test everything locally. I'll run the cell now in debug mode, so I also have access to all my logs right here, and you can see the exact same interface running directly inside the notebook. You can debug the whole thing without launching it to a Hugging Face Space every single time — which is really nice, because otherwise you wait so long on every deploy while it builds everything.
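Gathering the settings described above into one place — the field names here are my approximations of Buster's configuration as described in the stream, not guaranteed to match the repo exactly:

```python
# Hypothetical reconstruction of the configuration described in the stream.
buster_cfg = dict(
    documents_file="hf_documents.tar.gz",  # anything pandas can read
    unknown_prompt=(
        "This doesn't seem to be related to the Hugging Face library. "
        "I am not sure how to answer."
    ),
    embedding_model="text-embedding-ada-002",
    top_k=3,           # number of documents to retrieve
    thresh=0.7,        # min cosine similarity for a doc to count as a source
    max_chars=3000,    # ~75% of the ~4,000-token budget, counted in characters
    completion_kwargs=dict(
        engine="text-davinci-003",
        max_tokens=200,  # illustrative value
    ),
    separator="<br>",        # Gradio wants HTML line breaks; use "\n" elsewhere
    link_format="markdown",  # or Slack's special bot link format
    text_after_response="I'm a bot and my answers aren't always perfect.",
    text_before_prompt=text_before_prompt,  # the engineered prompt from earlier
)
```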
So now you can start asking questions. Let's try again and see if this works: "What is the meaning of life?" Hopefully it says "I don't know" — but we saw today that for some reason that part isn't working, so let's find out... Well, at least Buster is philosophical, even though our unknown detection isn't working today. It says: the meaning of life is a personal question without a single answer; it's different for everyone; we each have our own unique life experiences and perspectives on what life holds and what purpose it serves. That's actually really good — thank you, Buster. And you know what, maybe it's a good thing the unknown classification is broken, because we got a really uplifting message today.

Just to go back to why I'm pretty sure there's a bug: if you look at the prompt, I really did ask "what is the meaning of life for hugging face", and it's supposed to answer "this doesn't seem to be related". So there's probably a bug somewhere in my code from the refactoring that isn't triggering the canned response. Maybe I'll get a better idea from the logs — you can see all the different logs here, which is also very helpful when debugging the app: everything that happened, including the cosine similarity for everything. You can see that even with prompts that have nothing to do with the documentation, we still score around 0.74. I'd have to play more with the GPT embeddings to understand what that cosine similarity means, because so far I don't think I've ever seen it go much below 0.7, so there's definitely some calibration needed there.

In the logs we can also see the prompt we used: all the documents that were retrieved and added, and then "you are a chatbot assistant... if you don't know, simply answer..." and so on. And this is what I think is broken right now, and I don't know why — I wonder if there was an update to these models; probably not. Usually what I'd expect to see here is a much higher "unknown score". The way the unknown score works is: it takes the response Buster gives — "the meaning of life is a personal question without a single answer" — and compares it to the unknown prompt from my configuration: "This doesn't seem to be related to the Hugging Face library. I am not sure how to answer." If the response embeds close enough to that canned answer, the bot declines. I don't know why that's not happening here — roughly, the check works like the sketch below.
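A sketch of that unknown-detection check as described (the threshold and helper name are mine; Buster does this with OpenAI embeddings rather than sentence-transformers, reused here so the snippet runs standalone):

```python
def response_is_unknown(response: str, unknown_prompt: str, thresh: float = 0.85) -> bool:
    """Flag a response as 'I don't know' if it embeds close to the canned answer."""
    emb_response = model.encode(response, convert_to_tensor=True)
    emb_unknown = model.encode(unknown_prompt, convert_to_tensor=True)
    return util.cos_sim(emb_response, emb_unknown).item() >= thresh
```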
We could probably investigate more with the prompt engineering. This was working a few days ago and today it's not, and that's the annoying thing about working with a closed-source API: we have no idea what changed or didn't change. You can clearly see we prompted it with "if it's not related to the Hugging Face library, just reply with this" — so let's test that directly: "what is the meaning of life for hugging face". Then we'll know whether something is really wrong in the code or something went wrong elsewhere, because at that point GPT is just not following instructions. Let's see what it answers... Yeah, there's definitely something weird going on, because now it says "the meaning of life for hugging face..." — that should score really high on the unknown check, and beyond that, GPT isn't respecting what the prompt tells it to do. Why would it not? I don't know; I'm going to have to look into it. I really do wonder whether the model was updated without us knowing, because seriously, to my knowledge nothing has changed in the last few days and this had been working pretty well until now. "AI is taking over already" — exactly.

Anyway, that's pretty much most of what I wanted to show. I want to put the blame on myself — this is probably some bug in my code — but because it's closed source, I have no idea; the only thing I can do is say "use this model", and if they decide to change the model, I'm not in control of that. There's a chance a bug slipped in, but the API hasn't changed significantly in the few days since we deployed this, so I'm quite surprised to see such different behavior. There's a possibility the model was updated and there's nothing I can do about it, and also a possibility that something in my code isn't working. But from my experience with prompt engineering, when you say "here's a question, now answer with 'this doesn't seem related'", GPT is supposed to follow those instructions, so I'm very perplexed.

So that kind of sums it all up. I'm happy to answer more questions, or cover other things you want me to, but this covers most of what I had to show today in terms of code snippets and how to use the library. We're still in a very early phase with Buster, so we're going to add more docs on how to use it, make it more user friendly, and show more examples. If you have other questions, ask them in the chat and I'll go through them; otherwise, thank you all for being here today. There are a bunch of people typing, so I'll wait a bit for questions.

Oh, that's a good one — there are some good questions coming, let's go one by one. "You are using pandas DataFrames to store your embeddings. How would you do it if you had a huge amount of documents that wouldn't fit in a DataFrame? Which database would be suitable?" That's a great question, and honestly one I don't want to have to think about right now — for now, CSV is the ultimate database we're using, the best database out there. No, it's a really good point. I'm not a pro at setting up database backends, and I'm sure lots of people have good opinions on what to use, but so far the amount of embeddings we
have fits in a CSV file — or rather a pickled file — that's less than 100 MB, so you can easily load it into memory even on most free-tier services, which really minimizes the overhead of keeping it all in a DataFrame. If it started getting out of control, there are definitely tools that let pandas handle larger data, but for sure, if you deployed this as a real app — with users, prompts to track, everything else — some kind of relational database would be a good idea; I just don't have a specific suggestion. Another thing I'd mention for a huge number of documents: right now we're just using the basic cosine similarity function OpenAI provides with their API, but once you're searching millions of documents by embedding, there are libraries built to do that really fast. One of them is the FAISS library, I think from Facebook research. That's probably one of the next steps: really fast, efficient similarity search. I don't know how OpenAI does it under the hood, but I can't imagine their API does anything super efficient, so if you want to scale to millions of documents, think about using a framework built for it — something like the sketch below.
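For illustration, a small FAISS sketch doing cosine-similarity search over the DataFrame's embeddings (normalized vectors plus inner product is equivalent to cosine similarity; assumes the embeddings fit in memory and convert cleanly to float32):

```python
import faiss
import numpy as np

# Stack per-document embeddings into an (n_docs, dim) float32 matrix.
doc_matrix = np.vstack(
    [np.asarray(e, dtype="float32") for e in documents_df.embedding]
)
faiss.normalize_L2(doc_matrix)  # normalized inner product == cosine similarity

index = faiss.IndexFlatIP(doc_matrix.shape[1])
index.add(doc_matrix)

# Embed the query the same way, then fetch the top 3 documents.
query = np.asarray(
    model.encode("What is the best model to deploy on text?"), dtype="float32"
)[None, :]
faiss.normalize_L2(query)
scores, ids = index.search(query, k=3)
print(documents_df.iloc[ids[0]][["name", "url"]], scores[0])
```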
The next question: what's the cost estimate for Ada or Davinci so far? Ada is definitely much cheaper than Davinci, I think by orders of magnitude, so it's much, much cheaper to do the embeddings than the completions. I was surprised at how cheap all of this was. We built the first prototype in about a day, but the actual deployment took about a week, and during that week we were pinging the chatbot a lot, we tried a whole bunch of different documentation sources, we used the OpenAI API extensively, and by the end of a full week of debugging it had cost me less than three US dollars. Think about what that means: if you were deploying a web app with a GPU at three dollars per hour on Amazon, you'd get a pretty good GPU for one hour, while for three dollars we got all the debugging we could possibly do, plus a lot of playing around with the bot, testing its responses, seeing how it worked. So it's really, really cheap to run.

The only downside, which is a bit unfortunate, is that you need a credit card and an API key even to just play around with it. It would be nice if there were some kind of free tier with a few prompts, so I could share this notebook and everyone could run it without having to pay, but there's no way to run anything for free at all; you absolutely need a credit card on your account. So the plus is that it's pretty cheap; the minus is the credit card requirement. They do let you set a maximum allowed spend (I think I set mine at 20 dollars for the month) and they'll send you a warning once you get close to it, so it's well built in the sense that you're not going to bust your limits without knowing. It's not like AWS, where you could have a server running, forget about it, and wake up a few days or a few weeks later to a couple hundred dollars with your name on it, like, oh damn, what did I do? So that's nice.

Then someone asks: is cosine similarity a good choice in such a high-dimensional space? I don't know if it's the best choice, but it is a good choice: it's easy to understand, it runs relatively fast and efficiently, and it seems to work empirically. We didn't test other metrics and distances, though, and the high dimensionality is probably why we always see such high cosine similarity even between things that are unrelated; I think the Ada embeddings are 1,000-something dimensions, and I remember reading somewhere that in very high-dimensional spaces, even points that are very far apart can still look close together under cosine similarity. So cosine similarity is probably not the best, and I'm sure there's plenty of literature on better options, but it's the best thing to quickly hack together, it's supported by most frameworks, and it's what the OpenAI embeddings API recommends. You could definitely get an edge by researching better alternatives, but it's a pretty good one for getting started. I'm also not sure what Faiss uses under the hood; maybe they use something other than cosine similarity for embedding retrieval, so go check out their library and see how they do it.

Someone asks how much longer the live stream is going to continue. Probably not much longer; we've been on here for an hour and I don't have much more to talk about, so I'll finish answering everyone's questions and then sign off. We'll be sharing all of these documents on the Discord, so join the What's AI Discord; we'll post everything in the different channels, and Louis will take care of blasting everyone with all the links.

Someone mentions OpenAI quoted a cost per question of about fractions of a penny and asks what to expect for a paid model. I honestly have no idea how to answer that; whatever OpenAI says the cost might be is the best estimate I can offer. As for running your own A100, I don't even know what they go for right now on AWS; you can also get them on Hugging Face, I think they have some pretty good offers for A100s as well.

Next question: can you host the chatbot locally? Yes and no. You can host the chatbot itself locally, but every time someone uses it, it pings OpenAI. The model itself you cannot host, because OpenAI does not release their models; it's open for everyone to use the API, not open for everyone to use the models. So yes, you can host the chatbot locally, but every request to your chatbot means a request to the OpenAI servers, and you hope you get your answer in time.

Can you use already-scraped content, not just content from a search engine? Yes, you can use content from anywhere, as long as you put it into the right format for the DataFrame.
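That "right format" really is just text plus one embedding per row. Here's a minimal sketch, using the pre-1.0 openai Python package as it looked around the time of this stream and a hypothetical two-document corpus, of embedding arbitrary text into a DataFrame and ranking it against a question with the cosine similarity discussed above:

```python
import numpy as np
import openai  # pre-1.0 openai package; needs an API key (and a credit card on file)
import pandas as pd

openai.api_key = "sk-..."  # your own key goes here

# Any text works as a "document" as long as it ends up in this shape.
docs = [
    "Pipelines bundle a tokenizer and a model for quick inference.",
    "The Trainer class provides a full training loop for PyTorch models.",
]
resp = openai.Embedding.create(input=docs, model="text-embedding-ada-002")
df = pd.DataFrame({
    "text": docs,
    "embedding": [d["embedding"] for d in resp["data"]],
})

# Embed the user's question with the same model.
question = "How do I run inference quickly?"
q = np.array(openai.Embedding.create(
    input=[question], model="text-embedding-ada-002"
)["data"][0]["embedding"])

# Cosine similarity: dot product divided by the product of the norms.
vecs = np.vstack(df["embedding"].to_numpy())
df["similarity"] = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
print(df.sort_values("similarity", ascending=False).head())
```

The highest-scoring rows are exactly the chunks you'd paste into the engineered prompt before the user's question.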
Like I said, at the end of the day you can adapt this to your needs: it's really just plain text; embed it, and you can use whatever source you want.

Someone points out that it's not only a cost problem, that a lot of companies don't want to rely on the OpenAI API. I completely agree, one hundred percent. Honestly, that's the one thing I don't love about this Buster bot: we're completely reliant on the OpenAI API, and everything we send goes to their servers. That's fine and cute for the Hugging Face Transformers library, but if you try to go to a company and sell your services, good luck with their private IP and data. A chatbot like this would be really useful for internal documentation and internal IP, but it should be completely sealed off from the outside world if you're going to ask it sensitive questions. We do have hope coming, though: the people at LAION and others working on Open Assistant, for example. I'm sure it's a matter of weeks until we have a pretty good ChatGPT replica, or a GPT replica in general. There already are models of similar size, just maybe not as good: the FLAN models are completely open source and you can host them, and the BLOOMZ model is completely open source and you can host it. There's just a lot more overhead to hosting these things. Like I mentioned earlier, Adrian and I, who built Buster, had this up and running and answering questions in about a day; if we'd also had to worry about hosting models and running servers, that would have been impossible. And the fact that this cost us three dollars for a full week's worth of debugging: if we'd had to run servers hosting these models ourselves, that would have been an astronomical cost. So there are advantages to using the OpenAI API, but if you're doing something more sensitive, hopefully in the next few weeks we'll see some open-source models floating around and we'll be able to build things that are completely non-reliant on OpenAI.
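As a taste of that open-source route, here's a minimal sketch of swapping the completion step for a small open model running locally via the transformers pipeline. This is my illustration, not how Buster actually works; google/flan-t5-base is tiny and will answer far worse than Davinci, and the documentation snippet in the prompt is made up:

```python
from transformers import pipeline  # pip install transformers

# google/flan-t5-base is a small, fully open, instruction-tuned model;
# larger flan-t5-xl / xxl checkpoints answer better but need more memory.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

# Same recipe as before: retrieved documentation + instructions + question.
prompt = (
    "You are a chatbot answering questions about a software library.\n"
    "Documentation: Pipelines bundle a tokenizer and a model for quick inference.\n"
    "Question: How do I run inference quickly?\n"
    "Answer:"
)
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```

Everything here runs on your own machine, so nothing leaves your network, which is the whole point of the sealed-off scenario above.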
Also, next time I do a live stream and something fails, I won't be able to scapegoat someone else's model. Right now that's my best excuse: I don't know, maybe the model changed, maybe not; it's probably a bug of mine. With an open model you could say with certainty: the model definitely did not change, therefore there's a bug in my code. Right now I have to keep it open as a possibility that the model changed and I have no idea.

Someone asks: what are the benefits of AWS versus local hosting if both ping OpenAI? Well, if you're hosting locally, you're not paying AWS. They do have a free tier, though I'm not even sure what the conditions are; you maybe get a limited amount per year. So there isn't necessarily a good or bad option here. Right now everything is hosted on Hugging Face Spaces, and I don't know what infrastructure Hugging Face uses to host all this; I doubt they have their own GPUs, so realistically the Hugging Face Space itself is probably hosted somewhere on AWS, which means this whole Buster thing is effectively already on AWS anyway.

Can we ask ChatGPT to host the site, and is Buster using the OpenAI API? Buster is using the OpenAI API, and no, you can't ask ChatGPT to host the site, but you can host it yourself.

Someone says they're very excited for some open-source models and appreciated the demonstration. Thank you! I'm also very excited for the open-source models, especially because you can fine-tune them. OpenAI does have a kind of fine-tuning API right now, but like I said, it's all closed source. People like to hack these models: with the weights you can do all sorts of different things and plug them into other systems; when you're only given embeddings, it only goes so far.

All right, I think that's most of the questions, so thank you everyone, that was really fun. Thank you What's AI for hosting today. If you have any more questions, ping me on Twitter or LinkedIn or wherever; we'll be posting all of these notebooks everywhere. This was really fun, so go ping Buster, go hack it, take it for your own applications, and I'll see you at maybe some other presentation.
Info
Channel: Jeremy Pinto
Views: 9,417
Id: LB5g-AhfPG8
Length: 67min 45sec (4065 seconds)
Published: Sun Feb 05 2023