Chatbots with RAG: LangChain Full Walkthrough

Captions
Today we're going to take a look at how we can build a chatbot using retrieval augmented generation (RAG) from start to finish. We're literally going to start with the assumption that you don't really know anything about chatbots or how to build one, but by the end of this video what we'll have is a chatbot, using OpenAI's GPT-3.5 model and the LangChain library, that is able to answer questions about more recent events, or about our own internal documentation in an organization, for example, which a model like GPT-3.5 or GPT-4 cannot do on its own. The way we enable that is through retrieval augmented generation.

To get started, let me take you through what we're actually going to be building at a very high level. What you can see here is what we call a RAG pipeline, a retrieval augmented generation pipeline. In a typical scenario with an LLM, we take a query and feed it straight into the LLM, and we get some output. That's fine in some cases, but not in others. For general knowledge question answering, or for knowledge the LLM has seen before, this works relatively well, but the problem is that LLMs have not seen a lot of the information we'd like them to understand. For example, in this question I'm asking what makes Llama 2 special, and most LLMs at the time of recording will not be able to answer that, because Llama 2 is a recent language model. Most LLMs were trained on data that contained no information about Llama 2, so they have no idea what it is; they'll typically tell you something about actual llamas, the animal, or they'll just make something up. We obviously don't want that to happen, so we use this retrieval augmented generation pipeline, and it's this pipeline that I'm going to teach you how to build today.

Here's an example of an LLM not knowing what you're talking about, even though you'd expect a model that is, in this case, good at programming to give you the correct answer. LangChain is probably the most popular library for generative AI, usually used with Python; there's also a JavaScript version, and maybe some other languages as well. But when GPT-4 was first released, I asked it in the OpenAI Playground how to use the LLMChain in LangChain (the LLMChain being a basic building block of LangChain), and it told me that "LangChain is a blockchain-based platform that combines artificial intelligence and language processing" and that "LLMChain is a token system used in LangChain". All of that is completely false; none of it is true. It just made everything up. This is a hallucination, and the reason we get this hallucination is, as I mentioned, that an LLM's knowledge is just what it learned during training. It has no access to the outside world.

Now let's jump straight in and actually build a chatbot that has this limitation, see how we build it (it's pretty easy), and play around to see that limitation in action. We're going to be running through this notebook; there'll be a link to it at the top of the video. We just start by doing a few pip installs.
We need LangChain, OpenAI, Hugging Face Datasets, the Pinecone client, and tiktoken. That's basically all we need for the whole chatbot-plus-RAG project; if we weren't doing RAG we'd need even less, but we are going to use RAG here. I'm relying very much on the LangChain library, and what we do is import this ChatOpenAI object. It's compatible with the GPT-3.5 and GPT-4 models from OpenAI; essentially it's a chat interface, an abstraction in LangChain for using GPT-3.5 or GPT-4. You can also use those models directly via the OpenAI API, but when we start building more complex AI systems LangChain can be very useful, because it has all these additional components we can just plug in, so we can add things like a RAG pipeline very easily. That's why we use it.

Here we initialize our chat model. We're going to give it some objects, and it will format them into the type of structure you can see here, which is typical of OpenAI chat models: a system prompt at the top, which is basically your instructions to the model, then your user query, the AI assistant's reply, the user again, and so on, continuing back and forth. That is what your chat log is going to look like. Via the OpenAI API directly, it would be a list of dictionaries, each containing a role and the content, which is the text; that's exactly what you'd pass to the OpenAI chat completion endpoint. LangChain uses a slightly different format based on the same thing, a very thin abstraction layer: you have SystemMessage, HumanMessage, and AIMessage. System ties back to the system role, human to the user role, and AI to the assistant role, and each carries its content. So that's the LangChain version of what I've just shown you.

Let's initialize all of that and pass those messages to our ChatOpenAI object. We run it, it takes a moment, and we get this response telling me about string theory, which is what I asked about. I don't know if it's entirely accurate, maybe it's hallucinating, but this is the sort of thing it probably does know. We can print it out a little more nicely, and it gives us a numbered format that's easier to read.

Now, if we take a look at what this response is, it's an AIMessage. So when we're building up our chat log, all we need to do is append this AIMessage to our messages list to continue the conversation. That's what I'm doing here: I append the response, then I create a new prompt with another question. Notice that I'm not asking "why do physicists believe string theory can produce a unified theory"; I'm asking "why do physicists believe it can produce a unified theory". So the chat model must rely on the conversational history, the previous messages we've sent, and that's why we need to add the response to our messages before adding the new prompt. Then we send all of those over to the model; it's GPT-3.5, the same model that powers ChatGPT.
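For reference, here is a minimal sketch of what that setup might look like in code, assuming the LangChain and OpenAI client versions from around the time of the video (import paths have since moved in newer LangChain releases); the API key is a placeholder:

```python
# pip install langchain openai datasets pinecone-client tiktoken
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage, AIMessage  # replies come back as AIMessage

chat = ChatOpenAI(
    openai_api_key="YOUR_OPENAI_API_KEY",  # placeholder, not a real key
    model="gpt-3.5-turbo",
)

# LangChain's thin abstraction over the OpenAI chat roles:
# system -> SystemMessage, user -> HumanMessage, assistant -> AIMessage
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Can you explain string theory to me?"),
]

res = chat(messages)  # returns an AIMessage
print(res.content)

messages.append(res)  # keep the reply so the model has the history
messages.append(HumanMessage(
    content="Why do physicists believe it can produce a unified theory?"
))
res = chat(messages)
print(res.content)
```

Appending each AIMessage back onto the list is what gives the model its conversational memory; nothing is stored on the model's side between calls.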
And you can see straight away it mentions that physicists believe string theory has the potential to produce a unified theory, and so on. So it definitely has that conversational history in there. We now have a chatbot, and that was pretty easy to put together; nothing complicated going on there.

Now let's talk a little more about hallucinations and why they happen. One of the many reasons LLMs hallucinate is that they have to rely solely on knowledge learned during training. That means an LLM essentially lives in a world made up entirely of whatever was in its training data. It doesn't understand the world by going out and seeing it; it understands the world by looking at its training dataset, and that's it. If some knowledge is not in that training dataset, that knowledge is definitely not in the LLM. And even if it is in the training data, it might not have made it into the model, or it might not be stored very well, or it might be misrepresented; you don't really know. The whole point of an LLM is to compress whatever was in its training data into an internal model of the world as it appeared in that dataset. That obviously causes issues, because the model has no access to anything else, and that's what we want to fix with RAG.

The little box in the middle of this diagram can be many things: a RAG pipeline, a Google search, access to a SQL database, or many other things. What it represents is some sort of connection to the external world; not the entire world, just some subset of it. That's what we want to enable. Without it, the LLM only understands the world as it was in its training data. We refer to that knowledge as parametric knowledge, because it is stored within the model's parameters, and those parameters only ever change during training; they are frozen afterwards. So we have that kind of brain on the left with only parametric knowledge, but with RAG we can add a long-term memory component that we can actually modify.

In the case of RAG, that external knowledge base, that external memory, is a vector database. The good part of having a database as a form of input into your LLM is that you can add, delete, and generally manage the memory, the knowledge, of your LLM, which in my opinion is kind of cool. It's almost like plugging into a person's brain and being able to manage or update the information they have in there, which sounds a little dystopian, but it's a good parallel to what we're doing with LLMs. We call this source knowledge rather than parametric knowledge, because the knowledge is not stored in the parameters of the model. Instead, source knowledge refers to anything we insert into the LLM via the prompt: any information that goes in through the prompt is source knowledge.
When we're adding that source knowledge to our LLM, the prompt is going to look something like this: we typically have some instructions at the top, the user's input, which is the actual question, at the bottom, and then the external information, the source knowledge we're inserting, in the middle. We call these either contexts or documents (they go by a lot of names, actually), but let's call them contexts in this case. That is what we'll be adding into our prompts.

Before we build the whole RAG pipeline to do this, let's try inserting a context ourselves and see what effect it has on model performance. We add another message: "What is so special about Llama 2?" And the model tells us: "I apologize, but I'm not familiar with a specific reference to Llama 2. It's possible that you might be referring to something specific within a certain context or domain. Could you please provide more information or clarify your question?" So the model cannot answer this question. Actually, I think the OpenAI team have added this behaviour, because in the past, if you asked about Llama 2, it would tell you about llamas, or fully hallucinate, giving you an answer that's completely wrong. I think they've probably seen people asking about Llama 2, or saw the Llama 2 release, and added some sort of guardrail that essentially tells the model: when someone asks about this, say you don't know. Unless they've been training it on incoming data, but I don't think they have.

Let's try another one: "Can you tell me about the LLMChain in LangChain?" I asked this earlier, and here's another example of something they've modified: it says it couldn't find any information specifically about LLMChain in LangChain, with the same structure to the response. So I'm relatively sure this is a hard-coded guardrail that OpenAI have put in there; they've added it for LangChain, for Llama 2, they've clearly added it for a few things.

So let's try the source knowledge approach. I got this information by Googling "LLMChain in LangChain", going to the LangChain website, and pulling in a few bits of information. They're actually quite long; basically I have some information about LangChain, some about chains, and some about the LLMChain. What I'm going to do is concatenate all of those together to give us our source knowledge, and then, as you saw before with that structured prompt, put everything together and see what we get.
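The manual source-knowledge injection described here might look roughly like this; the snippet text is an illustrative stand-in, not the exact passages pulled from the LangChain docs:

```python
# Stand-in snippets for text copied from the LangChain documentation
llmchain_information = [
    "Chains allow us to combine multiple components together into a "
    "single, coherent application...",
    "An LLMChain is the most common type of chain. It consists of a "
    "prompt template, a model, and an optional output parser...",
]
source_knowledge = "\n".join(llmchain_information)

query = "Can you tell me about the LLMChain in LangChain?"

# instructions at the top, contexts in the middle, query at the bottom
augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

messages.append(HumanMessage(content=augmented_prompt))
res = chat(messages)
print(res.content)
```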
So we create this prompt; I can show it to you quickly by printing the augmented prompt. You have the instructions, then our contexts, and then the query. I feed that into our chatbot, and here's what we get: "LLMChain, in the context of LangChain, refers to a specific type of chain within the LangChain framework. The LangChain framework is designed to develop applications powered by language models, with a focus on enabling data-aware and agentic applications..." It's almost a copy and paste from the website itself, but obviously formulated in a way that's much easier to read and specific to the question we asked: "In this context, an LLMChain is the most common type of chain used in LangChain." There we go; we have loads of information here, and as far as I know it's all accurate. We got a very good answer just by adding some text to the prompt.

But are we always going to manually add text into our prompt like we just did? Probably not; it kind of defeats the point of what we're trying to do here. Instead, we want a way to do what we just did, but automatically, and at scale over many, many documents, which is where RAG comes in. Looking back at the diagram: what we just did, putting the context straight into the prompt, essentially skipped the retrieval portion. We created a retrieval augmented query by pulling in our own context, feeding it into the LLM, and getting our answer. So now all we need to do is figure out the retrieval component, and it's really not that complicated, as you'll see very soon.

The first part of setting up that pipeline is getting our data. We're going to download the dataset from Hugging Face; you can even see it on the website by going to huggingface.co followed by the dataset path, or searching for it in the Hugging Face search bar. It's a dataset I scraped a little while ago from the Llama 2 arXiv paper and other arXiv papers related to Llama 2. It's not very clean, and it's not a huge dataset, but I think it's pretty useful for this example; you can see the chunks of text I've pulled from it. We're going to use that dataset to create our knowledge base.

For the knowledge base we'll be using a vector database, as I mentioned: Pinecone. For that we need an API key, so head over to app.pinecone.io; if you don't have an account or aren't logged in, you'll need to create an account or log in, and it's free. If you don't have any indexes already, you should see a screen where you could create an index directly, but we're going to do it in the notebook. Go to API Keys, copy your API key, and note your environment; mine is us-west1-gcp. So in the notebook we set the environment to us-west1-gcp and paste in the API key.

I've already run this part, so moving on: here we initialize our index. We're going to use text-embedding-ada-002, an embedding model from OpenAI. When using that model, the embedding dimension matters: think of it as the size of the vectors, where the vectors are numerical representations of meaning, human meaning, that we get from some text. It's the size of the vectors that ada-002 outputs, and therefore the size of the index that will store those vectors, so we need to make sure the dimension is aligned with whatever model we're using. The metric is also important, but less so: most embedding models work with cosine, but there are occasionally some where you should use Euclidean or dot product instead.
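A sketch of the data and index setup, assuming the pinecone-client v2 API that was current at recording time; the dataset path and index name are my guesses at what the video uses, so treat both as assumptions:

```python
import time

import pinecone
from datasets import load_dataset

# Dataset path is an assumption based on the description in the video
dataset = load_dataset(
    "jamescalam/llama-2-arxiv-papers-chunked", split="train"
)

pinecone.init(
    api_key="YOUR_PINECONE_API_KEY",  # from app.pinecone.io -> API Keys
    environment="us-west1-gcp",       # your environment may differ
)

index_name = "llama-2-rag"  # hypothetical index name
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,   # output size of text-embedding-ada-002
        metric="cosine",
    )
    # wait for the index to finish initializing (typically under a minute)
    while not pinecone.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pinecone.Index(index_name)
print(index.describe_index_stats())  # total_vector_count should be 0 here
```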
So we run that, and then we just need to wait for the index to initialize. It typically takes around 30 to 40 seconds, though that depends on the Pinecone tier and the region, the environment, you're using, so it can vary a little; I wouldn't expect more than a minute or two at most. I'll jump ahead and let that finish.

Once it's finished, we connect to the index, and then we just want to confirm that we have connected: the total vector count, at least for now, should be zero, because we haven't added anything yet; it should be empty.

Now let's initialize an embedding model. Like I said, we're using ada-002. We could use the OpenAI API for that directly, or we can initialize it from LangChain like this, which is what we'll do here. Then I'm going to create some embeddings for what we're calling documents; "documents" here is equivalent to the contexts I was referring to earlier, basically a chunk of text that we store and refer to as part of our knowledge base. Here we have two of those documents, or contexts, and if we embed them, what we get back is two embeddings, each a 1536-dimensional embedding output by ada-002. So that's how we do the embedding.

Now we move on to iterating over our entire dataset, the Llama 2 arXiv papers, and doing that embedding for everything. We do the embedding here, extract key information about each record (the text, where it's coming from, and the title of the paper it comes from), and then we add all of those into Pinecone. One other thing we do is create some unique IDs for each record.

When we're going through this loop (let me just run it), we do it in batches. We can't do the whole thing at once: we have around 4,800 chunks, and if we tried to get embeddings for all of them at once, we'd be creating around 4,800 1536-dimensional embeddings and receiving them over a single API call, which most providers won't allow, as far as I'm aware. Even OpenAI's ada-002 will probably error out if you send too many things to embed at one time; although OpenAI have probably added safeguards on their side, so you might not run into issues there, in terms of getting the information to OpenAI, back from OpenAI, and then into Pinecone, you probably will run into problems if your batch size is too high. So we keep the batch size moderate.

That's almost ready. Once it is, you can come down here and describe the index stats again (let me rerun it to make sure), and you should see that we now have 4,836 vectors, or records, in our index. With that, we have a fully fledged vector database, a knowledge base we can refer to for getting knowledge into our LLM.
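The embedding step and batched upsert loop might look like this; the dataset column names (doi, chunk-id, chunk, source, title) are assumptions about the schema, and the batch size of 100 is just a reasonable example value:

```python
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

# embed_documents encodes a batch of texts; embed_query encodes one string
texts = [
    "this is the first chunk of text",
    "then another second chunk of text is here",
]
res = embed_model.embed_documents(texts)
print(len(res), len(res[0]))  # -> 2 1536

# Batched upsert into Pinecone; column names below are assumed
data = dataset.to_pandas()
batch_size = 100

for i in range(0, len(data), batch_size):
    batch = data.iloc[i:i + batch_size]
    # unique ID per chunk, built from assumed paper DOI + chunk number
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    embeds = embed_model.embed_documents(batch["chunk"].tolist())
    # metadata stores the raw text plus where it came from
    metadata = [
        {"text": x["chunk"], "source": x["source"], "title": x["title"]}
        for _, x in batch.iterrows()
    ]
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```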
The final thing we need to do is finish the RAG pipeline and connect that knowledge base up to our LLM, and then we're done. So let's jump into that. We come down here and initialize a vector store object back in LangChain. It depends on what you're doing, but often you'll want to use Pinecone via LangChain when you're using it with LLMs, because there are a lot of ways you can connect the two. In this example we don't actually use those other components; I'm just going to put the retrieved information straight into the prompt, as you'll see. But a lot of the time you'll want to initialize the vector store object and use it with other components in LangChain. So I initialize it there: you pass in your index and the embedding model, specifically the embed_query method, which embeds a single chunk of text, rather than the embed_documents method, which encodes a batch of many chunks. One important thing here is the text field: text_field is the metadata field we set up earlier that contains the text we'd like to retrieve, so we specify that as well.

Now let's ask the question: "What is so special about Llama 2?" We saw earlier that the model couldn't answer this, but now, if we take that query, pass it into our vector database (our vector store, in this case), and return the top three most semantically similar records, we can see that we get chunks from the Llama 2 paper; you can see the title here. These are pretty hard to read, to be honest. Even this one, "our human evaluations for helpfulness and safety may be suitable substitutes for closed-source models", I can just about make out. But what we'll see is that LLMs can actually parse that information relatively well. I'm not saying it's perfect, but they can. So we have these three documents, these three chunks of information, hard to read; let's let our LLM deal with that.

I'm going to set up this augment_prompt function: we take the query, do what I just did there, retrieving the top three most relevant items from the vector store, use those to create our source knowledge (you might recognize this code from earlier, where we did it manually), and then feed all of that into an augmented prompt and return it. Let me run that and augment our query: "Using the contexts below, answer the query." You can see we have these contexts ("in this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models", plus a ton of other stuff), and we have the query, "What is so special about Llama 2?" This is now our augmented query that we can pass into our chatbot. So let's try it: we create a new human message as before, append it to our chat history, and feed that in. Remember, the question here is "What is so special about Llama 2?"
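Putting that together, the retrieval step and the augment_prompt function could be sketched like this, reusing the index and embedding model from above (the 2023-era LangChain Pinecone wrapper is assumed):

```python
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field that holds the raw chunk text
vectorstore = Pinecone(index, embed_model.embed_query, text_field)

def augment_prompt(query: str) -> str:
    # retrieve the top 3 most semantically similar chunks
    results = vectorstore.similarity_search(query, k=3)
    source_knowledge = "\n".join(x.page_content for x in results)
    return f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

query = "What is so special about Llama 2?"
messages.append(HumanMessage(content=augment_prompt(query)))
res = chat(messages)
print(res.content)
```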
And it says: "According to the provided context, Llama 2 is a collection of pretrained and fine-tuned large language models" (I read that earlier, actually) "developed and released by the authors of the work. These LLMs range in scale from 7 billion to 70 billion parameters. They are specifically optimized for dialogue use cases and outperform open-source chat models on most benchmarks tested." That's pretty cool. "What makes Llama 2 special is that the fine-tuned LLMs..." and then there's a slightly messed-up bit here, I think it's a Llama-related token but it's a bit of a mess, "...are designed to align with human preferences, enhancing their usability and safety. This alignment with human preferences is often not easily reproducible or transparent in closed-source models, limiting progress in AI alignment research." Additionally, and so on, the Llama 2 models appear to be on par with some of the closed-source models in terms of helpfulness and safety. So it's giving us a good answer; I don't want to read the whole thing, although I almost did.

Let's continue with more Llama 2 questions, and I'm going to try without RAG first. Consider that we've just asked the question and got all that information back, which is now stored in the conversation history, so the LLM will at least know what Llama 2 is; let's see if it can use that to answer our next question: "What safety measures were used in the development of Llama 2?" I'm not performing RAG on this specific query, but the model does have some information already, and it says: "In the provided context, the safety measures used in the development of Llama 2 are mentioned briefly: a detailed description of their approach to fine-tuning safety. However, the specific details of these safety measures are not mentioned in the given text." So it's saying it doesn't know: even though we've told it what Llama 2 is and given it a fair amount of context about Llama 2, it still can't answer the question.

So let's avoid that: we augment our prompt, feed that in instead, and see what we get. "Based on the provided context, the development of Llama 2 involved safety measures to enhance the safety of the models. Some of the safety measures mentioned in the text include:" and then it gives us a list of items. Safety-specific data annotation and tuning, specifically focused on training. It also infers something from that: it "suggests that training data and model parameters were carefully selected and adjusted to prioritize safety considerations". Red teaming: it tells us that "red teaming refers to a process in which external experts or evaluators simulate adversarial attacks on a system to identify vulnerabilities and weaknesses", so almost like safety stress-testing the model, I would say. And iterative evaluations: "the mention of iterative evaluation suggests that the models underwent multiple rounds of assessment and refinement; this iterative process likely involved continuous feedback and improvements to enhance safety aspects". The impression I get from this answer is that the paper mentions this iterative process without really going into details, so the model is figuring out what that likely means. So we get a much better answer there, and it continues.
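The follow-up exchange, with retrieval applied to the new query rather than relying on chat history alone, would then look something like this sketch:

```python
# keep the previous RAG answer in the conversation history
messages.append(res)

# augment the follow-up query too, so retrieval happens for every question
query = "What safety measures were used in the development of Llama 2?"
messages.append(HumanMessage(content=augment_prompt(query)))
res = chat(messages)
print(res.content)
```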
Like I said, you can take a look at this notebook yourself and run it; I think it's very clear what sort of impact something like RAG has on the system, and also how we implement it. Now, this is what I would call naive RAG, almost the standard RAG: it's the simplest way of implementing RAG, and it assumes there's a question in every single query, which is not always going to be the case. You might say "hi, how are you?", and obviously your chatbot doesn't need to refer to an external knowledge base to answer that. That is one of the downsides of this approach.

But there are many benefits. We get much better retrieval performance: we get a ton of information in there, we can answer many more questions accurately, and we can cite where we're getting that information from. This approach is also much faster than alternative RAG approaches like using agents. And we can limit the number of tokens we feed back into the LLM by setting a similarity threshold, so that we're not returning things that are obviously irrelevant (there's a short sketch of this after the transcript). Doing that helps mitigate one of the other issues with this approach, which is token usage and cost: we're feeding far more information into our LLM, which slows it down a little and costs more, especially if you're using OpenAI and paying per token. And if you feed too much information in, the LLM's performance can actually degrade quite a bit, especially when it's trying to follow instructions. So there are always those things to consider as well. But overall, if done well, by not feeding too much into the context window, this approach is very good.

When this approach doesn't fit but you still need that external knowledge base, you can look at RAG with agents, which I've spoken about in the past, or RAG with guardrails, which I've spoken about very recently. Both are alternative approaches with their own pros and cons, but effectively you get the same outcome. That's it for this video. I hope this has been useful in introducing the idea of RAG with chatbots, and in seeing how all of these components fit together. For now I'm going to leave it there, so thank you very much for watching, and I will see you again in the next one. Bye.
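The similarity-threshold filtering mentioned in the transcript could look something like this, reusing the vector store from earlier; the 0.75 cutoff is an arbitrary example value, not one from the video:

```python
# Keep only chunks whose similarity score clears a cutoff, so obviously
# irrelevant text never reaches the LLM. With Pinecone's cosine metric,
# higher scores mean more similar.
results = vectorstore.similarity_search_with_score(query, k=3)
relevant = [doc for doc, score in results if score >= 0.75]

if relevant:
    source_knowledge = "\n".join(doc.page_content for doc in relevant)
    # ...build the augmented prompt exactly as before
else:
    # nothing relevant retrieved; fall back to the plain, unaugmented query
    source_knowledge = ""
```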
Info
Channel: James Briggs
Views: 93,682
Keywords: python, machine learning, artificial intelligence, natural language processing, semantic search, similarity search, vector similarity search, vector database, retrieval augmented generation, retrieval augmented generation tutorial, pinecone vector database, vector search, langchain, langchain chatbot, large language models, chatbot python, chatbot rag, chatbot ai, rag tutorial, james briggs, openai gpt-3.5-turbo, openai chatbot, chatbot full project, chatbot full tutorial, ai
Id: LhnCsygAvzY
Length: 35min 53sec (2153 seconds)
Published: Wed Sep 20 2023