RAG using Open Source LLMs

Video Statistics and Information

Captions
AI-powered chatbots are revolutionizing the business world, with every enterprise seeking to integrate this cutting-edge technology. Yet the issue of data privacy continues to haunt businesses, and the reason is obvious: using a closed-source large language model as your chatbot. This is where we have an exciting solution: using open-source large language models as a viable alternative. In this video we will implement a RAG pipeline using an open-source large language model with frameworks like LangChain and Hugging Face Transformers. So let's get started.

First, let's understand retrieval augmented generation with an example. I live in Bangalore, India, and in Bangalore we have a beautiful place called ISKCON Temple. For those who don't know, ISKCON Temple is a place where people worship Lord Krishna, a supreme god in the Hindu religion; you should visit it once. So what I do is ask a prompt: "At what time does ISKCON Temple open?" I already know the answer, the timings in the morning and in the evening, but I want to see what ChatGPT returns. When I ask this prompt, ChatGPT gives a vague-looking answer which is technically true: the opening hours of ISKCON temples can vary depending on the location, and obviously ISKCON is not just in Bangalore, there are temples in other states of India too. But what I actually need is the timing, and I can't find the timing in that response. So what I do is update the prompt with some context and then ask the same question: "At what time does ISKCON Temple open?" Now ChatGPT will look into the context and generate a response, adding its own intelligence. Note what I said: it looks into the context and it generates a response based on its intelligence. This is very important when we talk about RAG.

Let's understand what kind of prompt engineering you can do. We have zero-shot prompting, which is when you ask a direct question, and we have few-shot prompting, where you provide context along with the question, as sketched below.
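To make the zero-shot versus context-augmented distinction concrete, here is a minimal Python sketch of the two prompts from the temple example; the context string and the timings in it are illustrative placeholders, not values taken from the video.

```python
# Zero-shot: a direct question with no supporting context.
zero_shot_prompt = "At what time does ISKCON Temple open?"

# Context-augmented: the same question, but the model is told to answer
# from supplied context. Timings below are placeholders for illustration.
context = (
    "ISKCON Temple Bangalore opening hours: "
    "morning 4:15 AM to 1:00 PM, evening 4:00 PM to 8:30 PM."
)

augmented_prompt = f"""Answer the question using only the context below.

Context:
{context}

Question: At what time does ISKCON Temple open?"""

print(zero_shot_prompt)
print(augmented_prompt)
```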
But let's also understand the issues with large language models. One such issue is hallucination. Anyone who has worked with large language models knows what hallucination is: whenever GPT or any large language model returns a response that is out of context or not factual, it is treated as hallucination. Another issue is the knowledge cutoff: since the model is trained on a large corpus of data, there is a limit to how recent its training data is. If you ask ChatGPT a question like "Who won the Cricket World Cup 2023?", it probably won't answer, because the Cricket World Cup took place in November 2023 and ChatGPT doesn't have that context. This might vary if you're using GPT-4, because with a plugin installed you can access internet data, but here I'm not talking about internet data, I'm talking about the data GPT was trained on. One more issue is domain-specific factual responses. It's true that ChatGPT can answer many domain-specific questions; take medicine, for example. When we ask a medical question, ChatGPT will return a response, but factually it is not quite correct. It is somewhere near the actual answer, but there is still improvement to be made. This is where retrieval augmented generation comes in.

Let's take our example query: "At what time does ISKCON Temple open?" You can't pass this text directly to the retrieval step; rather, you need to convert it into a numerical format, and you can do this with an embedding model. What the embedding model does is convert your text into vector form, which is the embedding. Once you have the embedding, you look into your vector database and retrieve the relevant documents. What are these relevant documents? They could be the morning timing and the evening timing. So I ask my prompt, "At what time does ISKCON Temple open?", we convert that into embeddings, the vector database performs a semantic search, and it returns the retrieved documents, the retrieved context. This context you pass to the large language model along with the query, which is exactly what we did in few-shot prompting. Once you have both of these, you can use a large language model directly and it will generate a response. This is the normal flow of RAG. This particular diagram was taken from the Anyscale blog, which is a very nice post on building RAG for production; you can check it out later.

Let's understand how you can save your external data into a vector database. The external data can be a PDF, a document, a PPT, a YouTube URL, or a website. You need to extract the content and do data pre-processing: remove unwanted content, split the text into chunks, which are smaller pieces of your text, convert the chunks into embeddings, and store them in a vector database. Usually databases are just used to store things, but vector databases are special: they also perform different search techniques like similarity search, cosine similarity, nearest neighbor, and so on. Once you have the vector database, this is where the RAG pipeline starts.

Let's divide RAG into three different steps: retrieval, augmentation, and generation. In step one, retrieval, a user asks a query, we convert that query into an embedding, which is its vector form, look into the vector database, perform a search, and retrieve the context, which is your documents. In step two, augmentation, every large language model has its own prompt template, so you use that template to build your augmented prompt from the query and the retrieved context. In generation, you take the prompt, include the retrieved context, pass it to a large language model, and it generates a response. It might sound tricky, but let's look at it via code.

So let's get into the code demonstration; it's about time we start building the RAG pipeline using an open-source large language model. The only thing you need is a Google Colab notebook, so quickly open one. The use case we will build here is chat-with-website: our external data is a website URL, we store the entire website content in a vector database, and then we can ask any prompt against it. Let's start with the installation. We need LangChain, which is the end-to-end LLM framework we will use here; sentence-transformers, which is optional and mainly needed when you create embeddings for your text; and ChromaDB, which is the open-source vector database we will use in this code demo.
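A minimal install cell for the Colab notebook might look like this; huggingface_hub is added here as an assumption, since the Hugging Face Hub LLM wrapper used later typically depends on it.

```python
# Colab cell: install the libraries mentioned above.
# huggingface_hub is an extra assumption needed by the HuggingFaceHub LLM wrapper.
!pip install -q langchain sentence-transformers chromadb huggingface_hub
```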
I have attached the same diagram from the presentation: external data, chunking, vector embeddings, and vector database, four components, where the external data is my website. Then we have two more: the retriever and the large language model. So let's quickly import them all; there is some extra text in the notebook, let's remove it.

What are the first four components? From langchain.document_loaders let's import WebBaseLoader; this is my first component. The second component is also from LangChain: I need to do the pre-processing, which is nothing but chunking, and for chunking we have the text splitter, so from langchain.text_splitter let's import, and the Colab notebook will give you the suggestion, RecursiveCharacterTextSplitter. Next come the embeddings, from langchain.embeddings; since we are working with open source, let's use the Hugging Face inference embeddings, which is actually HuggingFaceInferenceAPIEmbeddings. The speciality of the Hugging Face Inference API embeddings is that once you pass your embedding model, it will not download it; rather, it runs on the cloud, so for this demo we can use the inference embeddings. The fourth component is vector stores, so from langchain.vectorstores import Chroma. These are our fundamental four components, and then we have our large language model, which is a Hugging Face model again, HuggingFaceHub. Then we have the retriever, and in order to use the retriever we need the chains: from langchain.chains import RetrievalQA (I always forget whether the QA is lowercase or capital; it's capital). So we have our six components, using which we will build our pipeline.

Before we proceed with the code, there is one thing we need to do, and that is to set up our Hugging Face access token. Why do we need an access token? To use HuggingFaceHub and the Hugging Face embeddings we are not loading the models locally; we are using an access token so that the models run directly on the Hugging Face cloud. So let's create a variable for the token, and to avoid making it publicly visible, I'll use the getpass library: from getpass import getpass, and call the function to read the token. Let's quickly get our Hugging Face token: first create your account on hf.co, then click on your profile image and select Settings. In Settings, on the left-hand side, you will see Access Tokens, and since this may be your first token, click on New token. It will ask you to enter a token name, I'll just write "rag open source", select the role to be write, and generate a new token. The token is generated; just copy it, come back to your Colab notebook, and run this piece of code. It will ask you to enter the token; paste it, and that's it. One more step: you need to save it in an environment variable, which is HUGGINGFACEHUB_API_TOKEN (I typed the casing wrong at first, my bad), and you just pass your access token to it. That's great.
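Put together, the imports and token setup described above look roughly like this, using the pre-0.1 langchain import paths the video appears to rely on; newer releases move some of these into langchain_community.

```python
import os
from getpass import getpass

# The six building blocks described above (older langchain import paths).
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFaceHub
from langchain.chains import RetrievalQA

# Read the Hugging Face access token without echoing it to the notebook,
# keep it in a variable for the embeddings, and export it as the environment
# variable that HuggingFaceHub looks for.
HF_TOKEN = getpass("Enter your Hugging Face access token: ")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN
```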
Now what we do is take our website URL. In our case I'll be using my own portfolio link, just to avoid copyright issues with somebody else's website, so my own site is the URL. We pass this URL to WebBaseLoader, and once you have the loader you also need to load the content, so let's call .load() and print what's inside. As you can see, we have my name, the about-me section, and my navbar; I'm not doing self-promotion here, just checking whether we are able to retrieve the whole document or not. We have about me, events, achievements, where I work, the communities I'm part of, GDE in ML, so yes, we are able to extract the content from the website.

The next step is chunking, which you can also call text splitting (let me add a subtitle in the notebook). Text splitting is nothing but taking whatever content you have and breaking it down into smaller segments. The reason we do chunking is that when you have a larger document, there may be repetition and diversity inside it, and chunking is a more convenient way to extract only the relevant pieces. So we create the text splitter, RecursiveCharacterTextSplitter; here we need to pass the chunk size, in our case 256, and there is one more very important argument called chunk_overlap. I'll explain the significance of chunk overlap; let me set chunk_overlap to zero for now, and we will change it later so you can see what this argument does. Let's run it and do the chunking with text_splitter.split_documents, passing in my content. Done; let's check the number of chunks, and it's 33.

Now I'll show you the importance of chunk_overlap being zero. Let's look at the chunk at index 3: as you can see, it starts with "communities". Let's also take index 2: my second chunk starts with "hello my name is Tarun" and says I'm passionate about machine learning, image processing, and deep learning and have published over 80 blog articles, so there is some context, and the chunk ends mid-sentence. The third chunk then starts with "communities", so the full sentence is actually split across the two chunks: "published over 80 blog articles documenting my journey, and I'm actively involved with various communities".
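As a sketch, the loading and chunking steps above would look like this; the URL is a placeholder for the portfolio site used in the video.

```python
# Load the website content; replace the placeholder URL with the site you want to chat with.
loader = WebBaseLoader("https://your-portfolio-site.example.com")
data = loader.load()

# chunk_size=256 with no overlap first, as in the walkthrough; chunk_overlap is revisited below.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=0)
chunks = text_splitter.split_documents(data)

print(len(chunks))               # 33 chunks for the site used in the video
print(chunks[2].page_content)    # inspect adjacent chunks to see where they split
print(chunks[3].page_content)
```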
Now let me change chunk_overlap to 50 so you can see the difference. Previously, chunk 2 ended at "various" and chunk 3 started with "communities". Now observe the change: chunk 3 no longer starts with "communities"; it starts with "journey and I'm actively involved with various communities". Those tokens, "journey and I'm actively involved with various", were already present at the end of the previous chunk. So what chunk overlap does is repeat the tail of the previous chunk at the beginning of the next one, so that context is carried over between chunks. That's what chunk_overlap does; let's proceed with the code.

Now we'll add our embedding model. How do we choose one? There is an embedding leaderboard on Hugging Face (the MTEB leaderboard) where you can check which embedding models are performing better and decide based on that. I've already decided which embedding model to use, so let me quickly initialize it: HuggingFaceInferenceAPIEmbeddings, where I need to pass my API key, which is the HF token, and the model name, which is BAAI/bge-base-en-v1.5. That's it; we have our embedding model. Next we need to pass the chunks through the embedding model and save the result in a vector database. How do we do this? We create a vector store with Chroma; from_documents is the function, and into from_documents I pass the chunks along with my embedding model. What this statement does is take whatever chunks I have, create embeddings for them, and store them in ChromaDB, our vector database. It's going to take a few seconds, so until then let's proceed.

Let's add our large language model, which is a HuggingFaceHub model; let's pick a good model with a lower chance of hallucination. The vector store finished in the meantime. The best open-source model I can think of right now is Zephyr, which is a fine-tuned version of Mistral; the repo ID is HuggingFaceH4/zephyr-7b-alpha. I also need to define a few keyword arguments, model_kwargs. The major one is temperature: you can set it to 0 if you don't want any randomness, or 0.9 if you want creativity; for this case I'll use 0.5, which is a normal default. I also set max_new_tokens, the number of new tokens to generate, which in our case should be around 512, though you could also make it 1024; and there is one more hyperparameter, max_length, which you can set to 64 or so. Let's run our model, and as you can see nothing is being downloaded, which is exactly why we use access tokens.
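Sketched out, the embeddings, vector store, and LLM setup above might look like this; the specific max_new_tokens/max_length values are the rough numbers mentioned in the walkthrough, not tuned settings.

```python
# Embedding model served via the Hugging Face Inference API (nothing downloaded locally).
embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=HF_TOKEN,
    model_name="BAAI/bge-base-en-v1.5",
)

# Embed the chunks and store them in the Chroma vector database.
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings)

# Open-source LLM: Zephyr 7B Alpha (a fine-tune of Mistral 7B) run via the Hugging Face Hub.
llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-alpha",
    model_kwargs={
        "temperature": 0.5,     # 0 = deterministic, ~0.9 = more creative
        "max_new_tokens": 512,  # length of the generated answer
        "max_length": 64,
    },
)
```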
Now we need the retriever, and you can use the Chroma retriever itself: vectorstore.as_retriever(), so our vector store acts as a retriever that can perform various search techniques. We can also define what kind of search type we need; if you look at the arguments, search_type defines the type of search the retriever should perform, and the options are similarity, MMR, which if I'm not wrong is maximal marginal relevance, and similarity score threshold. You can pick any of these search techniques; the default is similarity, so you could keep it the same, but I'm not that guy, I'll use MMR. There is one more very important keyword, k: how many relevant documents you want to retrieve. We will look at how the k value matters. So you create search_kwargs, and inside it you set k, the number of documents to retrieve; I'll just set it to one.

Now let's verify it. You can ask any query; I'll ask "Who is Tarun?" and see what kind of relevant documents I can extract. So I'll write docs = retriever.get_relevant_documents(query), fixing my usual spelling mistakes along the way, and check what gets retrieved. As you can see, AI Planet is mentioned and my name also appears in this chunk. Now let's change the k value to 2 so you can observe the change: previously I could see only one page content, one document; when I run this now, I see two page contents. The second page content does have my name, but it's not relevant. That is why, since there is very little content on my website, I'm keeping the k value at one. If you have a document such as a PDF that is very complex and very lengthy, you can keep the k value somewhere around 5 to 10, and you can decide that based on how much repetition there is in your document. Since my content is small, I'm using one; I'll repeat it: if you have very complex data, keep k around 5 to 10. Let's rerun all these cells.

So far we are doing well, but we have not yet connected the LLM with the retrieved documents, and this is where we need a chain; that chain is RetrievalQA. With RetrievalQA we create a chain where we need to define the chain type, our LLM, and our retriever, and our retriever is the one we just defined. What kind of chain type? There are four different techniques: stuff, refine, map_rerank, and map_reduce. Refine means an initial answer is generated and the LLM then refines that answer over the retrieved documents; map_rerank re-ranks the responses based on the retrieved context. For now we can set it to stuff; you could also make it refine, and in most cases I use refine, but the issue with refine is that it takes too much time to generate the response, so I'll show the response with stuff and you can try refine yourself. With that, we are able to run the chain.
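A sketch of the retriever and the RetrievalQA chain described above, assuming the vectorstore and llm objects created earlier:

```python
# The vector store doubles as a retriever; MMR search, returning a single chunk (k=1).
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 1},
)

# Quick sanity check on what gets retrieved for a query.
query = "Who is Tarun?"
docs = retriever.get_relevant_documents(query)
for doc in docs:
    print(doc.page_content)

# Tie the LLM and the retriever together with a simple "stuff" chain.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)
```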
The only thing left is the augmentation. So far, this retriever part is step one, retrieval: you use a vector store and create a retriever from it. The chain part is step three, generation. Step two, augmentation, we'll add here: we need to create the augmented template we discussed in the slides, and augmentation is basically your prompt, so let's set up the prompt. For each model there is a certain prompt template you need to follow, so let's use a template here. The template starts with the <|system|> tag, followed by the system prompt, and it is closed with </s>. My system prompt will be: "You are an AI assistant that follows instructions extremely well. Please be truthful and give direct answers." After the system prompt we have the <|user|> tag, and the user prompt is nothing but the query the user will ask, so I'll make this an f-string and pass in my query, "Is Tarun a GDE?". I close that section as well and then add the <|assistant|> tag, which is where the chatbot's response goes; it stays empty because that is where the large language model will generate. So we have the system prompt, the user prompt, and the empty assistant prompt, and this becomes my prompt.

Now I need the response, and for the response I pass my prompt into the chain; hopefully it doesn't give me any error. That's it, so let's print the response. To print just the answer we need to pick out the result key, because the output also carries the query. When you run this, you can see the output contains the query, which is nothing but your prompt template, and one more key, result, inside which you see the response: "Yes, Tarun Jain is a Google Developer Expert in machine learning. This title is awarded to individuals who have demonstrated exceptional technical knowledge and expertise in a specific field and who have made significant contributions to the developer community. Tarun has been recognized for his work in machine learning and deep learning and has contributed to various open-source projects, sharing through his blog and other platforms."

This is what Zephyr can do; this is the power of open-source large language models. You can experiment with different prompts and explore this RAG pipeline further, and one more addition you can make is text re-ranking, together with a much bigger PDF as the data source. This Colab notebook will be attached in the description; please do follow, happy learning, and thank you so much.
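A sketch of the Zephyr-style prompt and the final chain call, assuming the qa chain from the previous step; the older callable-chain style (qa(prompt)) is assumed here, which returns a dict containing the query and the result.

```python
# Zephyr-style chat template: system prompt, user query, and an empty assistant slot.
query = "Is Tarun a GDE?"

prompt = f"""<|system|>
You are an AI assistant that follows instructions extremely well. Please be truthful and give direct answers.</s>
<|user|>
{query}</s>
<|assistant|>
"""

# Run the RetrievalQA chain; the returned dict has "query" and "result" keys.
response = qa(prompt)
print(response["result"])
```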
Info
Channel: AI With Tarun
Views: 8,740
Keywords: AI, generative ai, langchain, llm, large language models, huggingface, open source llms rag, rag, retrieval augmented generation, transformers, python, chatbot, chatgpt
Id: dUkiQ_WI92c
Length: 34min 53sec (2093 seconds)
Published: Wed Jan 03 2024