Build and Run a Medical Chatbot Using Llama 2 on a CPU Machine: All Open Source

Captions
Hello everyone, welcome to the AI Anytime channel. In this video we are going to develop a medical chatbot using the Llama 2 model. Llama 2 was released by Meta AI last week, and since then the open-source community has worked on quantization and related ops to make it runnable on compute-limited devices like a CPU machine. That's what I'm going to show you here: how you can take a quantized model from Hugging Face and build your own chatbot on top of your own data and knowledge bases. Let me first show you what I have built. On my screen you can see a very simple interface: "Hi, welcome to Medical Bot. What is your query?" This is what we are going to build from scratch — I'll show you how to build your own bot using Llama 2 or any other open-source large language model. The knowledge base is The Gale Encyclopedia of Medicine, a document of around 600+ pages. I created embeddings from it with an open-source embedding model and then wired it to a large language model with frameworks like LangChain and Chainlit so we can retrieve information from it. If I ask "What is eczema?", you can see the answer: eczema is a general term used to describe a variety of conditions that cause an itchy, inflamed skin rash; atopic dermatitis is a form of eczema that is a non-contagious disorder — something like that. We also get the sources: the document and its metadata. Notice the steps shown alongside the answer — RetrievalQA, stuff documents chain, and so on; I'll explain these as we develop. Less talking and more coding in this video, guys. Let me ask another question: "What is an antidiabetic drug?" You can see it starts the chain — RetrievalQA, the retrieval question-answering chain from LangChain. The framework we are using for the interface, Chainlit, sounds similar to Streamlit for building Python apps, but Chainlit is aimed specifically at building chatbots on top of a language model, and it provides a lot of features I'll explain while we write the code. If you click on a step it shows: using RetrievalQA, using stuff documents chain, using LLM chain — and you can watch the LLM chain running. It takes a little time, because the main agenda of this video is to build a project on top of Llama 2 and run it on a CPU machine — not everybody owns a GPU or has the VRAM — so retrieving an answer on a CPU machine takes a little time.
But this is what we're going to build. Let me first show you what Chainlit is. On its site it says Chainlit lets you create ChatGPT-like UIs — it helps you create conversational interfaces with ease. The GitHub repository link will be given to you; they have a lot of example apps you can take and develop on top of, plus integrations with LangChain, Pinecone, etc. Their docs are at docs.chainlit.io and you can find everything you need there — that's also where I took some references when I built this. Now, back in the terminal you can see it says batches 100%, and over in the UI: "Antidiabetic drugs are medicines that help control blood sugar levels in people with diabetes mellitus." Fantastic. You are running a quantized Llama 2 model on a CPU machine — and not just running it: we have built a custom, private GPT-like app over your own confidential data. If you want to run this inside your own infrastructure or network, you can — we are not using any APIs here, guys. You can see the quality of the bot, and of course you can beautify it later; I'll also give you links on how to clean up the sources display — Chainlit has a lot of features. This is going to be an exciting, fairly lengthy video, and as I said, less talking, more coding. If you just want the code, it will be available in the GitHub repository linked in the description; but if you watch the video you will learn a lot of nuances of this entire workflow, and I'm going to explain each line. So let's start building. You can see I have a folder on the desktop called llama2-demo. I'm going to open it in a terminal — I use Ubuntu, but you can use any operating system; the same applies on Windows with a PowerShell terminal or CMD — and then open it in VS Code by hitting `code .`. These are the minimal requirements to build and run this bot; you can ignore bitsandbytes, it is not required to run this on CPU.
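For reference, here is a plausible minimal requirements.txt for this stack — a sketch based on the packages used in the video, with no version pins; the exact pinned versions are in the video's GitHub repo:

```
# requirements.txt -- minimal sketch; pin versions from the repo for reproducibility
langchain
chainlit
ctransformers
sentence-transformers
faiss-cpu
pypdf              # needed by PyPDFLoader
```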
Next, we need to download a model. The one I have is called llama-2-7b-chat.ggmlv3.q8_0.bin, and it's by TheBloke — Tom Jobbins. We have to give credit to Tom Jobbins, aka TheBloke, for all the great work he does with LLM quantization, fine-tuning, and ops on Hugging Face; he has been very, very instrumental in empowering the open-source community. Thank you so much, Tom, if you're watching this by any chance. This is the model we are going to use: on the Llama-2-7B-Chat-GGML repo, under Files and versions, you can see the quantized .bin files — LFS means large file storage — and you can click and download the q8_0 one from there; it will take a little while. That's the model sitting in my folder, around 7.2 GB. I assume you need a minimum of about 13 GB of RAM to run it; I have 16 GB in my system, which should be decent enough. You can maybe try 8 GB as well, but that will probably crash your system after one question, so you'd have to offload the weights. I also have the requirements.txt I showed you, and a data folder containing the file from earlier. You can use any other data — one file or a thousand files, depending on your compute power and what you want to do with this workflow. I'm going to rely on this one file, which is itself huge: 600+ pages.
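If you'd rather script the download than click through the browser, here is a minimal sketch using the huggingface_hub library — note that the repo ID and filename are my reading of what's shown on screen, so double-check them against the model page:

```python
# download_model.py -- sketch; repo_id/filename assumed from what the video shows
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",     # TheBloke's quantized repo (assumed)
    filename="llama-2-7b-chat.ggmlv3.q8_0.bin",  # 8-bit GGML file, ~7.2 GB
    local_dir=".",                               # save next to ingest.py / model.py
)
print(f"Model saved to: {model_path}")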
Now let me show you how this works and what we need, step by step. I'm going to build a med bot using Llama 2. We need the Llama 2 model, but we can't take it directly from Meta AI and run it on a CPU machine — that will probably not work. We need a quantized model — quantization comes in different flavors, four bits, eight bits, and so on — and we are going to take it from TheBloke on Hugging Face (HF); the link is in the description, and I've just shown you how to download it. Once you have the model, you have to load it. Normally we'd load a Hugging Face model with Transformers, but to run it on a CPU machine we have to use something called CTransformers instead — Python bindings for transformer models implemented in C/C++. You can find it on GitHub as marella/ctransformers: "Python bindings for the Transformer models implemented in C/C++ using GGML library." Interesting, right? It also supports GPTQ models; if you're familiar with these terminologies, GGML and GPTQ are what let you run these models on commodity hardware — that's the beauty of it. Then we'll use sentence-transformers — such an underrated embedding library, guys. We're going to use the all-MiniLM-L6-v2 model to create the embeddings. Then we need a vector store, because once you've created embeddings you have to store those vectors in a lower-dimensional space, and that's what a vector database or vector store is for. There are different options: the most famous nowadays is Chroma DB, which I honestly don't like much; there's FAISS by Facebook — we'll use the CPU build, though a GPU build is also available; there's Qdrant, which is also available through Docker images, which is very interesting; and there's Pinecone, but that's closed source, and I'm an advocate of open-source AI, so I'm not going to rely on it. So: we need the Llama 2 model, because it's performing very well on the leaderboards and it's new; CTransformers; sentence-transformers; and a vector store. Let me draw a high-level architecture diagram and then we'll move on to the coding part. You have your documents — your data. You need to preprocess it, and that preprocessing happens through LangChain, which does all the heavy lifting: loaders and splitters — RecursiveCharacterTextSplitter, TextSplitter, UnstructuredLoader, PyPDFLoader, DirectoryLoader, and thousands of other loaders. Thanks, LangChain — I'm a huge LangChain fan, though there are issues with LangChain in production that we'll cover in other videos. Once the text is loaded and preprocessed, we pass it to an embedding model — in our case, one of the sentence-transformers models. Once you have the embeddings, you store them, and that's where the vector store comes in: in this video we're relying on FAISS (CPU). You could use Chroma DB if you want, but I'm not getting good results with Chroma nowadays. Once the vectors are stored, the user types a prompt on the screen; the prompt goes to FAISS — again with LangChain's help — the relevant context is retrieved and handed to the LLM, and the LLM returns the response.
That's the very, very high-level architecture, guys — I'm not covering chunks and the other details here. A few things about vector stores: they have in-built similarity algorithms — cosine similarity, you've heard of that, plus Jaccard, Levenshtein, and lots of other algorithms built in — they're fast, so there's no real latency issue, and they handle metadata, etc. Those are the features of vector stores. So I hope you understood what we need: a quantized Llama 2 GGML model; CTransformers to load it (we are not going to load it via Transformers); sentence-transformers for the embeddings plus a vector store; and LangChain to pass everything to the model. Let me go to VS Code now and start writing the code, because I'm really excited. I'll create a file called ingest.py. The first thing we need is from LangChain — it's all LangChain here, you know. From langchain.text_splitter, import RecursiveCharacterTextSplitter. Next, the document loaders: we're going to rely on PDFs, guys, but you can use other formats too — if you have unstructured data like PPTs or text files, use UnstructuredLoader — so from langchain.document_loaders import PyPDFLoader and DirectoryLoader. Then we need the embedding model: from langchain.embeddings import HuggingFaceEmbeddings. You can load sentence-transformers directly as well; HuggingFaceEmbeddings can occasionally give you trouble due to version conflicts, so if you hit an error, you can swap it out — completely fine. Finally, from langchain.vectorstores import FAISS — Facebook AI Similarity Search, something like that; I forgot the exact full form, guys. So we have our imports for ingest.py: document loaders, text splitter, embeddings, and vector stores. Now let's define a data path: DATA_PATH points at the data folder you saw. I also need a place to store whatever embeddings we create, so let's define DB_FAISS_PATH as vectorstores/db_faiss — all the embeddings should get stored inside that path. Now let's write a function to create the vector database.
Define create_vector_db. Inside it, the first thing we need is a loader — I'm going to use DirectoryLoader, passing the DATA_PATH, a glob that looks at the file extension — "*.pdf", so we only accept PDFs right now — and a loader_cls of PyPDFLoader, because that's the library I want used to read the PDF data, which is why we imported it. Next, a documents variable: documents = loader.load() (thanks, Tabnine). Then we split: I'm going to use RecursiveCharacterTextSplitter with a chunk_size of 500 and a smaller chunk_overlap for this demo — you can extend this app further; let me know what you build with it after the video. Then texts = text_splitter.split_documents(documents). Now we create the embeddings: HuggingFaceEmbeddings with model_name "sentence-transformers/all-MiniLM-L6-v2", and in model_kwargs I set device to "cpu". If you have a GPU machine, that would be wonderful for you — I do have a GPU in this same machine, but I want to run this on CPU; that's why I'm making this video, since not everybody has a GPU. Then db = FAISS.from_documents(texts, embeddings): we hand FAISS our split texts and the sentence-transformers embedding model and say, use this embedding model to create all the vectors and store them — a store rather than a full database in this case. Then db.save_local(DB_FAISS_PATH) saves it locally (in Chroma you would have called persist instead). Finally, the usual if __name__ == "__main__": create_vector_db(). That's it — we're good with our ingest.py. When we run it, it will take a little time, but it will automatically create the folder, and once ingest.py is done we'll start writing the code for the model — model.py, app.py, bot.py, whatever you want to call it.
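Putting that walkthrough together, here is a sketch of ingest.py as described — the chunk_overlap value of 50 is my assumption, since the video only says to keep it small:

```python
# ingest.py -- build the FAISS vector store from the PDFs in data/
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

DATA_PATH = "data/"
DB_FAISS_PATH = "vectorstores/db_faiss"

def create_vector_db():
    # Load every PDF in the data directory via PyPDFLoader
    loader = DirectoryLoader(DATA_PATH, glob="*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()

    # Split pages into overlapping chunks so each embedding covers a manageable span
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    texts = text_splitter.split_documents(documents)

    # Open-source embedding model, forced onto the CPU
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cpu"},
    )

    # Embed the chunks, index them in FAISS, and persist the index to disk
    db = FAISS.from_documents(texts, embeddings)
    db.save_local(DB_FAISS_PATH)

if __name__ == "__main__":
    create_vector_db()
```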
Let's run it, guys. In the terminal I first have to activate my environment — I have a LangChain environment with all the required dependencies installed — and then: python ingest.py. That should run the file, create the embeddings, automatically create the folder, and save everything — the index plus the metadata, a pickle file alongside an index file — into the vectorstores/db_faiss folder. It will take a little time because the PDF has around 637 pages; it's a huge file, so probably three to four minutes. Meanwhile, let's create the model file: a model.py where I'll write all the code for the bot, and then we'll run model.py through Chainlit. First the imports — again LangChain; thank you, LangChain, for all the heavy lifting you do. From langchain import PromptTemplate — we're going to write a custom prompt for this. From langchain.embeddings import HuggingFaceEmbeddings; from langchain.vectorstores import FAISS — the same as before. Now a couple of new ones. From langchain.llms — LLMs, large language models; more than fifteen thousand of them are available on Hugging Face, guys; I recently saw a chart from a Stanford researcher, and it was fascinating — import CTransformers. Here it comes, guys: CTransformers. And people still rate Python higher than other languages — C and C++ are such fundamental programming languages. Then from langchain.chains — we need a chain here; there are different types, and I want to use a retrieval chain, but if you really want to focus on chat history you can also use a conversational retrieval chain — import RetrievalQA. And then import chainlit as cl — they've borrowed the naming from Streamlit, it seems, but it's very interesting; I'm spending a bit of time on Chainlit these days. We're done with the imports; let's also write the same DB_FAISS_PATH, vectorstores/db_faiss, because we need it here too. And look at the left-hand side: we got a new folder called vectorstores/db_faiss with index.faiss and index.pkl inside, and the process has stopped — our ingest.py has run successfully.
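So far, the top of model.py looks like this — a sketch of the imports just described:

```python
# model.py -- imports and the vector store path (must match ingest.py)
from langchain import PromptTemplate
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import CTransformers
from langchain.chains import RetrievalQA
import chainlit as cl

DB_FAISS_PATH = "vectorstores/db_faiss"
```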
All of our vectors are now stored in the vector store, so we are good with our embeddings, guys. Now, in model.py, let's write the functions quickly. First a custom prompt template — let's write something good here: "Use the following pieces of information to answer the user's question. If you don't know the answer, please just say that you don't know the answer; don't try to make up an answer." (I have a little sore throat, guys, and a cough and cold — sorry about that.) Then I also need a context placeholder — this context will come from the knowledge base — and a question placeholder, the question from the end user. And then: "Only return the helpful answer below and nothing else" — the correct, factual, helpful answer — followed by "Helpful answer:". That's our custom prompt template, and this is where we're going to use LangChain, guys — that's why LangChain is so, so important. Now let's set this custom prompt: define set_custom_prompt, which takes no parameters, with a docstring like "Prompt template for QA retrieval for each vector store." The context you see is the power of our knowledge base — the embeddings we created. Below that we build our prompt using LangChain's PromptTemplate function, passing template=custom_prompt_template and input_variables — a list of two items, "context" and "question". Then we return the prompt. We're good with this function.
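As a sketch, the prompt template and its setter as just described:

```python
custom_prompt_template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, please just say that you don't know the answer; don't try to make up an answer.

Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.
Helpful answer:
"""

def set_custom_prompt():
    """Prompt template for QA retrieval for each vector store."""
    prompt = PromptTemplate(
        template=custom_prompt_template,
        input_variables=["context", "question"],
    )
    return prompt
```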
Now the next function — should we write retrieval_qa_chain first, or load_llm? Let's write the function for loading the LLM, the model we downloaded, and this is where you'll see CTransformers. llm = CTransformers(...) — all the magic happens inside this. For the model argument I just copy the model file name from the folder and paste it here. CTransformers also looks at your model type, because it supports several model families — Llama, Vicuna, etc. — so I pass model_type="llama". Then a couple of hyperparameters: max_new_tokens, let's keep it at 512 for now, and temperature, let's keep it at 0.5. We're loading our model, guys, and then we return the llm — the large language model. So notice the change we've made: we are not using Transformers' from_pretrained or AutoModelForCausalLM; we are relying on CTransformers, the Python bindings for transformer models in C/C++, because it's really fast. You can also explore vLLM — check that out on GitHub as well, guys. Next, one more function: retrieval_qa_chain, which accepts llm, prompt, and db (I still have to write the function for the DB — we'll see that). Inside it I define qa_chain = RetrievalQA.from_chain_type(...), with llm=llm; chain_type="stuff" — you can also look at map_reduce and the others for summarization and such, but I'm using stuff here; retriever=db.as_retriever(...), which takes search_kwargs with the number of sources you want returned — I'll use k=2 here, that makes sense; and return_source_documents=True, because we have to explain the output to the end user — we want the response to come from the knowledge we fed into the system, since Llama 2 also has its own base knowledge, and we only want to return information from our documents. Then chain_type_kwargs, where I pass the prompt — it's a dictionary, a key-value pair, {"prompt": prompt} — the custom prompt we wrote. Then return qa_chain. So, what have we done so far? We imported the required libraries; we set the path of our embeddings; we wrote a custom prompt template — the better you write the prompt, the better responses you get, so explore that — plus a set_custom_prompt function that returns the prompt via LangChain's PromptTemplate; a load_llm function that loads TheBloke's quantized GGML model through CTransformers; and a retrieval_qa_chain that basically combines all of these.
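Here's a sketch of those two functions as described; the model path assumes the GGML file sits next to model.py:

```python
def load_llm():
    # Load the quantized GGML model through the CTransformers bindings,
    # not transformers' from_pretrained -- this is what keeps it CPU-friendly
    llm = CTransformers(
        model="llama-2-7b-chat.ggmlv3.q8_0.bin",  # path to TheBloke's quantized model
        model_type="llama",                       # CTransformers supports other families too
        max_new_tokens=512,
        temperature=0.5,
    )
    return llm

def retrieval_qa_chain(llm, prompt, db):
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",                                 # "map_reduce" etc. are alternatives
        retriever=db.as_retriever(search_kwargs={"k": 2}),  # return the top-2 source chunks
        return_source_documents=True,                       # surface sources to the end user
        chain_type_kwargs={"prompt": prompt},               # our custom prompt
    )
    return qa_chain
```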
Now a couple more functions. Define qa_bot, and here we'll use our embeddings — this is the same HuggingFaceEmbeddings setup, so I'll just copy it over from ingest.py. With the embeddings done, we build the DB: db = FAISS.load_local(...) — in ingest.py we did save_local, now we load_local — passing the DB_FAISS_PATH and the embeddings, created on the fly. Then I need the LLM, so llm = load_llm(), and the prompt, qa_prompt = set_custom_prompt(), using the function we wrote. The last piece is a variable that combines them through the retrieval_qa_chain function: qa = retrieval_qa_chain(llm, qa_prompt, db) — we're utilizing all the functions we wrote above — and we return qa. So we've completed the functions for loading the model and the retrieval QA chain. One last function for this part, before the couple of Chainlit functions: final_result. This is a little bit of output parsing — not much — around the query from the end user. If you were building a Streamlit app, it would be easy to take this function and drive it from an HTML text area or text input, but we're relying on Chainlit here. So: qa_result = qa_bot(), then response = qa_result({"query": query}) — a key-value dictionary — and we return the response. If you run this now, it should return the response to your query; you could print it in the terminal and play around with it.
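And a sketch of those two glue functions:

```python
def qa_bot():
    # Same embedding model as ingest.py -- load_local mirrors the earlier save_local
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cpu"},
    )
    db = FAISS.load_local(DB_FAISS_PATH, embeddings)
    llm = load_llm()
    qa_prompt = set_custom_prompt()
    qa = retrieval_qa_chain(llm, qa_prompt, db)
    return qa

def final_result(query):
    # Handy entry point if you'd rather drive this from Streamlit or a plain script
    qa_result = qa_bot()
    response = qa_result({"query": query})
    return response
```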
Now let me write the Chainlit part. If you look at the Chainlit overview in the documentation, it's very powerful, guys: an open-source Python package that makes it incredibly fast to build and share LLM apps — it provides you a conversational interface. We need some decorators; Chainlit has built-in ones, which you'll recognize if you've built microservices with Flask, FastAPI, or Django. The first is @cl.on_chat_start: what should be loaded when the chat starts. Under it, an async function called start. Inside, chain = qa_bot() calls the qa_bot function, and then I create a message: msg = cl.Message(content="Starting the bot...") — that's my message to the end user immediately on chat start — and await msg.send() to send it. Then we update the message content to "Hi, welcome to Medical Bot. What is your query?" — once loading finishes, that's what gets printed on the Chainlit UI — and await msg.update(). Finally, cl.user_session.set("chain", chain) stashes our QA bot in the user session. That's the first Chainlit function. You can explore the Chainlit documentation if you want to add images, markdown, and so on — they have that as well. It's extremely powerful for building LLM-based bots, because it gives you the interface plus a tracking mechanism — you can really track which functions are being called without explicitly writing a lot of code; it's built in. Now the last function, which gets all the results and responses onto the Chainlit UI. I add the @cl.on_message decorator — so when a message arrives — over async def main, which takes a message parameter. Inside, chain = cl.user_session.get("chain") pulls the chain back out. Then I need a callback handler: cl.AsyncLangchainCallbackHandler — we're using LangChain, so Chainlit ships an async LangChain callback handler — with stream_final_answer=True and answer_prefix_tokens set to FINAL and ANSWER. When the answer is reached — cb.answer_reached — the streamed final answer gets printed. Then res = await chain.acall(message, callbacks=[cb]) — an asynchronous call, with the callbacks inside a list. Now the answer is res["result"] and the sources are res["source_documents"] — I've already worked through this output format, which is how I know how to unpack it; you can explore it yourself, and if you're not using Chainlit you can just split the raw output to get the final values. Then a condition: if there are sources, we augment the answer — answer += a line break plus "Sources:" and the sources, which I think we have to wrap in str() — and otherwise we just append "No sources found". The last line sends a message whose content is the answer: await cl.Message(content=answer).send(). So we're sending this to the UI.
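A sketch of both Chainlit handlers as walked through above — note that Chainlit's API has changed across releases, so the exact handler details (for example, whether on_message receives a plain string or a Message object) reflect my reading of the mid-2023 version used in the video:

```python
@cl.on_chat_start
async def start():
    chain = qa_bot()
    msg = cl.Message(content="Starting the bot...")
    await msg.send()
    msg.content = "Hi, welcome to Medical Bot. What is your query?"
    await msg.update()
    cl.user_session.set("chain", chain)   # stash the chain for later messages

@cl.on_message
async def main(message):
    chain = cl.user_session.get("chain")
    cb = cl.AsyncLangchainCallbackHandler(
        stream_final_answer=True,
        answer_prefix_tokens=["FINAL", "ANSWER"],
    )
    cb.answer_reached = True              # stream/print the final answer
    res = await chain.acall(message, callbacks=[cb])
    answer = res["result"]
    sources = res["source_documents"]

    if sources:
        answer += f"\nSources: {str(sources)}"
    else:
        answer += "\nNo sources found"

    await cl.Message(content=answer).send()
```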
So now you can see what we've done: two functions, two decorators for Chainlit. The first fires when the app starts — it builds the QA bot, sends "Starting the bot...", and once that's done it updates the screen to "Hi, welcome to Medical Bot. What is your query?" The second runs on each message: when the user asks a question, it grabs the chain (the QA bot), gets the answer and sources, appends the sources if there are any or "No sources found" otherwise, and sends the message. So let's run this, and we should see the medical chatbot we've created using Llama 2 running on a CPU machine, guys. In the terminal: chainlit run model.py -w. First you see a readme — you can also see it here; it says "Hi there, we are excited to have you on board. This is a powerful bot designed to help you ask queries related to your data and knowledge." Some useful links: join the AI Anytime community — you can also join our WhatsApp group, where we share projects, job opportunities, internships, and so on. Happy chatting. Then the chat itself: "Hi, welcome to Medical Bot. What is your query?" This is the conversational interface, and that's why we're using Chainlit — earlier I used Streamlit in most of my videos, and I've also built interfaces with Bootstrap powered by Jinja2 on FastAPI, but we're trying Chainlit for the first time here, and it looks good alongside Gradio and Streamlit in the Python world. You can see "Built with Chainlit", and you have settings — there's a dark mode if you're a fan of that, but I'll keep light mode for now. I asked a question and the response came back very fast — do you know why? Because it's already in memory: when I started recording this video, I asked the same question at the beginning, and you can see all the previous questions are kept in memory as the last messages, so it's not going back to the model for the same query. That's a very good thing, by the way.
That's something I like about Chainlit — and it gives you the answer along with the sources. Let's ask a fresh question from the document, one we haven't asked yet. I'm not great at finding questions in a document when I don't know what to ask — it's such a huge file — but let's try amino acid disorders screening: what is amino acid disorders screening? I have a temperature of 0.5, so it will be reasonably creative. This is my query, asked through the Chainlit conversational interface, and Llama 2 will respond from the knowledge base we created with our embedding model. Hit enter, and you can see it says "using RetrievalQA" — which is a very good thing, because you can keep track of the function calls: go a level deeper and it shows "using stuff documents chain"; deeper again and "using LLM chain" is running in the back end right now. You can keep track of all the function calls — RetrievalQA, what kind of document chain you're using — and you can also hide it: if you hide the chain of thought from settings, these steps disappear. It will take a little time because it's running on a CPU machine — roughly one to two minutes per response, sometimes more. Your connection matters too: you may see "could not reach the server", because it's connected to localhost:8000 on your machine. So let's wait, and meanwhile I'll show you the model again — the one I took from TheBloke's Llama repo on Hugging Face; we're using Chainlit (link in the description, guys) and CTransformers. In the terminal you can keep track of everything: batches 100%, "loading faiss with AVX2 support", "successfully loaded faiss with AVX2 support". Now back in the UI — and you can see we've got our response, which says amino acid disorders screening is done... fantastic, right? And we also got our sources. You can ask follow-up questions as well, and you can use conversational retrieval memory, guys. This is exactly why I wanted to show this running on a CPU machine — a fantastic answer. Let's ask one more question.
Which question can we ask? Bartholin's gland cyst, something like that — let me find a tricky one: how can we treat Bartholin's gland if it is infected? When I ask this, you can again track it: using RetrievalQA, and if you expand it, using stuff documents chain, using LLM chain, running. Fantastic, right? You can also stop a task — the same way ChatGPT has "stop generating" — so it provides a very good conversational interface. You have the readme here; you can also design it a bit — put your own name in the taskbar, clean up the way the sources are returned (there are options to make it cleaner), even show the PDF data inside it — for all of that, go to the documentation. Under settings there's dark mode — you can see how it looks dark. It even picks up your system time: you can see 20:29, and currently we're at 20:36, 20:37 as it runs. It's very powerful, guys, when it comes to building a conversational interface with Chainlit. Let it load — this is the last question we're asking. The entire code will be given in the description; you can find it on GitHub, so go ahead and take the code base from there, and let me know what you're building with this quantized model, CTransformers, and GGML — they are basically a rescue for people who have limited compute power; you can use these models locally on your CPU machine for your own tasks, and that's the beauty of it. Let it load... you can see the batches, and we've got our output: if a Bartholin's gland becomes infected, it should be evaluated and treated by a healthcare professional familiar with the treatment of this type of infection. Wow. And you can see the source document — this is important, guys: it's giving the response from the knowledge we fed into the system. It's a QA bot we've built, but you can try out other things as well. That's what I wanted to show you in this video: a simple and powerful bot built using Llama 2 and loaded through CTransformers, which supports GGML models — thank you to TheBloke again. I hope you liked the video; if you did, please hit the like icon and share it with your friends and your circle, and if you haven't subscribed to the channel yet, do subscribe, guys. I've been working on many other videos that I've recorded and will post — very interesting ones, because we're also going to release videos that will not only help you create this kind of demo but also show how to control hallucinations, and how to look at guardrails, governance, data protection, and privacy
when it comes to large-language-model-powered applications and solutions — there will be a few videos along those lines as well, guys. So I'm looking for your support, and your guidance too: let me know any thoughts or feedback you have for me in the comment box. I'm not feeling well — I'm suffering from influenza — but that didn't stop me from recording this video, guys. I hope you liked it. Thank you so much for watching, and that's all for today's video — see you in the next one!
Info
Channel: AI Anytime
Views: 35,045
Keywords: Build a Medical Chatbot using Llama 2 on a CPU Machine: All Open Source, langchain, vector database, vector store, llama2, llama, llm
Id: kXuHxI5ZcG0
Length: 59min 15sec (3555 seconds)
Published: Sun Jul 23 2023