Build Real-Time AI Voice Assistant With RAG Pipeline And Memory | Mistral LLM | Ollama | LlamaIndex

Video Statistics and Information

Captions
Guys, in this video we are going to talk about how we can create a voice assistant bot using a large language model. Here is the highlight from the demo:

Customer: "Can you suggest me something from the Indian menu?"
Assistant: "Hello, I'd be happy to help you with that. What specific dish from our Indian menu would you like to order? Our options include [unclear], roti, idli, aloo paratha and poha. Let me know if any of those catch your interest."
Customer: "Okay, so I will go with one idli and one Coke."
Assistant: "Great choice! One idli coming up, along with one Coke. Your total comes to rupees 70. Is there anything else you would like to add to your order?"

So guys, that was the highlight from the demo, and we are going to learn this whole pipeline from scratch. In this use case the customer places the order using voice, and the voice is converted into textual data using the Whisper model. In this example, instead of going with Whisper, I went with faster-whisper. What is faster-whisper? It is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models. This implementation is up to four times faster than openai/whisper for the same accuracy while using less memory, and the efficiency can be further improved with 8-bit quantization on both CPU and GPU. That was the reason for going with faster-whisper instead of plain Whisper, and I would also recommend you go with faster-whisper: it is lighter and much faster than the original Whisper model. There are different Whisper model sizes available; in our use case we have used the medium model. As for the requirements, we need Python 3.8 or greater, and because I used a GPU I also had to install two extra packages. After spending some time I got to know that instead of individually installing cuBLAS and cuDNN, you can install the NVIDIA CUDA Toolkit (whichever version supports your system), which includes both packages. Those who are going to run it on CPU can just do pip install faster-whisper and it should work. There is also a usage example, which we are anyway going to see when we go through the code.
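As a quick illustration, a minimal faster-whisper sketch looks like this (the audio file name is illustrative, not from the video's code):

```python
# Minimal faster-whisper sketch; the audio file name is illustrative.
from faster_whisper import WhisperModel

# GPU settings approximating the video's; on CPU you would use
# WhisperModel("medium", device="cpu", compute_type="int8") instead.
model = WhisperModel("medium", device="cuda",
                     compute_type="float16", num_workers=10)

segments, info = model.transcribe("order_audio.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```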
Once this voice-to-text conversion happens, the text goes to RAG. Let's say the customer has asked for a Pepsi and a pizza: this textual information goes to the RAG pipeline. I'm not going deeper into RAG here because we have already discussed it in my previous videos, but just to give you a glimpse of how the data flows, think of it like you have a text file which has the information of this restaurant: the Indian menu, the English menu and the beverage information. This text file is converted into chunks, then into embeddings, and saved to the Qdrant DB. Now when the customer asks for Pepsi, the query gets converted into a vector and goes to the matching step. The retrieval fetches the chunk which has all the beverage information, where we have Pepsi and some other items, and once the retrieval brings out all the chunks matching the customer's query, they go to the large language model. That is exactly what is happening here: the moment the customer asks for Pepsi, the query goes to RAG, RAG fetches the matching chunks, and they are processed by the Mistral large language model. Ollama is a large language model framework: you can download and install it and run multiple large language models with it. Once the retrieval returns the chunks and they get processed by Mistral, the answer has to propagate back to the customer, which means it has to be converted from text to voice. That is happening using the gTTS package. We have used gTTS 2.5.1, the Google Text-to-Speech package; its documentation covers the features, the installation instructions and a quick start, including a snippet where "hello" is passed as a string and converted into a hello.mp3 audio file. Once we go through the code we'll understand it better.
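As a sketch, here is roughly what that quick start looks like, combined with pygame playback the way the video's voice_service helper is described later (the function name and file path here are illustrative):

```python
# Sketch of text-to-speech with gTTS plus pygame playback; this
# approximates the play_text_to_speech helper described in the video.
import time

import pygame
from gtts import gTTS

def play_text_to_speech(text, path="reply.mp3"):
    gTTS(text=text, lang="en").save(path)  # convert the text to an MP3 file
    pygame.mixer.init()                    # initialize the audio mixer
    pygame.mixer.music.load(path)          # load the generated MP3
    pygame.mixer.music.play()              # start playback
    while pygame.mixer.music.get_busy():   # wait until playback finishes
        time.sleep(0.1)

play_text_to_speech("hello")
```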
We can also see that we have chat memory in the pipeline. Why do we need this? In this use case the customer is having a conversation with the bot. Say he has placed a Pepsi; once the Pepsi is fed into the RAG, the customer keeps placing different items: after the Pepsi he orders a Coke, or a roti, or an idli, whatever it is. The conversation is not one shot where he places each and every item at once. He interacts with the voice assistant, the assistant gives some options, and he responds again with different items, deciding whether he wants this or that. So the bot should also have a memory of the chat it is having with the customer, so that it does not ask again and again on every query what the customer's name is or what has already been ordered so far. To have that capability we have added chat memory, using the LlamaIndex ChatMemoryBuffer class to retain the memory of the chat.

Now that we have covered all the components of the pipeline and all the packages and libraries we are going to use, let's jump into the coding part. Here is the code in VS Code, where we have three main files. First is app.py, which is the starting point of this application with the main method. The second is voice_service, which takes care of text-to-speech using gTTS. The third is the RAG module, where we have two files: the AIVoiceAssistant class, where we provide the configuration of the Qdrant vector DB, create the Ollama object with Mistral as the model, create the service context, create the knowledge base and create the chat engine; and the restaurant file which I talked about while explaining RAG, the text file we are going to feed in to create the knowledge base. So these are the three, or you could say four, files we are going to use.

First of all, create a virtual environment: in the VS Code terminal, type python -m venv followed by the name of the virtual environment (in my case, venv) and execute the command. I already have it, so I'm not going to execute it again. Once the environment is ready, open "Python: Select Interpreter" and provide the interpreter path; in my case it is .venv\Scripts\python.exe. The moment you select it, your Python interpreter is set, and when you open the command prompt you can see the virtual environment is activated, so from now on whatever you install goes into the virtual environment. Because I have already done it, I'm not going to repeat it, but to install all the dependencies you run pip install -r requirements.txt, and the moment you execute this command all the required packages and libraries get installed.

Once you have done this you are ready to execute the application, so let's try to understand what we are doing in the main method. First of all, I'm going to delete this chat history file because we are not using it anymore. We provide the model size to the Whisper model: the model size is nothing but the faster-whisper model variant, and as I said we use the medium model. We also pass other options. For the device, in my case I'm using a GPU so I have used cuda, but if you are using CPU only, change the device to cpu and the compute type to int8; it's better to refer to the faster-whisper page and get the correct information. For the number of workers I chose 10; it's optional, and even if you don't provide it, the default is used. Then we create a PyAudio object and open a stream with format paInt16, one channel, a rate of 16,000 and 1,024 frames per buffer. We create a customer input transcription variable, and then we have a while loop which keeps executing until the customer and the bot finish the conversation and the order is placed. Inside the loop we pass the PyAudio object and the stream to record_audio_chunk, which creates an audio file. After creating it we check whether the audio file is silent or not: if it is silent, we drop the file right there and don't process it further; if it is not, we get the transcription from that audio file. How do we get the transcription? By using the Whisper model's transcribe method, passing the file path where the audio file has been saved. Once we get the transcription, we remove the audio file; we don't need it anymore, since we are going to generate such files again and again.

Why is the audio file created and deleted again and again? Think of it like this: the customer just keeps talking with the bot, and the conversation always alternates. When the customer talks, the bot listens to what he needs and what kind of items he wants to order from the menu, and when the bot talks back to reply to the customer's query, the customer is silent and listens to what the bot is replying. Each of these turns is a chunk of audio. Let's say the customer says "hi": that will be one chunk of audio, it goes through the whole pipeline, the application generates a reply in text, and that text is spoken back by the bot as voice. This is how the conversation gets exchanged.
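Put together, the recording side of that loop looks roughly like this (a sketch only: the helper names approximate the video's code, and the RMS silence threshold is illustrative):

```python
# Sketch of the audio-capture loop; helper names approximate the
# video's code and the silence threshold is illustrative.
import audioop  # deprecated since Python 3.11 but fine for 3.8+
import wave

import pyaudio

CHUNK_FILE = "temp_audio_chunk.wav"
RATE = 16000
FRAMES_PER_BUFFER = 1024

def record_audio_chunk(p, stream, seconds=10):
    # Read ~10 seconds of microphone audio and save it as a WAV file.
    frames = []
    for _ in range(int(RATE / FRAMES_PER_BUFFER * seconds)):
        frames.append(stream.read(FRAMES_PER_BUFFER))
    with wave.open(CHUNK_FILE, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))

def is_silent(path, threshold=500):
    # RMS-based check; silent chunks are dropped without transcription.
    with wave.open(path, "rb") as wf:
        rms = audioop.rms(wf.readframes(wf.getnframes()), wf.getsampwidth())
    return rms < threshold

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=FRAMES_PER_BUFFER)
```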
So, concretely, we take 10 seconds of audio from the customer, get the transcription from that particular chunk, and delete the chunk file once the transcription is done. Then we print the customer transcription and pass it into the interact_with_llm function, so let's quickly check what is happening in the chat.

Before that: the AIVoiceAssistant object is created in app.py itself, at module level, so it gets executed first. While executing, we create the Qdrant client, we create the Ollama object, passing mistral as the model and a request timeout of 120 seconds, and we create the service context, passing the large language model and using the local embedding model. Simultaneously we create the knowledge base. In the knowledge base we pass the restaurant file, which is nothing but the information of the restaurant: it's Bangalore Kitchen, a hypothetical restaurant which doesn't exist, located in Whitefield, Bangalore, and we have an Indian menu, an English menu and some beverages. Once this gets executed and we get the documents, the documents are converted into chunks and then vectors, and eventually saved in the Qdrant DB; we are going to see that while executing the whole application. That is what is happening in the knowledge base creation.

Then we create the chat engine. In the chat engine we use the ChatMemoryBuffer because, as I said, we are not only using RAG but also the history of the chat, so that the information exchanged between the bot and the customer is saved and used in the interaction, and we keep the token limit at 1,500. We want the conversation history in memory, and that is why we use as_chat_engine rather than as_query_engine: when we use a query engine we can't keep the memory, so we use a chat engine, passing the memory option as the ChatMemoryBuffer and keeping the chat mode as context. Then we provide the prompt, which says: you are a professional AI assistant receptionist working in one of Bangalore's best restaurants, called Bangalore Kitchen; ask the questions mentioned inside square brackets, where we have mentioned questions like asking the name and contact number and what they want to order, and end the conversation with greetings; if you don't know the answer, just say that you don't know (this is basically for avoiding hallucination); provide concise and short answers; and don't chat with yourself.
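Condensed into code, that setup looks roughly like this. This is a sketch assuming the llama-index 0.9-era API that was current around the video's publication date (import paths changed and ServiceContext was removed in later releases), and the file and collection names are illustrative:

```python
# Sketch of the AIVoiceAssistant setup, assuming the llama-index
# 0.9-era API; file and collection names are illustrative.
import qdrant_client
from llama_index import (ServiceContext, SimpleDirectoryReader,
                         StorageContext, VectorStoreIndex)
from llama_index.llms import Ollama
from llama_index.memory import ChatMemoryBuffer
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="kitchen_db")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

llm = Ollama(model="mistral", request_timeout=120.0)
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# Build the knowledge base from the restaurant text file.
documents = SimpleDirectoryReader(input_files=["restaurant_file.txt"]).load_data()
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, storage_context=storage_context
)

# A chat engine (not a query engine) so the conversation history is kept.
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "You are a professional AI assistant receptionist at Bangalore "
        "Kitchen. Ask for the customer's name, contact number and order. "
        "If you don't know the answer, just say that you don't know. "
        "Provide concise, short answers and don't chat with yourself."
    ),
)
print(chat_engine.chat("Hi, do you have idli?"))
```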
So that was it about the AIVoiceAssistant; let's go back to the flow of the while loop. Once we have the output from the large language model, we check whether the output is valid, strip it, and pass the text into play_text_to_speech. In play_text_to_speech we again use gTTS, together with pygame, which is used here to initialize the mixer, load the audio and play it; pygame is also one of the requirements you have to install as a package.

Now that we have gone through all the files and understood the code, let's run the application. We are going to type python app.py, but before running it, let's start all the required services. We are using Ollama, so we have to run Ollama, and we are using Qdrant as the vector DB, so we have to run Docker to run Qdrant. You can go to ollama.com and download their software; they have builds for Mac, Linux and Windows, and they have a model list, so you can install any of these models on your device. We have downloaded Mistral, and they have Mistral, Llama 2, Gemma and all the popular models available. To get a better understanding you can go through their GitHub, the README and the quick start. So we quickly go to the command prompt and run the Mistral model, but this gives me an error saying "only one usage of each socket address", which means something is already running on port 11434. On Windows we can check all the processes running on this port and terminate them, and once all those processes have been killed, ollama serve works. What we have done here is: we first ran Mistral on the Ollama framework, and now we run ollama serve so that the inference server is running, and our code will utilize that server. Next we are going to run the Qdrant server. To do that you have to have Docker installed on your device, and you have to pull the Qdrant image and run the container, which I'll quickly show you how to do.
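The commands involved look roughly like this (Windows syntax for the port check; the PID is a placeholder you read off the netstat output):

```
:: pull and run the Mistral model, then start the inference server (port 11434)
ollama run mistral
ollama serve

:: if "only one usage of each socket address" appears, find and kill
:: whatever already holds port 11434
netstat -ano | findstr :11434
taskkill /PID <pid> /F

:: pull and run the Qdrant vector DB (dashboard at localhost:6333/dashboard)
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
```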
If you come to qdrant.tech, it gives you a brief about what the Qdrant vector DB is and how to run it using the Docker commands, and it also has a quick start guide: click it and you get the procedure to download the image and run it on the localhost 6333 port. There is also a step-by-step tutorial, so please go through it so that you have a basic understanding of how you can run this vector DB. Apart from that, if you don't want the Qdrant vector DB, you can go for the alternative options, for example Chroma DB or Weaviate; it's up to you, it's just that you have to make a few changes in the code and you will be done.

So let me start the Qdrant container; you can see the status is now running. If I go to localhost 6333, open the dashboard and the collections, you can see there is only one collection right now, the kitchen_db collection. I'm going to delete it, because this is the collection we are going to create using our application. We go back to our code and run it. The moment we run the application, the restaurant file gets extracted and the knowledge base is created ("knowledge base created successfully"). If you go to the localhost 6333 dashboard, open the collections and refresh the page, you can see that the collection has been created, and in its text node you can see the content of the file: all the menu items with their pricing information are available here. If you go to the info part, you can see there is only one vector, because the file is relatively very small; we are using RAG here just to have the pipeline in place. In our use case it's a very small file, but in some use cases it could be huge, there might be multiple chunks, and in that case you would see a lot of vectors here. In our case, because it is small, we have only one. Now, the conversation with the assistant:

Customer: "Hi, do you have idli?"
Assistant: "Yes, we do have idlis on our menu. Would you like to place an order for that? Additionally, may I have your name and contact number, please?"
Customer: "Yeah, so my name is Ayaan and my contact number is 9991."
Assistant: "Hi Ayaan, thank you for sharing your name and contact information. Let me confirm: you would like to order idlis from our menu, correct? What else would you like to add to your order, if anything, from our Indian or English menu or beverages? Also, please note that all our orders are prepared fresh and it might take some time for the food to be ready. Would you prefer to wait at the restaurant or have it packed for takeaway? Lastly, do you have any dietary restrictions or preferences we should keep in mind while preparing your order?"
Customer: "Yeah, so I'll only have one idli and I'll have one Pepsi, and that's it."
Assistant: "Understood, Ayaan. You would like to order one idli from our Indian menu and one Pepsi from our beverages section, correct? Your total amount comes up to rupees 90. Is that correct, and you don't require anything else from our English menu at this time? Also, please note that due to high demand there might be a wait time for the order preparation. Would you prefer to wait at the restaurant or have it packed for takeaway? Lastly, we are happy to accommodate any dietary restrictions or preferences you may have; if you have any specific requirements, please let us know before placing your order."
Customer: "So I think the calculation is wrong; it is not 90, it should be 60."
Assistant: "Apologies for the mistake in the previous calculation. One idli from our Indian menu costs rupees 20 and one Pepsi from our beverages section costs rupees 40, so your total amount comes up to rupees 60. Is that correct, and you don't require anything else from our English menu at this time? Also, please note that due to high demand there might be a wait time for the order preparation. Would you prefer to wait at the restaurant or have it packed for takeaway? Lastly, we are happy to accommodate any dietary restrictions or preferences you may have; if you have any specific requirements, please let us know before placing your order."
Customer: "Thank you, that's all from my side. I'll prefer to have it at the restaurant itself."
Assistant: "Thank you, Ayaan, for confirming your order and your preference to dine in with us. Your total amount comes up to rupees 60 for one idli and one Pepsi. We will prepare your order as soon as possible. Please let us know if there is anything else we can help you with. In the meantime, have a great day at Bangalore Kitchen!"
Info
Channel: Ayaansh Roy
Views: 4,573
Keywords: #LLMs, #AIIntegration, #Tutorial, #MachineLearning, #ArtificialIntelligence, #DeepLearning, #NeuralNetworks, #NaturalLanguageProcessing, #AIDevelopment, #ModelIntegration, #AIProjects, #AIApplications, #AIProgramming, #WebDevelopment, #AIInnovation, #SoftwareDevelopment, #mistral, #mistralofmilan
Id: mdV8lETtGY4
Length: 25min 25sec (1525 seconds)
Published: Fri Mar 15 2024