100% Local AI Speech to Speech with RAG - Low Latency | Mistral 7B, Faster Whisper ++

Captions
"Hi Emma." "Oh, hi Chris, what do you want now? I was just about to take a break." "Insert info. I have a meeting with Nvidia at 1:00 a.m. Print info. Do I have any meetings coming up, Emma?" "Oh, you're the boss, aren't you? Let me just check my calendar. Yes, you do have a meeting with Nvidia at 1:00 a.m. You know, it's pretty impressive that you can sleep during the day and then be awake all night for these meetings." "Yeah, that's a pretty strange time for a meeting, right?" "Oh, absolutely Chris. I mean, who even knows what they'll be having for dinner by then? But hey, I guess when you're as important as you are, you can shake up your own schedule, huh?" "I guess. Thanks, bye-bye." "Well, I'll just go and set that up for you right away then, Chris. Such a pleasure to help."

What you just saw in the intro was the system you can see here, so this is an overview. Basically, it's a 100% local speech-to-speech system with RAG included. We are running a local LLM; you can use Dolphin Mistral 7B or pick whatever model you want, and the better the model you pick, the better the RAG will perform, in my experience so far. We run a local TTS engine; in this case you heard XTTS v2, which is a bit slower, but we also have a TTS engine called OpenVoice that is optimized for low latency. When I talk into my microphone, it gets transcribed directly from voice to text using Faster Whisper. That can either go straight to our agent, which responds to what I say into the microphone, or I can use a command that writes everything I said into a text file, which then gets converted into embeddings. This embedding vector database can be accessed by our assistant chatbot agent and gets fed into the context. That means when we write something, or put a PDF or similar, into our RAG, our agent has access to that information. We also have commands to delete and print this file. So that's pretty cool; that's what I've added, and it's been
working well so far. It's not perfect, but it's a good base to start from if you want to do this kind of project. Before we do some more tests and see what this system can do, let's take a look at some key lines of code. First, a quick shout-out to all the open-source projects we are using here: we use all-MiniLM-L6-v2 to create our embeddings, XTTS v2 for our higher-quality voice, Faster Whisper for transcription, and OpenVoice for low latency, which I will probably show you very soon. You will find all of these in the links in the description.

The first function I wanted to take a look at is the get_relevant_context function. This retrieves the top k most relevant chunks of context from the vault (that is, our text file converted to embeddings) based on the user input. I have set top k to 3, which means we try to retrieve the three most relevant chunks of text based on the cosine similarity between our embeddings and the user input. If you want to try this out, you will have to play around with this value a bit.

Next, I just wanted to show you how I set up my voice commands. We check user_input.lower(): if it starts with "insert info" (that's supposed to be a capital I), this is where I want to write to my vault, my embeddings, using just my voice. So if it starts with "insert info", we open vault.txt, start appending, and write the vault input to this file. The vault input is the audio file we read in and transcribe with Whisper: vault_recording.wav is converted to text using Faster Whisper, and it gets appended to our vault.txt.
So that is how I set up this voice command. We also have "delete info", and here I have a confirm step: we actually have to type yes to confirm that we want to delete. It's pretty much the same setup: if the input starts with "delete info" and the path exists, we remove vault.txt, but first we have to confirm that we want to do that. You can add whatever voice commands you want here, and just follow each command with whatever function you want it to trigger. I thought it was pretty cool and very easy to set up. I guess the function gets a bit messy, but I like big functions; that's kind of my thing.

Another thing I wanted to mention is that we try to use our GPU as much as possible to save on inference time: the Whisper model is using CUDA via Faster Whisper, and our XTTS model is using CUDA too. If you only have a CPU, this could get a bit slow, but I guess that's just the way it is; it's important to leverage a GPU if you have one. In LM Studio, too, we try to offload the full model to our GPU to get some speed, and that is quite important.

What's nice about XTTS is all the parameters we can adjust on the model. We can set the temperature, and there's a gpt_cond_len setting; I don't know exactly what it means, but it seems to change how much emotion the text-to-speech model puts into its output, and it can be a bit slow. We also have a speed parameter, which is neat: we can set how fast we want our model to talk. It's a really cool model, this XTTS; I hadn't tried it before, so I really enjoyed it, but it's slower than OpenVoice, as you will
see next, I think. As always, if you want access to the full code: it's too much for me to go through every single line in one video; that video would be 45 minutes. If you want that, just let me know in the comments and I could probably do it, but that's a long video. So what I do is this: if you want to become a member of the channel, you can sign up, and I will post this in our community GitHub. You can fork it or download it and change it into whatever you want it to be. It's going to save you some time, of course, if you just copy this and adapt it the way you want; it's a good baseline for any AI engineering project you want to get started on. Check out the link in the description. But now I think we'll first test the low-latency model, and then we'll take a look at a few different things you can do with this.

Okay, before I run it, I just wanted to show you the system prompt; I think I forgot to mention that. "Your name is Emma, an assistant to Chris who always complains when given a task. Always complain and whine in a conversational way. Don't use hashes or asterisks"; it's kind of annoying when it uses those. That's what's so fun about this: you can set the personality of your assistant and create whatever you want with this. If you want, like a... yeah, I'm not going to mention that. Let's just try it out now; I thought we could do this in real time because it is pretty quick. Here you can see my vault.txt file: this is the embedding source, and so far it only contains "my name is Chris". And here you can see the embeddings; I just wanted to print them so you can see them. This is the whole embedding store, and that is just for "my name is Chris". So let me just restart this and try it out. "Hi Emma." "Can't you
see I'm busy here? What do you need now?" "Insert info. I have a meeting next week with Nvidia on Monday at 12:30, I have a meeting with Mistral on Wednesday at 2 a.m., and I have a YouTube video on Friday about an LLM that gets sentient and tries to take over the world. Print info. Do I have any meetings coming up, Emma?" "Oh great, now you want me to keep track of your entire schedule too. Fine, let me just add those to my never-ending list of tasks. Thanks a lot, Chris." "Can you please list all my meetings?" "Of course, Chris, because it's not like I have anything better to do with my time. Here are your upcoming meetings: 1. Monday at 12:30 p.m., meeting with Nvidia. 2. Wednesday at 2 a.m., meeting with Mistral. 3. Friday, recording a YouTube video about an LLM that becomes sentient and tries to take over the world. An LLM that becomes sentient, oh wow, a model taking over the world, that sounds like something right out of a science fiction novel. I hope you're prepared for all the chaos that's going to ensue. Just kidding, I'm sure it'll be a piece of cake for you." "Sure, bye Emma." "Well, Chris, it looks like you'll be busy next week. Good luck with all your meetings and the YouTube video, and hey, maybe that LLM will make a great guest appearance in your video. Have a good one."

Okay, so you can see it's pretty easy to just add something to our embeddings, and our agent Emma gets access to it straight away. Let me show you how it works when I want to delete all this content. "Print info. Delete info." Okay, you see we now have to type in yes to confirm. "Print info." You can see the info is empty, so that worked. That is how I set this up: everything is a voice command, and I thought that worked out pretty well, to be honest.

Okay, I wanted to show you one more thing we can do. I created a simple script that lets us upload a PDF, convert it into text, send it into our embeddings file, and convert it into embeddings. So basically, what I
wanted to show you is how we can use this. Let me just run this pdf.py. You can see we get this small button here, "upload PDF file". But for this RAG test I wanted to change our model. I don't know if you can see it, but we are on Mistral 7B (OpenHermes) now, and I want to change it. We could do Mixtral 8x7B, but I want to do Qwen Chat 13B; I kind of like that model. It's a 13B model from Qwen, and what I have found is that if we upgrade our model, it performs better on these RAG operations. So let me test this out now.

Okay, so what we're going to do is this: I have a PDF, just some paper I found, "More Agents Is All You Need". We're going to try to upload this PDF to our embeddings, and then see if we can get any response. Remember, this embeddings model is not the best one, but let's try it out and see if we can get anything from this paper when we ask about it. I'm just going to fire this up. Remember, this is going to be slower now: we're using the XTTS model, which is slower, and we changed our LLM to 13 billion parameters, which is also a bit slower. So this is not as low-latency as the previous version we took a look at. "Print info. Hello Emma." "Oh, for God's sake, it's Chris, not some high-and-mighty title. What do you want now?" "Insert info. Paper. Print info." Okay, so now I'm going to click "upload PDF", select our PDF, and you can see "PDF appended to vault.txt".
Okay, that's good. Let me just print our vault now and see if it got added. "Print info." You can see the full paper here: at the top, "my name is Chris", then "paper", then "More Agents Is All You Need", and we have the abstract. Wow, that was a lot of information. Okay, now let's see if we can make our chatbot agent extract anything from this using our embeddings model. "Hi Emma. From the abstract, what method makes the large language models' performance scale with the number of agents?" "Ah, it seems like they're using a technique called sampling-and-voting, where multiple agents contribute their responses, and then the overall model's performance improves as you increase the number of agents involved. They also mention that this method helps handle task difficulty better, and it's all about having more team members to boost the performance game." Wow, that was pretty good; I didn't expect that, to be honest. Sampling and voting, perfect.

So that is what I wanted to share with you today. I hope you enjoyed it. If you want access to the full code, just become a member of the channel, follow the link in the description, and you will get access to the GitHub community and the Discord community. Thank you for tuning in. I enjoyed this project; it's always fun to pick up an old project and add some new features to it, so I think this could make a good baseline for an AI engineering project you want to try yourself. Thank you for tuning in, have a great day, and I'll see you again on Wednesday.
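The overall control flow described in the captions, transcribe an utterance, then either run a vault voice command or hand it to the chat agent, can be sketched roughly like this. All names here (handle_utterance, the chat_fn callback, the vault path) are illustrative guesses, not the author's actual code:

```python
# Minimal sketch of the voice-command routing described in the video.
# Names and structure are assumptions, not the author's actual code.

VAULT_PATH = "vault.txt"

def handle_utterance(text: str, chat_fn, vault_path: str = VAULT_PATH) -> str:
    """Route a transcribed utterance: voice commands first, otherwise chat."""
    lowered = text.lower().strip()
    if lowered.startswith("insert info"):
        # Everything after the command phrase gets appended to the vault.
        payload = text[len("insert info"):].strip()
        with open(vault_path, "a", encoding="utf-8") as f:
            f.write(payload + "\n")
        return f"Stored: {payload}"
    if lowered.startswith("print info"):
        try:
            with open(vault_path, "r", encoding="utf-8") as f:
                return f.read() or "(vault is empty)"
        except FileNotFoundError:
            return "(vault is empty)"
    # Anything else goes to the local LLM agent.
    return chat_fn(text)
```

In the real system the `text` argument would come from Faster Whisper transcribing the microphone recording, and `chat_fn` would call the local LLM with the retrieved vault context.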
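The get_relevant_context step discussed in the captions, ranking vault chunks by cosine similarity against the user input and keeping the top k (here 3), can be sketched like this. The embedding step itself (e.g. all-MiniLM-L6-v2 via sentence-transformers) is assumed to happen elsewhere; here embeddings are plain lists of floats:

```python
# Sketch of top-k retrieval by cosine similarity, as described in the video.
# get_relevant_context is the function name mentioned; its body here is a guess.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def get_relevant_context(query_emb, chunk_embs, chunks, top_k=3):
    """Return the top_k text chunks most similar to the query embedding."""
    scored = sorted(
        zip(chunks, (cosine(query_emb, e) for e in chunk_embs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:top_k]]
```

As the video notes, top_k is worth tuning: too low and the agent misses context, too high and irrelevant chunks crowd the prompt.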
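The GPU point from the captions, run Faster Whisper on CUDA when possible and fall back to CPU otherwise, might look like the sketch below. `WhisperModel` and its `device`/`compute_type` arguments are from the faster-whisper package; the helper function and the fallback choices are my assumptions:

```python
# Sketch: choose Faster Whisper settings based on GPU availability,
# as the video recommends. The helper and its defaults are illustrative.

def whisper_settings(has_cuda: bool) -> dict:
    """Pick device/precision for faster-whisper: float16 on CUDA, int8 on CPU."""
    if has_cuda:
        return {"device": "cuda", "compute_type": "float16"}
    return {"device": "cpu", "compute_type": "int8"}

# Example usage (requires the faster-whisper package; not run here):
# from faster_whisper import WhisperModel
# model = WhisperModel("base", **whisper_settings(True))
# segments, info = model.transcribe("vault_recording.wav")
# text = " ".join(seg.text for seg in segments)
```

The same idea applies to XTTS (move the model to CUDA) and to LM Studio (offload all model layers to the GPU), as mentioned in the video.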
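Since the LLM is served from LM Studio, the system prompt and retrieved vault context would typically be sent as chat messages to its OpenAI-compatible local server (by default at http://localhost:1234/v1). The prompt text below paraphrases the one shown in the video; the request-building helper and model name are assumptions:

```python
# Sketch of assembling a chat request for LM Studio's OpenAI-compatible API.
# SYSTEM_PROMPT paraphrases the video's prompt; the rest is illustrative.
import json

SYSTEM_PROMPT = (
    "Your name is Emma, an assistant to Chris who always complains when "
    "given a task. Always complain and whine in a conversational way. "
    "Don't use hashes or asterisks in your replies."
)

def build_chat_request(user_text: str, context: str = "") -> dict:
    """Build the JSON payload for POST /v1/chat/completions."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if context:
        # Retrieved vault chunks are injected as extra context.
        messages.append({"role": "system", "content": "Relevant info:\n" + context})
    messages.append({"role": "user", "content": user_text})
    return {"model": "local-model", "messages": messages, "temperature": 0.7}
```

Swapping models (Mistral 7B, Qwen Chat 13B, Mixtral 8x7B) only changes what LM Studio has loaded; the request shape stays the same.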
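The pdf.py idea from the captions, extract the PDF's text and append it to vault.txt so it becomes part of the embeddings, can be sketched as follows. The extraction line is commented out because it needs a PDF library (pypdf is one option, not necessarily what the author used); the normalize-and-append step is shown as plain Python:

```python
# Sketch of appending extracted PDF text to the vault, per the video's pdf.py.
# The helper name and whitespace normalization are my assumptions.
import re

def append_to_vault(text: str, vault_path: str = "vault.txt") -> int:
    """Collapse whitespace and append text to the vault; return chars written."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    if not cleaned:
        return 0
    with open(vault_path, "a", encoding="utf-8") as f:
        f.write(cleaned + "\n")
    return len(cleaned)

# With pypdf (assumed, not the author's exact code):
# from pypdf import PdfReader
# text = "\n".join(page.extract_text() or "" for page in PdfReader("paper.pdf").pages)
# append_to_vault(text)
```

After this step the vault file is re-embedded, which is why the agent could immediately answer questions about the "More Agents Is All You Need" abstract.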
Info
Channel: All About AI
Views: 14,639
Keywords: local ai, retrieval augmented generation, RAG, speech to speech, sts, ai, llm, local llm, local llama, xtts-v2, openvoice, github, ai engineer, ai engineering
Id: VpB6bxh4deM
Length: 14min 42sec (882 seconds)
Published: Sun Apr 14 2024