Using Local Large Language Models in Semantic Kernel

Captions
Hey everyone, my name is Will Velida. I'm a software engineer and cloud architect, and in this video we're going to learn how we can use local large language models with Semantic Kernel. Now, in previous videos I've used Azure OpenAI and OpenAI with Semantic Kernel, but those options require us to either have access to Azure OpenAI, which is currently restricted, or to buy credits to use the OpenAI API. In this video, we're going to explore two options for building Semantic Kernel applications using large language models hosted on our local machine.

The first option we'll look at is Ollama. Ollama is a tool that allows us to run small and large language models on our local machine. Here I am on the Ollama landing page, and I can see I can run models such as Llama 3, Phi-3, Mistral, Gemma 2, or any other models that are available. If I click on Models, I can see all the models available to me through Ollama, or I can even create my own custom large language models and run them through Ollama as well, which is pretty cool.

To get started with Ollama, all we really need to do is download it from the website. It's available for macOS, Linux, and Windows, and I've gone ahead and done this. Installing it gives us a local web server running on localhost:11434, and we can interact with models through that server via the Ollama CLI. If I look at the GitHub page, I can see that I can run models using "ollama run", along with a whole bunch of different models, their sizes, and how to download them onto my local machine. I've gone ahead and downloaded Phi-3, so that's the model we're going to be using in this tutorial. We can also create our own custom models, or pull models down; it's very similar to how you work with images in Docker. Just as Docker pulls down a container image, here we're pulling a local large language model onto our machine, which is pretty cool.

So what I want to do is jump into the terminal. If I type "ollama", I can see there's a whole bunch of commands I can run. I've already pulled down a model, so I'll run "ollama list" to see the models available to me, and there we are: Phi-3 is available, downloaded two months ago. If I want to start and run that model, I run "ollama run phi3". I can exit it again with "/bye", and I'll clear the screen to get a bit more real estate. Now, if I say "give me three bullet points on leg exercises that I can do in the gym" and press Enter, it's going to generate a response from the model. It's talking about squats, a favourite exercise, and it's generating text right there in the command line. So it's responding to us just as a local large language model would in any kind of application.
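Here's a sketch of that terminal session. It assumes Ollama is installed and on your PATH, and that "phi3" is the tag Ollama uses for the Phi-3 model:

```
# list the models already pulled onto this machine
ollama list

# pull Phi-3 if it isn't there yet
ollama pull phi3

# start an interactive chat with the model
ollama run phi3
>>> Give me three bullet points on leg exercises that I can do in the gym.

# exit the interactive session
>>> /bye
```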
But that's all very well and good through the terminal. So I'm going to stop the model and go into Visual Studio. This is just a bare-bones Visual Studio solution, and all I've done is install the Semantic Kernel NuGet package, the latest version; I haven't really installed anything else.

To actually use Ollama within this console application, all I need to do is declare a var endpoint and create a new Uri pointing to where the Ollama host is serving that local large language model, which is localhost:11434, and then provide a model ID. This will be the name of the model we're using, which is phi3. Cool.

From here we can start to build our Semantic Kernel just as we would in any other Semantic Kernel application. I create my kernel builder (importing the using statement for Semantic Kernel), and then I add OpenAI chat completion. For the model ID, I pass in the model ID; for the API key, I don't need to provide one, so I can just use null; and the endpoint will be the endpoint for our Ollama model. I do need to suppress a compilation error here, because this overload is experimental, but once I've done that I can call Build on my kernel builder, which builds my kernel for me. Fantastic.

Now I'm going to do a bit of copying and pasting, in a few steps. First, I paste in the prompt. I'm creating an exercise bot, essentially telling my Semantic Kernel application: this is an exercise bot that can have a conversation with you about any fitness-related topic, give explicit instructions on how to perform exercises, or provide general information about fitness; if it doesn't know the answer to a question, it should just respond saying "I don't know", and it should format the response in this manner. Using that prompt, I create a kernel function from the prompt I've given it, and then I add my chat history as a kernel argument. Cool.

Once that's done, I copy and paste this while loop. We say "Please ask me a question", and as long as the input isn't "quit", we essentially let Semantic Kernel answer whatever questions the user might have. And in this case we're not using Azure OpenAI or OpenAI's API; we're using our local model served through Ollama. So I'll build that to make sure there's nothing wrong with it: build started, and there we go, succeeded.

Now I'll just run it, expand the window a little, and ask: "Give me three leg exercises, in brief bullet point format." Not the greatest question in the world, but whatever. And there we go: Semantic Kernel gives me a response to my question. It recommends squats, lunges, and deadlifts, which is fantastic. So that's just a quick example of how you can use Ollama to host a large language model on your local machine and use Semantic Kernel to build applications that interact with that local large language model.
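Here's a minimal sketch of the whole console app as described in the walkthrough. It assumes Semantic Kernel 1.x, where the AddOpenAIChatCompletion overload that takes a custom endpoint is still experimental (hence the SKEXP0010 suppression); the exact prompt wording and the "ExerciseBot" name are my approximation of what's pasted in the video:

```csharp
#pragma warning disable SKEXP0010 // the custom-endpoint overload is experimental

using Microsoft.SemanticKernel;

// Point at the local Ollama server and the model we pulled earlier
var endpoint = new Uri("http://localhost:11434");
var modelId = "phi3";

// Build the kernel against the OpenAI-compatible local endpoint (no API key needed)
var kernelBuilder = Kernel.CreateBuilder();
kernelBuilder.AddOpenAIChatCompletion(modelId: modelId, apiKey: null, endpoint: endpoint);
var kernel = kernelBuilder.Build();

// The exercise-bot prompt, with placeholders for the chat history and the user's input
const string prompt = """
    You are ExerciseBot. You can have a conversation about any fitness-related topic,
    give explicit instructions on how to perform exercises, or provide general
    information about fitness. If you don't know the answer, just say "I don't know".

    {{$history}}
    User: {{$userInput}}
    ExerciseBot:
    """;

var chatFunction = kernel.CreateFunctionFromPrompt(prompt);

// Seed the chat history as a kernel argument
var history = string.Empty;
var arguments = new KernelArguments { ["history"] = history };

// Keep answering questions until the user types "quit"
Console.WriteLine("Please ask me a question (type 'quit' to exit):");
string? userInput;
while ((userInput = Console.ReadLine()) is not null && userInput != "quit")
{
    arguments["userInput"] = userInput;

    var answer = await chatFunction.InvokeAsync(kernel, arguments);
    Console.WriteLine(answer);

    // Append the exchange to the history so the bot keeps context between turns
    history += $"\nUser: {userInput}\nExerciseBot: {answer}\n";
    arguments["history"] = history;

    Console.WriteLine("Please ask me a question (type 'quit' to exit):");
}
```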
Cool, so let's move on to our second option, which is LM Studio. With LM Studio we can run large language models on our local machine, offline. We can use those models through its in-app chat UI, or we can interact with them through an OpenAI-compatible local server, and we can also discover and download large language models through the application. I can see I can look for Llama models, Mistral models, Phi-3 (which we'll be using), Falcon, StarCoder; all of these cool names, and I don't know how they come up with them, but still pretty cool.

If I look at my models, I've already downloaded Phi-3, so I can interact with it locally using the AI Chat. I'll create a new chat and say "give me three chest exercises", and it generates a response using the large language model hosted locally in LM Studio, which is pretty cool. So rather than working through a terminal, we can use the UI within the application, which is pretty neat.

I can also start a local server for this. If I click on Local Server, I can see that I can expose it on a port that I configure; obviously, you don't want it on the same port Ollama is using if you're running both at the same time. There are some supported endpoints as well, such as chat completions and embeddings, and I can view the different models too. There's also some API documentation, which could be pretty handy when I start developing against this. So I'll start that server, and we can see some logs being emitted down below, which is pretty cool.

First things first, I'm going to try the hello-world curl example. It's a bit squished here because I'm compressing my screen resolution, but essentially I'll open up Postman and get some more real estate going. I tell the server which model I'm going to use, which is Phi-3, the one I downloaded before. I define some system messages, telling it to answer in rhymes and asking it to introduce itself, and I also set the temperature, the max tokens, and whether it's going to be a streaming response or not. If I set stream to true, it will stream the response from the large language model. So I send that, and it looks like it's taking its time, but if I open up LM Studio I can see the response being streamed through the server logs. It looks like it's completed, so if I go back into Postman, I can see it's responded with "Greetings, my name is", a delightful riddle, et cetera, et cetera, which is pretty cool.
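For reference, here's a sketch of that hello-world request as a curl command rather than Postman. It assumes LM Studio's default port of 1234 (the port is configurable, as mentioned above), and the model field should match whatever name LM Studio shows for the model you've downloaded:

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3",
    "messages": [
      { "role": "system", "content": "Always answer in rhymes." },
      { "role": "user", "content": "Introduce yourself." }
    ],
    "temperature": 0.7,
    "max_tokens": -1,
    "stream": true
  }'
```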
So now, how do we actually apply this to Semantic Kernel? Well, it's fairly straightforward. If I go back into Visual Studio, at the moment the code is pointing at my Ollama server. For a very simple Semantic Kernel application like this one (there are other use cases where this might get a little more complex), all I need to do is point that endpoint at the port LM Studio is currently serving the model on. That's the only change I need to make.

So if I run this now and ask a question, say "give me three chest exercises", it may seem like it's taking a long time, but remember it's streaming the response back to me, and you can actually configure your application to stream the response to you. The response is being streamed and generated, as you can see from the server logs, and once that's done, back in my terminal, there we go: it's given me my response. I can see barbell bench press, dumbbell flyes, and something it calls "spelt flies", which I'm pretty sure isn't supposed to be in there, but whatever. Large language models can be a little temperamental and can make mistakes, so don't trust AI fully. It also suggested push-ups as well, which is pretty cool.

So that's a really quick demonstration of how we can use large language models locally with Semantic Kernel applications. If you're in a situation where you can't access Azure OpenAI, and you don't really want to spend your own hard-earned money on the OpenAI API just to test and build Semantic Kernel applications, you can use these options, Ollama and LM Studio, to download large language models onto your local machine and use Semantic Kernel to interact with those LLMs locally, without having to purchase credits for an API or try to get access to Azure OpenAI.

I hope you enjoyed this video. If you did, give it a like. If you have any questions, please pop them in the comments section below, and as always, if you want to see more content on Semantic Kernel or anything related to Azure, don't forget to subscribe. No matter where you are in the world, I hope you're having a great day, and we will see you all next time.
Info
Channel: Will Velida
Views: 867
Keywords: AI, AI Agents, C-sharp, Machine Learning, Microsoft, Semantic Kernel, Semantic Kernel SDK, ai plugins, beginners, coding, copilot, dotnet, introduction to memories in semantic kernel, introduction to semantic kernel, introduction to vector databases, plugins, programming, sdk tutorial, semantic kernel, software engineering, tutorial, vector databases, vector storage, vectorization techniques, vectors, ollama, lm studio, lm studio vs ollama, large language models
Id: OEQDZLe3slM
Length: 14min 28sec (868 seconds)
Published: Mon Jul 08 2024