Getting Started with Groq API | Making Near Real Time Chatting with LLMs Possible

Video Statistics and Information

Captions
You probably remember Groq, the company building Language Processing Units (LPUs) for fast inference of LLMs. Well, they just started rolling out API access to developers. They were claiming nearly 500 tokens per second for the Mixtral MoE model, and in my testing that seems to be true. In this video I'll show you how to access the API for free, and then I'll show you a couple of example use cases in which we're going to build a chatbot, and this thing is crazy fast; more on that later in the video.

In order to get access to the API, you'll need to go to groq.com and log in with your email, or in my case with my Google account. Apart from the API, they also opened up their playground, where you can currently test two models: one is the Llama 2 70B model and the other is the Mixtral model. They also provide really detailed documentation (more on this in a little bit), and you will be able to create your API keys there.

Before working with the API, let's have a quick look at the playground. You can provide your system message here, then the user input here, and you have two different options for the model; for our experiments we're going to be using Mixtral. My system message is "You are a helpful assistant. Answer as Jon Snow." The user input is "Explain the importance of low latency LLMs." Since we are testing the speed of the Groq API, I think this is an appropriate question. You can also set other parameters that control the behavior of the model: the temperature, the maximum number of new tokens it is supposed to generate, top-p, and any stop sequences you want to include. Hit submit, and this is the real-time speed that you're going to get from the API. Let's have a quick look at the response: "Well, greetings. As Jon Snow, I am not an expert in low latency large language models, but I can certainly try to explain their importance in a simple manner," and then it goes on to generate the rest of the answer. In this video we are not really interested in the accuracy of the responses, only in the speed, so I'm not going to go over the responses the model generates.

Let's say you settle on parameters for the model behavior. After that, you can click on the "View code" button, which gives you Python code showing how to start calling this API. There is also code available for calling the API from JavaScript, or even as raw JSON.

Next, I'm going to walk you through a few examples. For that we need to create a new API key, so click on "Create API Key" and provide a name; let's call this one "Groq one." If you hit submit, it will show you your API key. Copy this key to a secure location. I am going to delete this one, because I'm going to be using an existing API key that I have.

Let me first walk you through the basic structure of what working with Groq looks like. I'll walk you through a Google Colab, but before that we need to install the package, so we're going to run `pip install groq`. After that, in order to use the package, we need to import it: we import the `os` module, then we import the `groq` package. Next we need to create a client; this is the Groq client that we're going to be using, and you will need to provide your API key. Since I am using a notebook, I'll have to set the environment variable within the notebook; I'll show you that in a bit. Once you create the client, the rest of the process is very similar to calling the OpenAI API: we use the chat completions endpoint and create a new message, where we define the role. The role here is "user," so this is directly interacting with the model; later I'll show you how to define the system role as well, so that you can provide a system message. Next you provide the prompt from the user, and after that you select the model you want to use; in this case we're using the Mixtral MoE model. Once you make the call, you can get the model's response through the `choices` attribute, then the message, and then the content of the message. That's how the basic structure of the API usage looks.
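For reference, here is a minimal sketch of that basic call, assuming the `groq` Python package and a `GROQ_API_KEY` environment variable; the model ID shown is the Mixtral ID used at the time of the video and may change over time.

```python
import os

from groq import Groq

# Create the client; the API key is read from an environment variable here.
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

# Call the chat completions endpoint with a single user message.
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Explain the importance of low latency LLMs."},
    ],
    model="mixtral-8x7b-32768",  # Mixtral MoE model ID on Groq at the time of the video
)

# The generated text lives in choices -> message -> content.
print(chat_completion.choices[0].message.content)
```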
Here's the Google Colab that I'm using; it contains exactly the code I just walked you through, but let's first set an environment variable. Click on the Secrets button, then add a new secret. Here I have provided my Groq API key: you give the name of the secret (the environment variable) and then the corresponding value. Make sure to toggle the notebook access button so that your notebook can actually see the key.

This is exactly the same code we saw. Now let me show you the speed of generation in real time: I'm going to click this button and it will start generating the response. It kind of feels like streaming, but this whole thing was generated all at once. This is actually crazy fast; I haven't seen anything like it before. Later I'll show you how to enable streaming as well, although for this you don't really need it.

Now let's look at some other options. How do you add a system message? You simply provide an additional message with the role "system." This becomes our system message, and we are asking the model to act as Jon Snow; the user message stays the same. There are quite a few other options you can also set. For example, the temperature controls the creativity or randomness of the output; `max_tokens` is the maximum number of tokens the model is allowed to generate; top-p controls the sampling mechanism through which the output is generated; you can define any specific stop sequences; and you can choose whether or not to stream the responses. If you enable streaming, you will need to change the way you consume the output, and I'm going to show you that later in the video. We are going to use the same prompt, "Explain the importance of low latency LLMs," and let me show you the real-time response we get. Here's the speed at which it generated the output, which is pretty amazing, and it's also actually sticking to the character we asked for, so that's pretty nice as well.

Next, let's look at streaming responses. We're using the same structure again, and I'm reusing the client I created earlier rather than creating another one. The difference is that we enabled streaming, which means the model is not going to return the whole response at once; it returns it in chunks. So we take one chunk at a time and show it to the user, and that's why the printing mechanism is different from before. Now let's look at the speed at which it streams data: this is real-time streaming. As you can see, this opens up so many possibilities. For example, you could have speech communication with the LLM: a speech-to-text model converts your speech into text, you feed that through the Groq API, get a response in near real time, and then convert it back from text to speech using another model.
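The following sketch pulls those options together: a system message, the common generation parameters, and `stream=True`, with the printing loop adjusted to consume chunks. The parameter values are illustrative, not taken from the video.

```python
import os

from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

stream = client.chat.completions.create(
    messages=[
        # The system role carries the persona / instructions.
        {"role": "system", "content": "You are a helpful assistant. Answer as Jon Snow."},
        {"role": "user", "content": "Explain the importance of low latency LLMs."},
    ],
    model="mixtral-8x7b-32768",
    temperature=0.5,   # randomness / creativity of the output
    max_tokens=1024,   # upper bound on the number of generated tokens
    top_p=1,           # nucleus sampling cutoff
    stop=None,         # no stop sequences in this example
    stream=True,       # return the response in chunks instead of all at once
)

# With streaming enabled, print each chunk as it arrives.
for chunk in stream:
    print(chunk.choices[0].delta.content, end="")
```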
Next we will look at how to use stop sequences. In this case we want the model to stop generating if it encounters a six in its output, and the prompt is "Count to 10. Your response must begin with 1." We also give it an example of what the generation is supposed to look like. Stop sequences are really helpful when you want to interrupt the model's generation in the middle, and that is exactly what happens here: although it's supposed to count to 10, as soon as it reaches six it stops generating.

Let's look at a real example use case: summarization. Here is an essay from Paul Graham, "How to Do Great Work." I copied this essay into a Google Doc and it's about 27 pages. I created a variable called `text`, pasted all the text into it, and we're going to ask Groq to summarize it. Since we're using the Mixtral model, which has a context window of 32,000 tokens, this should be good enough. The system prompt is: "You are a helpful assistant. Your job is to identify main themes in the given text and create a summary. Provide the summary in 10 bullet points." The user content is the text we copied, and here I want to stream the response, so let's run this. This is real time: as you can see, it was pretty fast. Let's run it again just to give you a sense of how fast it is. Whenever this yellow arrow sign appears, that's when it actually starts generating the response; so now it's sending the text to the model, and here's the response. If you run this multiple times you will get different responses, but the summary does seem to convey the ideas from the essay.

Another thing I have noticed is a "None" at the end whenever we use the streaming API. Let me show you another example: whenever I stream, I always get this "None" at the end; it might be something the model is using as a stop token, and you can see it repeatedly. When I use the same prompt with the chat completions endpoint without streaming, I don't see that "None." So if you're doing streaming, just make sure you are aware of that "None" value.
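Here is a short sketch of both points just discussed: the `stop` parameter for the counting example, and a guard against the `None` that shows up in the final streamed chunk (it is the end-of-stream delta rather than model output). The exact stop string and the placeholder `text` variable are assumptions.

```python
import os

from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
text = "..."  # placeholder for the full essay text pasted into the notebook

# Stop sequences: generation halts before the stop string would be emitted,
# so the model never gets past five.
completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Count to 10. Your response must begin with 1. Example: 1, 2, 3, ..."},
    ],
    model="mixtral-8x7b-32768",
    stop=", 6",  # assumed stop string for the counting example
)
print(completion.choices[0].message.content)

# Streaming summarization: the last chunk's delta has no content,
# so filter out None instead of printing it.
stream = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Your job is to identify "
                                      "main themes in the given text and create a summary. "
                                      "Provide the summary in 10 bullet points."},
        {"role": "user", "content": text},
    ],
    model="mixtral-8x7b-32768",
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```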
In this last example I want to show you how to use the Groq API with Streamlit. The code you see here is an example that Groq has provided in their GitHub repo, and it creates a Streamlit app that lets you chat with the Groq API. The only change I made is to the requirements file: theirs lists close to 50 different packages, but you don't need all of them, just these five.

Let me walk you through the code step by step (a condensed sketch of this app appears below). First we import all the required packages, which include Streamlit, Groq, and LangChain. We want the chatbot to remember the previous conversation, so we use ConversationBufferWindowMemory, which restricts how many previous turns the bot can remember. Next we load our API key from a .env file that stores the Groq API key. Then we have our main function: we load the Groq API key, and there is some cosmetic setup, so the title is "Chat with Groq" along with a line of text you will see in the app. Next come a couple of customization options: you can choose different models, currently either the Mixtral model or the Llama 2 70B model, and the user can also control how many previous turns the model remembers, anywhere from 1 to 10. After that we define our ConversationBufferWindowMemory object. Next there is a user input box: the user asks a question, and based on the chosen memory length we add it to the history. Then we create our chat object by passing the API key and the model name to ChatGroq, which internally uses the same schema I showed you before to interact with the API, and we create our conversation by passing the LLM and the memory into a ConversationChain, which is built with LangChain. Finally, we simply check whether there is a new user question in the text box; if so, we run the chain, get a response, show it to the user, and add it to the history. That's pretty much it.

We're going to create a new virtual environment, so we use `conda create -n` followed by the name of the environment; in this case I'm calling it "youtube," since I want to use the same environment for all my YouTube videos. Then we specify the Python version we want to use. I already have this environment, so I'm going to say no, but I do want to install the required packages, and for that we run `pip install -r requirements.txt`. There are only five packages, as I said, so this is pretty quick. After that we run the app with `streamlit run groq_chat_app.py`, and this launches it.

Here's the app. On the left-hand side there is a drop-down menu where you can choose between the Llama and Mixtral models, and you can also change the conversational memory length. We're going to ask a machine learning related question: "What is an SVM?" I have seen some issues with this Streamlit app; sometimes it's pretty slow to respond. Here's the response from the model: it talks about SVMs and provides a simple explanation, so it seems to be working. But I have also seen this Streamlit app sometimes not work at all; there might be some issues with the LangChain integration, so I will create a follow-up video and explore it further.

So play around with the API; it's free for the time being, and as we saw in this video, it's extremely fast. If you need help building applications on the Groq API, you can reach out; details are in the video description. I will be creating more videos on the Groq API, because it really does enable real-time conversation with these LLMs. If you're working on an LLM related project, I offer consulting and advising services, both to startups and individuals; check out the video description if that's something you're interested in. I hope you found this video useful. Thanks for watching, and as always, see you in the next one.
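For reference, here is a condensed sketch of the Streamlit chat app described in the walkthrough above, assuming the `langchain-groq` integration package; the file name, widget labels, session-state keys, and model IDs are illustrative rather than taken verbatim from Groq's repository.

```python
# groq_chat_app.py (hypothetical file name) - condensed sketch of the chat app
import os

import streamlit as st
from dotenv import load_dotenv
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain_groq import ChatGroq

load_dotenv()  # expects GROQ_API_KEY in a .env file
groq_api_key = os.environ["GROQ_API_KEY"]

st.title("Chat with Groq")

# Sidebar customization: which model to use and how many past turns to remember.
model = st.sidebar.selectbox(
    "Choose a model", ["mixtral-8x7b-32768", "llama2-70b-4096"]
)
memory_length = st.sidebar.slider("Conversational memory length", 1, 10, 5)

memory = ConversationBufferWindowMemory(k=memory_length)

# Replay stored turns into memory so the chain keeps the prior context.
if "chat_history" not in st.session_state:
    st.session_state.chat_history = []
for turn in st.session_state.chat_history:
    memory.save_context({"input": turn["human"]}, {"output": turn["AI"]})

# ChatGroq wraps the same chat completions API shown earlier.
chat = ChatGroq(groq_api_key=groq_api_key, model_name=model)
conversation = ConversationChain(llm=chat, memory=memory)

user_question = st.text_area("Ask a question:")
if user_question:
    response = conversation(user_question)["response"]
    st.session_state.chat_history.append({"human": user_question, "AI": response})
    st.write(response)
```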
Info
Channel: Prompt Engineering
Views: 29,412
Keywords: prompt engineering, Prompt Engineer, LLMs, AI, artificial Intelligence, Llama, GPT-4, fine-tuning LLMs, Groq AI, LPUs, Language processing Units
Id: S53BanCP14c
Length: 16min 19sec (979 seconds)
Published: Thu Feb 29 2024