Insanely Fast LLAMA-3 on Groq Playground and API for FREE

Captions
Okay, so this is the actual speed of generation: we're getting more than 800 tokens per second, which is crazy — I haven't seen anything like this before. Since the release of Llama 3 earlier this morning, a lot of companies have been integrating it into their platforms. The one I'm personally most excited about is Groq Cloud, because they have the fastest inference speed currently available on the market. They just integrated Llama 3 into both their playground and their API, so you now have both the 70-billion and the 8-billion-parameter versions available. I'll show you how to get started with it in the playground as well as on the API, in case you're building your own applications on top of it.

In the playground we can select the Llama 3 models, so let's start with the 70-billion model. I'm going to use this prompt as a test. We don't really care about the quality of the responses in this video, only about the speed of inference. The prompt is: "I have a flask for two gallons and one for four gallons. How do I measure six gallons?" It has probably seen this prompt in its training data. Here's the speed of inference, which was crazy fast: it took about half a second, and the speed of generation is around 300 tokens per second — and we're talking about the bigger model here, so this is pretty great.

Okay, now let's test the same prompt on the 8-billion model and see how the response goes. This time it was about 800 tokens per second, and it took a fraction of a second, so this is pretty great. Now let's see what happens if we ask it to generate longer text, because as the model generates longer text it's going to take more time — but let's see if that has any impact on the number of tokens per second. Here we're asking it to write a 500-word essay on the importance of open-source AI models. First I'm going to use the 8-billion model, and here's the essay: the number of tokens per second is pretty much the same, which is pretty impressive. Next let's look at the 70-billion model, and after this I'll show you how to use the API. Okay, this was real-time speed. It's definitely not 5,000 words — probably somewhere around a couple of thousand words — but the speed of generation is pretty consistent, so this is awesome. You can also include a system message if you want. You usually use the playground to test the model and your prompts, and once you're happy with them and want to integrate the model into your own applications, you move on to the Groq API so you can start serving your users.

Okay, so I put together this Google Colab notebook to show you how to use Groq in your own applications through the Groq API. First we need the Python client, so we run pip install groq. Next we need to provide our own API key: go to the playground, click on API Keys, and create a new key. I already have existing API keys, so I'm going to use those. Since I'm using Google Colab, I stored my API key as a secret and enabled access to that specific key for this notebook. Now we import the Groq client and create it with the Groq constructor, passing in our API key; since I'm reading the key directly from the secrets in the Colab notebook, I use the userdata helper from the Google Colab client. So that's how you set up the client.
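As a rough sketch, the client setup described above looks something like this in a Colab cell — note that the secret name GROQ_API_KEY is an assumption; use whatever name you gave the key in Colab's secrets panel:

    # Install the official Groq Python client first:
    #   pip install groq

    from groq import Groq
    from google.colab import userdata  # Colab helper for reading notebook secrets

    # Read the key stored in Colab's secrets panel and create the client.
    # "GROQ_API_KEY" is a placeholder secret name; match it to your own.
    client = Groq(api_key=userdata.get("GROQ_API_KEY"))

Outside of Colab, you could instead export the key as the GROQ_API_KEY environment variable, which the client picks up automatically if you call Groq() with no arguments.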
Now let's see how you do inference. It's pretty simple and straightforward: we're going to use the chat completions endpoint. We create a new message — right now we're just using the role of "user", so the user is asking a question — and the prompt is: "Explain the importance of low-latency LLMs. Explain it in the voice of Jon Snow." You can also add a system message; I'll show you how to do that later. Then you need to provide the name of the model. As of the recording of this video, the documentation page lists only the Llama 2 family under supported models; however, Sundeep, who is the head of Groq Cloud, was kind enough to point out to me that Llama 3 is available. So in this case I just used that model, the Llama 3 70B, and, following the exact same naming format, included the context length in the model ID — and this seems to be working. By the time this video is released, they will probably have updated the documentation. Let me show you the actual speed of inference, which is pretty crazy. I'm going to run this: it will create the message, send it to the API, get a response, and then Python will print it here. Here's the actual speed of generation we get — I think it's under a second. Pretty crazy that you can do this through an API, and we're running a 70-billion-parameter model.

All right, next let's see how you add a system message. Here I'm adding a system message to the message flow: the role is "system", and we're saying "You are a helpful assistant. Answer as Jon Snow." The rest of the prompt is exactly the same as before, and we're selecting Llama 3 as the model. You can also pass some extra parameters: for example, you can set the temperature, which controls the creativity of the token selection, and you can pass max_tokens, the maximum number of tokens the model can generate. These are optional parameters. With that system role, here's the actual speed of generation — again, pretty fast.

Usually you want to add streaming so the user isn't just sitting there waiting for the full response, although Groq has figured out how to do insanely fast inference anyway. If you want streaming, it's also possible. The structure is the same as before; the only additional thing you need to do is enable streaming by setting stream to true, which in this case gives you a streaming response. When you're streaming, you get chunks of text one at a time, so we basically take each chunk, print it to the output, wait for the next chunk to arrive, print that, and so on. Here's the actual streaming speed — let me run that again. Okay, this is pretty fast, and I think we all know Groq is known for this, but this is probably the fastest Llama 3 inference currently available on the market.
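Putting the pieces from this section together, here is a minimal sketch of the calls described above — a chat completion with a system message and the optional temperature/max_tokens parameters, followed by the streaming variant. The prompt text mirrors the video; the parameter values are illustrative assumptions:

    from groq import Groq

    client = Groq()  # assumes GROQ_API_KEY is set in the environment

    # Chat completion with a system message and optional parameters.
    completion = client.chat.completions.create(
        model="llama3-70b-8192",  # Llama 3 70B; the ID also encodes the 8192-token context length
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Answer as Jon Snow."},
            {"role": "user", "content": "Explain the importance of low latency LLMs."},
        ],
        temperature=0.7,  # optional: higher values make token selection more varied
        max_tokens=1024,  # optional: cap on the number of generated tokens
    )
    print(completion.choices[0].message.content)

    # Streaming variant: identical structure, plus stream=True.
    stream = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[{"role": "user", "content": "Explain the importance of low latency LLMs."}],
        stream=True,
    )
    for chunk in stream:
        # Each chunk carries a small delta of text; print it as it arrives.
        print(chunk.choices[0].delta.content or "", end="")

The 8-billion model works the same way — just swap the model ID for "llama3-8b-8192".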
Okay, a couple of other things: both the playground and the API are currently available for free, so you can use this in your applications at no cost. They will probably introduce a paid tier pretty soon, but since it's free, there are rate limits on the number of tokens you can generate, so make sure to look at those. I'll be creating a lot more content around both Llama 3 and Groq, so if you're interested, make sure you subscribe to the channel. I think they're also working on integrating support for Whisper on Groq; when that's implemented, I think it will open up the possibility of a whole new generation of applications, so I'm really looking forward to it. I hope you found this video useful. Thanks for watching, and as always, see you in the next one.
Info
Channel: Prompt Engineering
Views: 20,554
Keywords: prompt engineering, Prompt Engineer, LLMs, AI, Artificial Intelligence, Llama, GPT-4, fine-tuning LLMs, Groq API
Id: ySwJT3Z1MFI
Length: 8min 54sec (534 seconds)
Published: Sat Apr 20 2024