How to Pick the Right AI Foundation Model

Captions
If you have a use case for generative AI, how do you decide which foundation model to pick to run it? With the huge number of foundation models out there, it's not an easy question. Different models are trained on different data and have different parameter counts, and picking the wrong model can have serious unwanted impacts, like biases originating from the training data or hallucinations that are just plain wrong.

Now, one approach is to just pick the largest, most massive model out there to execute every task. The largest models have huge parameter counts and are usually pretty good generalists, but with large models come costs: the cost of compute, the cost of complexity, and the cost of variability. So often the better approach is to pick the right-size model for the specific use case you have.

So let me propose to you an AI model selection framework. It has six pretty simple stages. Let's take a look at what they are and then give some examples of how this might work.

Stage one is to clearly articulate your use case. What exactly are you planning to use generative AI for? From there, you'll list some of the model options available to you. Perhaps there is already a subset of foundation models running that you have access to. With a shortlist of models, you'll next want to identify each model's size, performance, costs, risks, and deployment methods. Next, evaluate those model characteristics for your specific use case. Then run some tests: that's the next stage, testing options based on your previously identified use case and deployment needs. And then finally, choose the option that provides the most value.

So let's put this framework to the test. My use case, we're going to say, is text generation: I need the AI to write personalized emails for my awesome marketing campaign. That's stage one.

Now, my organization is already using two foundation models for other things, so I'll evaluate those. First of all, we've got Llama 2, and specifically the Llama 2 70B model, a fairly large model with 70 billion parameters. It's from Meta, and I know it's quite good at some text generation use cases. Then there's also Granite, which we have deployed. Granite is a smaller general-purpose model, and that's from IBM. And I know there is a 13 billion parameter version that I've heard does quite well with text generation as well. So those are the models I'm going to evaluate: Llama 2 and Granite.

Next, we need to evaluate model size, performance, and risks. A good place to start here is with the model card. The model card might tell us if the model has been trained on data specifically for our purposes. Pre-trained foundation models are fine-tuned for specific use cases such as sentiment analysis, document summarization, or maybe text generation. That's important to know, because if a model is pre-trained on a use case close to ours, it may perform better when processing our prompts and enable us to use zero-shot prompting to obtain our desired results. That means we can simply ask the model to perform tasks without having to provide multiple completed examples first.

Now, when it comes to evaluating model performance for our use case, we can consider three factors. The first factor is accuracy. Accuracy denotes how close the generated output is to the desired output, and it can be measured objectively and repeatedly by choosing evaluation metrics that are relevant to your use case.
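As an aside, the zero-shot versus few-shot distinction mentioned a moment ago is easy to picture with a short sketch. This is a minimal, hypothetical example: the generate function is just a placeholder for whatever inference endpoint your deployed model exposes, not a specific vendor API.

```python
# Minimal sketch of zero-shot vs. few-shot prompting.
# `generate` is a hypothetical placeholder for your model's inference call.

def generate(prompt: str) -> str:
    raise NotImplementedError("Call your model's inference endpoint here")

# Zero-shot: just ask for the task, with no completed examples.
zero_shot_prompt = (
    "Write a short, friendly marketing email inviting the customer "
    "to our spring sale. Customer name: Priya. Product: running shoes."
)

# Few-shot: prepend one or more completed examples so the model can
# imitate the desired style. Only needed if zero-shot output falls short.
few_shot_prompt = (
    "Example:\n"
    "Customer name: Alex. Product: headphones.\n"
    "Email: Hi Alex, our new headphones just arrived and we saved you a pair...\n\n"
    "Now write a short, friendly marketing email.\n"
    "Customer name: Priya. Product: running shoes.\n"
    "Email:"
)

# If the model was pre-trained or fine-tuned on a use case close to ours,
# the zero-shot prompt alone may be enough:
# print(generate(zero_shot_prompt))
```

The point is simply that a model already suited to your use case lets you skip the extra prompt engineering and example curation that few-shot prompting requires.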
Back to accuracy. If your use case is related to text translation, for example, BLEU (the Bilingual Evaluation Understudy benchmark) can be used to indicate the quality of the generated translations.

Now, the second factor relates to the reliability of the model. That's actually a function of several things, such as consistency, explainability, and trustworthiness, as well as how well a model avoids toxicity like hate speech. Reliability comes down to trust, and trust is built through transparency and traceability of the training data, and accuracy and reliability of the output.

And then the third factor is speed: specifically, how quickly does a user get a response to a submitted prompt? Now, speed and accuracy are often a trade-off here. Larger models may be slower but perhaps deliver a more accurate answer. Or then again, maybe the smaller model is faster and has minimal differences in accuracy from the larger model. It really comes down to finding the sweet spot between performance, speed, and cost. A smaller, less expensive model may not offer performance or accuracy metrics on par with an expensive one, but it could still be preferable once you consider the additional benefits it might deliver, like lower latency and greater transparency into the model's inputs and outputs.

The way to find out is to simply select the model that's likely to deliver the desired output and, well, test it. Test that model with your prompts to see if it works, and then assess the model's performance and the quality of its output using metrics.

Now, I've mentioned deployment in passing, so a quick word on that. As a decision factor, we need to evaluate where and how we want the model and data to be deployed. So let's say that we're leaning towards Llama 2 as our chosen model based on our testing. Right, cool, Llama 2. That's an open source model, and we could inference with it on a public cloud. So we've got a public cloud already out here. That gives us an element of choice, but a limited one: we can just run inference against it. But if we decide we want to fine-tune the model with our own enterprise data, we might need to deploy it on prem. So this is where we have our own version of Llama 2 and we apply fine-tuning to it. Now, deploying on premises gives you greater control and more security benefits compared to a public cloud environment. But it's an expensive proposition, especially when factoring in model size and compute power, including the number of GPUs it takes to run a single large language model.

Now, everything we've discussed here is tied to a specific use case, but of course it's quite likely that any given organization will have multiple use cases. And as we run through this model selection framework, we might find that each use case is better suited to a different foundation model. That's called a multi-model approach. Essentially, not all AI models are the same, and neither are your use cases. And this framework might be just what you need to pair the models and the use cases together to find a winning combination of both.
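To make the accuracy-measurement step above concrete, here is a minimal sketch of scoring generated translations with BLEU. It assumes the sacrebleu Python package is installed, and the sentences are made-up examples rather than real model output.

```python
# Minimal sketch: scoring generated translations with BLEU,
# assuming the `sacrebleu` package (pip install sacrebleu) is available.
import sacrebleu

# Outputs produced by a candidate model for a small test set (made up here).
hypotheses = [
    "The cat sits on the mat.",
    "Thank you for your order, it ships tomorrow.",
]

# One set of reference translations, aligned line-for-line with the hypotheses.
references = [[
    "The cat is sitting on the mat.",
    "Thanks for your order; it will ship tomorrow.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # 0-100 scale; higher means closer to the references
```

Scoring the outputs from each shortlisted model the same way gives a repeatable, objective number to weigh against the speed and cost factors discussed above.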
Info
Channel: IBM Technology
Views: 35,533
Keywords: IBM, IBM Cloud, watsonx, hugging face, ai models, artificial intelligence, artificial intelligence models, Llama 2, Granite, Model size, Generative AI, text generation
Id: pePAAGfh-IU
Length: 7min 54sec (474 seconds)
Published: Fri Feb 09 2024