If you have a use case for generative AI, how do you decide which foundation model to pick to run it? With the huge number of foundation models out there, it's not an easy question. Different models are trained on different
data and have different parameter counts, and picking the wrong model can
have severe unwanted impacts, like biases originating from the training data
or hallucinations that are just plain wrong. Now, one approach is to just pick the largest, most massive model out
there to execute every task. The largest models have huge parameter counts and are usually pretty good generalists,
but with large models come costs: the cost of compute, the cost of complexity, and the cost of variability. So often the better approach is to pick the right
size model for the specific use case you have. So let me propose to you an
AI model selection framework. It has six pretty simple stages. Let's take a look at what they are, and then
give some examples of how this might work. Now, stage one is to clearly articulate your use case. What exactly are you planning to use generative AI for? From there, you'll list some of the
model options available to you. Perhaps there's already a subset of foundation models running that you have access to. With a short list of models, you'll next want to identify each model's size, performance, costs, risks, and deployment methods. Next, evaluate those model characteristics
for your specific use case. Run some tests. That's the next stage, testing options based on your previously
identified use case and deployment needs. And then finally, choose the option
that provides the most value. So let's put this framework to the test. Now, my use case, we're going to say, is text generation: I need the AI to write personalized emails for my awesome marketing campaign. That's stage one. Now, my organization is already using
two foundation models for other things, so I'll evaluate those. First of all, we've got Llama 2, specifically the Llama 2 70B model, a fairly large model with 70 billion parameters. It's from Meta, and I know it's quite good at some text generation use cases. Then there's also Granite, which we have deployed. Granite is a smaller general-purpose model, and that's from IBM. And I know there is a 13-billion-parameter Granite model that I've heard does quite well with text generation as well. So those are the models I'm going to evaluate: Llama 2 and Granite. Next, we need to evaluate model
size, performance, and risks. And a good place to start
here is with the model card. The model card might tell us if the model has been trained on data specifically for our purposes. Pre-trained foundation models are often fine-tuned for specific use cases such as sentiment analysis, document summarization, or maybe text generation. And that's important to know, because if a model is pre-trained on a use case close to ours, it may perform better when processing our prompts and enable us to use zero-shot prompting to obtain our desired results. And that means we can simply ask the model to perform tasks without having to provide multiple completed examples first.
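To make that concrete, here's a minimal sketch in Python of what zero-shot versus few-shot prompting could look like for our email use case. The prompt wording, customer names, and placeholder fields are illustrative assumptions, not anything prescribed by a particular model:

```python
# A minimal sketch of zero-shot vs. few-shot prompting for the email use case.
# The prompt wording and placeholder customer details are illustrative assumptions.

zero_shot_prompt = (
    "Write a short, friendly marketing email to {name}, highlighting our new "
    "product line and inviting them to the launch event."
)

few_shot_prompt = (
    "Example 1:\n"
    "Customer: Priya, interest: running shoes\n"
    "Email: Hi Priya, our new trail runners just arrived...\n\n"
    "Example 2:\n"
    "Customer: Marcus, interest: cycling gear\n"
    "Email: Hi Marcus, we thought of you when our new bike kit landed...\n\n"
    "Now write an email for:\n"
    "Customer: {name}, interest: {interest}\n"
    "Email:"
)

# If the model was trained or tuned on a task close to ours, the zero-shot prompt
# alone may be enough; otherwise the few-shot version supplies completed examples.
print(zero_shot_prompt.format(name="Ada"))
```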
Now, when it comes to evaluating model performance for our use case, we can consider three factors. The first factor that we
would consider is accuracy. Now, accuracy denotes how close
the generated output is to the desired output, and it can be measured
objectively and repeatedly by choosing evaluation metrics that
are relevant to your use cases. So, for example, if your use case relates to text translation, the BLEU benchmark, that's the Bilingual Evaluation Understudy, can be used to indicate the quality of the generated translations.
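As a rough sketch of how that measurement might be run (assuming the NLTK library and some made-up example sentences; a real evaluation would use a held-out test set):

```python
# A minimal sketch of scoring a generated translation with BLEU using NLTK.
# The sentences below are made-up examples, not real evaluation data.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()          # human reference translation
candidate = "the cat is sitting on the mat".split()   # model-generated translation

# BLEU measures n-gram overlap between the candidate and the reference(s);
# smoothing keeps short sentences from scoring zero on higher-order n-grams.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the reference
```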
Now, the second factor relates to the reliability of the model. That's a function of several factors, actually, such as consistency, explainability, and trustworthiness, as well as how well a model avoids toxicity like hate speech. Reliability comes down to trust, and trust is built through transparency and traceability of the training data, and the accuracy and reliability of the output. And then the third factor is speed: specifically, how quickly does a user get a response to a submitted prompt? Now, speed and accuracy are often a trade-off here. Larger models may be slower, but
perhaps deliver a more accurate answer. Or then again, maybe the smaller model is faster and shows minimal difference in accuracy compared to the larger model. It really comes down to finding the sweet spot between performance, speed, and cost. A smaller, less expensive model may not offer performance or accuracy metrics on par with an expensive one, but it may still be preferable once you consider the additional benefits it might deliver, like lower latency and greater transparency into the model's inputs and outputs. The way to find out is to simply select the model that's likely to deliver the desired output and, well, test it. Test that model with your prompts to see if it works, and then assess the model's performance and the quality of the output using metrics.
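As a rough sketch of that kind of side-by-side test, the snippet below runs the same prompts through each candidate and records latency alongside the outputs for later scoring. The generate stub and the model names are hypothetical placeholders for whatever inference API and deployments you actually have:

```python
# A rough sketch of testing candidate models side by side on the same prompts.
# `generate` is a hypothetical stub standing in for your real inference client,
# and the model names are placeholders for whatever you have deployed.
import time

def generate(model_name: str, prompt: str) -> str:
    # Replace this stub with a real call to your inference endpoint.
    return f"[{model_name} output for: {prompt[:40]}...]"

candidates = ["llama-2-70b", "granite-13b"]
prompts = [
    "Write a personalized marketing email for a customer interested in hiking gear.",
    "Write a follow-up email thanking a customer for attending our product launch.",
]

results = []
for model in candidates:
    for prompt in prompts:
        start = time.perf_counter()
        output = generate(model, prompt)
        latency = time.perf_counter() - start
        results.append({"model": model, "latency_s": round(latency, 3),
                        "prompt": prompt, "output": output})

# Review the raw outputs, then score them with the metrics chosen earlier
# (accuracy against the desired output, reliability checks, and response speed).
for row in results:
    print(row["model"], row["latency_s"], row["output"][:60])
```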
Now, I've mentioned deployment in passing, so a quick word on that. As a decision factor, we need to evaluate where and how we want the model and data to be deployed. So let's say that we're leaning towards Llama 2 as our chosen model based on our testing. Right, cool. Llama 2. That's an open-source model, and we could run inference with it on a public cloud. So we've already got a public cloud out there, where our option is essentially limited to running inference against the hosted model. But if we decide we want to fine-tune
the model with our own enterprise data, we might need to deploy it on premises. So this is where we have our own version of Llama 2, and we're going to fine-tune it. Now, deploying on premises gives you greater control and more security benefits compared to a public cloud environment. But it's an expensive proposition, especially when factoring in
model size and compute power, including the number of GPUs it takes
to run a single large language model.
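To make the trade-off concrete, here's a minimal sketch of what the two options could look like from the application's side. The endpoint URLs, payload shape, and token are hypothetical placeholders rather than any real provider's API; the point is that only the target of the request, and who controls the infrastructure behind it, changes:

```python
# A minimal sketch contrasting inference against a hosted model on a public cloud
# with inference against our own fine-tuned Llama 2 deployed on premises.
# The URLs, payload shape, and token are hypothetical placeholders.
import requests

PUBLIC_CLOUD_URL = "https://cloud-provider.example.com/v1/llama-2-70b/generate"
ON_PREM_URL = "http://llm-gateway.internal:8080/v1/generate"  # our fine-tuned Llama 2

def generate(url: str, prompt: str, token: str = "") -> str:
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    resp = requests.post(url, json={"prompt": prompt, "max_tokens": 300},
                         headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json().get("text", "")

# Public cloud: limited to inference against the provider-hosted model.
# generate(PUBLIC_CLOUD_URL, "Write a personalized marketing email...", token="...")

# On premises: same call shape, but against a model we control and have fine-tuned.
# generate(ON_PREM_URL, "Write a personalized marketing email...")
```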
Now, everything we've discussed here is tied to a specific use case, but of course it's quite likely that any given organization will have multiple use cases. And as we run through this model selection framework, we might find that each use case is better suited to a different foundation model. That's called a multi-model approach. Essentially, not all AI models are the same, and neither are your use cases. And this framework might be just
what you need to pair the models and the use cases together to find
a winning combination of both.