If you have a use case for generative AI, how do you decide which foundation model to pick to run it? With the huge number of foundation models out there, it's not an easy question. Different models are trained on different
data and have different parameter counts, and picking the wrong model can
have severe unwanted impacts, like biases originating from the training data
or hallucinations that are just plain wrong. Now, one approach is to just pick the largest, most massive model out
there to execute every task. The largest models have huge parameter counts and are usually pretty good generalists,
but with large models come costs: the cost of compute, the cost of complexity, and the cost of variability. So often the better approach is to pick the right
size model for the specific use case you have. So let me propose to you an
AI model selection framework. It has six pretty simple stages. Let's take a look at what they are, and then
give some examples of how this might work. Now, stage one is to clearly articulate your use case. What exactly are you planning to use generative AI for? From there, you'll list some of the
model options available to you. Perhaps there's already a subset of foundation models running that you have access to. With a short list of models, you'll next want to identify each model's size, performance, costs, risks, and deployment methods. Next, evaluate those model characteristics
for your specific use case. Run some tests. That's the next stage, testing options based on your previously
identified use case and deployment needs. And then finally, choose the option
that provides the most value. So let's put this framework to the test. Now, my use case, we're going to say, is text generation: I need the AI to write personalized emails for my awesome marketing campaign. That's stage one. Now, my organization is already using
two foundation models for other things, so I'll evaluate those. First of all, we've got Llama 2, specifically the Llama 2 70B model, a fairly large model with 70 billion parameters. It's from Meta, and I know it's quite good at some text generation use cases. Then there's also Granite, which we have deployed. Granite is a smaller general-purpose model, and that's from IBM. And I know there is a 13-billion-parameter Granite model that I've heard does quite well with text generation as well. So those are the models I'm going to evaluate: Llama 2 and Granite. Next, we need to evaluate model
size, performance, and risks. And a good place to start
here is with the model card. The model card might tell us if the model has been trained on data specifically for our purposes. Pre-trained foundation models are often fine-tuned for specific use cases such as sentiment analysis, document summarization, or maybe text generation. And that's important to know, because if a model is pre-trained on a use case close to ours, it may perform better when processing our prompts and enable us to use zero-shot prompting to obtain our desired results. And that means we can simply ask the model to perform tasks without having to provide multiple completed examples first.
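To make that concrete, here's a minimal sketch in Python of what zero-shot versus few-shot prompting could look like for our email use case. The prompt wording, customer names, and placeholder fields are illustrative assumptions, not anything prescribed by a particular model:

```python
# A minimal sketch of zero-shot vs. few-shot prompting for the email use case.
# The prompt wording and placeholder customer details are illustrative assumptions.

zero_shot_prompt = (
    "Write a short, friendly marketing email to {name}, highlighting our new "
    "product line and inviting them to the launch event."
)

few_shot_prompt = (
    "Example 1:\n"
    "Customer: Priya, interest: running shoes\n"
    "Email: Hi Priya, our new trail runners just arrived...\n\n"
    "Example 2:\n"
    "Customer: Marcus, interest: cycling gear\n"
    "Email: Hi Marcus, we thought of you when our new bike kit landed...\n\n"
    "Now write an email for:\n"
    "Customer: {name}, interest: {interest}\n"
    "Email:"
)

# If the model was trained or tuned on a task close to ours, the zero-shot prompt
# alone may be enough; otherwise the few-shot version supplies completed examples.
print(zero_shot_prompt.format(name="Ada"))
```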
Now, when it comes to evaluating model performance for our use case, we can consider three factors. The first factor that we
would consider is accuracy. Now, accuracy denotes how close
the generated output is to the desired output, and it can be measured
objectively and repeatedly by choosing evaluation metrics that
are relevant to your use cases. So, for example, if your use case relates to text translation, the BLEU benchmark, that's the Bilingual Evaluation Understudy, can be used to indicate the quality of the generated translations.
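As a rough sketch of how that measurement might be run (assuming the NLTK library and some made-up example sentences; a real evaluation would use a held-out test set):

```python
# A minimal sketch of scoring a generated translation with BLEU using NLTK.
# The sentences below are made-up examples, not real evaluation data.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()          # human reference translation
candidate = "the cat is sitting on the mat".split()   # model-generated translation

# BLEU measures n-gram overlap between the candidate and the reference(s);
# smoothing keeps short sentences from scoring zero on higher-order n-grams.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the reference
```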
Now, the second factor relates to the reliability of the model. That's a function of several factors, actually, such as consistency, explainability, and trustworthiness, as well as how well a model avoids toxicity like hate speech. Reliability comes down to trust, and trust is built through transparency and traceability of the training data, and the accuracy and reliability of the output. And then the third factor is speed: specifically, how quickly does a user get a response to a submitted prompt? Now, speed and accuracy are often a trade-off here. Larger models may be slower, but
perhaps deliver a more accurate answer. Or then again, maybe the smaller model is faster and shows minimal difference in accuracy compared to the larger model. It really comes down to finding the sweet spot between performance, speed, and cost. A smaller, less expensive model may not offer performance or accuracy metrics on par with an expensive one, but it may still be preferable once you consider the additional benefits it might deliver, like lower latency and greater transparency into the model's inputs and outputs. The way to find out is to simply select the model that's likely to deliver the desired output and, well, test it. Test that model with your prompts to see if it works, and then assess the model's performance and the quality of the output using metrics.
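As a rough sketch of that kind of side-by-side test, the snippet below runs the same prompts through each candidate and records latency alongside the outputs for later scoring. The generate stub and the model names are hypothetical placeholders for whatever inference API and deployments you actually have:

```python
# A rough sketch of testing candidate models side by side on the same prompts.
# `generate` is a hypothetical stub standing in for your real inference client,
# and the model names are placeholders for whatever you have deployed.
import time

def generate(model_name: str, prompt: str) -> str:
    # Replace this stub with a real call to your inference endpoint.
    return f"[{model_name} output for: {prompt[:40]}...]"

candidates = ["llama-2-70b", "granite-13b"]
prompts = [
    "Write a personalized marketing email for a customer interested in hiking gear.",
    "Write a follow-up email thanking a customer for attending our product launch.",
]

results = []
for model in candidates:
    for prompt in prompts:
        start = time.perf_counter()
        output = generate(model, prompt)
        latency = time.perf_counter() - start
        results.append({"model": model, "latency_s": round(latency, 3),
                        "prompt": prompt, "output": output})

# Review the raw outputs, then score them with the metrics chosen earlier
# (accuracy against the desired output, reliability checks, and response speed).
for row in results:
    print(row["model"], row["latency_s"], row["output"][:60])
```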
Now, I've mentioned deployment in passing, so a quick word on that. As a decision factor, we need to evaluate where and how we want the model and data to be deployed. So let's say that we're leaning towards Llama 2 as our chosen model based on our testing. Right, cool. Llama 2. That's an open-source model, and we could run inference with it on a public cloud. So we've already got a public cloud out there, where our option is essentially limited to running inference against the hosted model. But if we decide we want to fine-tune
the model with our own enterprise data, we might need to deploy it on premises. So this is where we have our own version of Llama 2, and we're going to fine-tune it. Now, deploying on premises gives you greater control and more security benefits compared to a public cloud environment. But it's an expensive proposition, especially when factoring in
model size and compute power, including the number of GPUs it takes
to run a single large language model.
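To make the trade-off concrete, here's a minimal sketch of what the two options could look like from the application's side. The endpoint URLs, payload shape, and token are hypothetical placeholders rather than any real provider's API; the point is that only the target of the request, and who controls the infrastructure behind it, changes:

```python
# A minimal sketch contrasting inference against a hosted model on a public cloud
# with inference against our own fine-tuned Llama 2 deployed on premises.
# The URLs, payload shape, and token are hypothetical placeholders.
import requests

PUBLIC_CLOUD_URL = "https://cloud-provider.example.com/v1/llama-2-70b/generate"
ON_PREM_URL = "http://llm-gateway.internal:8080/v1/generate"  # our fine-tuned Llama 2

def generate(url: str, prompt: str, token: str = "") -> str:
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    resp = requests.post(url, json={"prompt": prompt, "max_tokens": 300},
                         headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json().get("text", "")

# Public cloud: limited to inference against the provider-hosted model.
# generate(PUBLIC_CLOUD_URL, "Write a personalized marketing email...", token="...")

# On premises: same call shape, but against a model we control and have fine-tuned.
# generate(ON_PREM_URL, "Write a personalized marketing email...")
```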
Now, everything we've discussed here is tied to a specific use case, but of course it's quite likely that any given organization will have multiple use cases. And as we run through this model selection framework, we might find that each use case is better suited to a different foundation model. That's called a multi-model approach. Essentially, not all AI models are the same, and neither are your use cases. And this framework might be just
what you need to pair the models and the use cases together to find
a winning combination of both.