Hi, my name is Alvin Ryanputra, and I work on GenAI
and vector search at InterSystems. Today I'll be talking to you
about how to get started on your GenAI use case, as well as some best practices
on how to structure your project. This video will be more conceptual
than technical, and also more practical than theoretical. So let's get started. Now, most of us will think that this is
how an AI project is developed. You first try out AI with some API calls,
some experimentation. You put a front-end on it
to make it a neat proof of concept, and then you improve it
further to get you to your production code. In reality, unfortunately, that last stretch is going to take much, much longer than expected. And if there is a big mismatch between what the proof of concept promised and how it actually performs, the project is often going to get shelved indefinitely. But with proper scoping and structuring, hopefully we can avoid this and build
in iterations in order to get you to something that's truly effective
and useful for your use case. To effectively scope out
a GenAI use case, we first need to develop a good intuition
of what AI can do. Now, most AI projects
will use one of the following methods. First, you have prompt engineering, where you get the large language model to do what you want it to do. If you add in the ability to search and retrieve data, you have retrieval-augmented generation (RAG), and if you're training a model further on your own dataset, that would be fine-tuning. Most AI projects will have at least some level of prompt engineering. So let's talk about the differences
between RAG and fine-tuning. RAG is great for tasks that involve retrieving information. If all your use case requires is retrieving relevant information and getting a general-purpose AI to understand it and give a response, then RAG is your best bet. However, if you need to shape a model's behavior, you need to fine-tune it on your own data. Common tasks that require this are coding-related tasks, or anything that demands very specialized expertise, such as drafting a legal contract or understanding deep medical jargon. The next thing to know is
that RAG is typically going to be generic
and much more flexible, as you're dealing with a general-purpose
large language model. You can easily put together different
systems and different data sources. On the other hand,
fine-tuning is going to be fixed and specialized, based on the data you train it on. And, pretty naturally, RAG is almost always going to be much, much easier. With RAG, you're putting together different systems, which means it's much easier to build in iterations. For fine-tuning, on the other hand, you need a high-quality dataset, and the path to improving a model isn't always as straightforward as it is with RAG. That being said, fine-tuning has the potential to use fewer tokens per request. With RAG, you typically have to give the AI a good amount of information before it can deliver a response, whereas with fine-tuning, all of that is already baked in when you train the model. Fewer tokens mean lower latency, which may be important for user-facing applications, and potentially some cost savings in the long run. Now, some more mature GenAI systems will require
the best of both worlds, and that's
when they do fine-tuning with RAG. So for your GenAI project, you should always start,
no matter what, with prompt engineering and get to a point
where you require something better, and then you start to choose
between RAG and fine-tuning. Now RAG is almost always going to cover your use case,
and most AI projects will only need RAG. And I also highly recommend
pushing RAG to the fullest before considering fine-tuning, because fine-
tuning will require a lot more work. Now that we understand
the various methods of using AI, the next step in scoping out
your GenAI use case will be to constrain
the problem as much as possible. If this represents your use case, where you need to get from point A to point B, instead of applying a GenAI system to handle the entire end-to-end use case, you typically want to break the use case down by identifying sub-problems within it, and then understanding which sub-problem would benefit the most from a GenAI system. Your GenAI system may end up solving only that one portion, and this is most often going to be more effective than applying GenAI to the entire use case. Now, to illustrate with an example,
let's say you have a chatbot, and this chatbot has RAG implemented under the hood. This system has to handle a wide, wide variety of inputs; a user could ask about anything under the sun in this interface. One way to constrain the problem would be to include a dropdown menu, for example, where the user picks the data source they want the interface to do RAG over. You could also add different options that map to different prompts in your RAG system, and further constrain the problem.
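To make that concrete, here is a minimal sketch in Python; the collection names, prompts, and helper functions are hypothetical placeholders rather than any specific product's API.

```python
# A minimal, hypothetical sketch of constraining a RAG chatbot with a dropdown.
# The collection names, prompts, and retrieve() helper are illustrative placeholders.
PROMPTS = {
    "support_docs": "Answer using only the retrieved support documentation.",
    "hr_policies": "Answer using only the retrieved HR policy excerpts.",
}

def retrieve(question: str, collection: str) -> list[str]:
    # Placeholder: query your vector database here, restricted to one collection.
    return [f"(top chunks from '{collection}' relevant to: {question})"]

def handle_chat(question: str, dropdown_choice: str) -> str:
    # The dropdown constrains both the data source searched and the prompt used.
    context = "\n".join(retrieve(question, collection=dropdown_choice))
    return f"{PROMPTS[dropdown_choice]}\n\nContext:\n{context}\n\nQuestion: {question}"
```

The point is simply that each dropdown choice narrows the space of inputs your RAG system has to handle.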
Overall, constraining the problem and really understanding where GenAI comes in will help you effectively solve your use case. Because, after all, you are optimizing for your use case, not for the sake of using GenAI. Now that you have scoped out
your GenAI use case, let's talk about how to build your project. The simplest and most effective way to
structure your project is just like this. You start
with building your GenAI project. You immediately go to evaluate it,
and then you conduct some error exploration and identify
how best to improve the system. So you go back to introducing the next
component or improvement to the system. Now, most people spend way too much time introducing complexity and new components at the build step, and not enough time evaluating the system and understanding why it isn't performing. This loop is a systematic way to ensure that you're putting effort into the things that really matter and that will get you to what you want to achieve. Now, to go a little bit deeper, I'll be
talking now about your evaluations. Your evaluations typically take the form of a dataset that you can test your GenAI system on. Note that this evaluation dataset can actually change over time. If you were to plot your AI's performance over time, it's okay to use different evaluation datasets at the start of your development, because that's where the improvements in your system are going to be very significant, so it's okay if you don't have exact numbers to compare your system against. At the later stages of your project, however, it's important to use the same evaluation dataset in order to squeeze out that extra 3 to 5% of performance. And so you can also iteratively
build your evaluation dataset over time. Now, a few things about your evaluation. First, you want it to be diverse: it should be sufficiently diverse across the set of tasks you need your GenAI system to accomplish. It can be generated either by humans, meaning your domain experts, or by AI; most of the time, you'll use both. A common strategy is to get a domain expert to write perhaps 10 to 20% of the evaluation dataset, and then use an AI to extrapolate from that, generalize it, and introduce some noise, before building out a larger evaluation dataset.
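As a rough sketch of that workflow (the seed examples, prompt, and call_llm helper below are hypothetical placeholders for your own data and LLM client):

```python
# A sketch of growing an evaluation set from a small expert-written seed.
# call_llm() is a placeholder for your LLM client; the seed examples are illustrative.
import json

seed_examples = [
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the login page."},
    # ... a small set written by domain experts
]

def call_llm(prompt: str) -> str:
    return "[]"  # placeholder: return the model's JSON output here

def extrapolate_eval_set(seed: list[dict], n_new: int = 20) -> list[dict]:
    prompt = (
        "Here are example question/answer pairs from our domain:\n"
        + json.dumps(seed, indent=2)
        + f"\nWrite {n_new} new, varied pairs in the same JSON format."
    )
    generated = json.loads(call_llm(prompt))
    # Have a domain expert spot-check the generated pairs before relying on them.
    return seed + generated
```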
Now, when it comes to understanding your metrics: if your GenAI system has a fixed answer, you would use your traditional machine learning metrics, such as F1 score, precision, recall, and so on. But most of the time you will be evaluating based on quality. What you can do there is evaluate the output based on its similarity to your reference answer, and some methods of measuring similarity would be BLEU or METEOR scores. You could also employ a human, or again an AI, to grade the response against the correct answer.
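Here is a minimal sketch of such a similarity-based evaluation loop; the token-level F1 is a simple stand-in for metrics like BLEU or METEOR, and the eval set and generate_answer helper are hypothetical.

```python
# A minimal similarity-based evaluation loop. The token-level F1 below is a simple
# stand-in for metrics such as BLEU or METEOR; eval_set and generate_answer() are
# hypothetical placeholders for your own data and GenAI system.
def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = len(set(pred) & set(ref))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

eval_set = [{"question": "How do I reset my password?",
             "answer": "Use the 'Forgot password' link on the login page."}]

def generate_answer(question: str) -> str:
    return "(your GenAI system's answer)"  # placeholder

scores = [token_f1(generate_answer(item["question"]), item["answer"]) for item in eval_set]
print(f"mean similarity: {sum(scores) / len(scores):.2f}")
```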
For your RAG systems, sometimes you may also want to evaluate the extent of hallucination in the responses. Now that we've covered evaluations, let's take a look at
how to improve your GenAI system. So RAG has two main components to it. You have your retrieval, and you have the large language
model itself. And it's important to understand
where your issues are coming from. A simple way to do
that is to examine the data flowing through your RAG pipeline. Over here I have a diagram, a very simplified one, of how RAG works. You have your chat user interface where a user asks a question. You query a vector database to retrieve relevant information and pass that data back to your AI.
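In code, that flow might look roughly like the sketch below, where search_vector_db and call_llm are placeholders standing in for your own vector database and model client.

```python
# A minimal, hypothetical sketch of the RAG flow in the diagram.
# search_vector_db() and call_llm() are placeholders, not a specific product's API.
def search_vector_db(question: str, top_k: int = 5) -> list[str]:
    # Placeholder: embed the question and return the top_k most similar chunks.
    return [f"(chunk {i} relevant to: {question})" for i in range(top_k)]

def call_llm(prompt: str) -> str:
    # Placeholder: send the prompt to your large language model.
    return "(model answer)"

def rag_answer(question: str) -> str:
    chunks = search_vector_db(question)
    # Print the retrieved chunks: this is exactly the data you inspect during error exploration.
    for i, chunk in enumerate(chunks):
        print(f"retrieved[{i}]: {chunk}")
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```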
I would look at this retrieved data and imagine that, instead of an AI, I have just a human. If a human with this retrieved data can easily answer the user's question, that means the problem lies with the model: the data was sufficient, but the AI was the issue, and so you would look at how to improve the model. On the other hand, if you find that even with this data a human could not answer the question, because the data was not good enough, irrelevant, or incomplete, then your problem is with your retrieval. So this is a really simple way to identify where the problems in your RAG are. When it comes to improving your model, it's reasonably simple: use a bigger model, or use a fine-tuned model. When it comes to retrieval, there are a variety of ways
to improve your performance. The simplest place to start is your chunking: look at how you're storing your data in your vector database. You can conduct some chunking experiments, where you vary your chunk size, your overlap, and how you store the chunks in your vector database.
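A rough sketch of such an experiment, using a simple character-based splitter purely for illustration (in practice you might split on tokens, sentences, or paragraphs instead):

```python
# A minimal sketch of a chunking experiment: vary chunk size and overlap,
# re-index, and compare retrieval quality on your evaluation set.
def split_into_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "..."  # your source document
for chunk_size, overlap in [(256, 0), (512, 64), (1024, 128)]:
    chunks = split_into_chunks(document, chunk_size, overlap)
    # Re-index these chunks into your vector database, then rerun your
    # retrieval evaluation and record the score for this configuration.
    print(f"chunk_size={chunk_size}, overlap={overlap}: {len(chunks)} chunks")
```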
You can also vary your retrieval method. For example, if I have paragraphs stored in my vector database, instead of retrieving only the matching paragraph, I could also retrieve the paragraph before it and the paragraph after it, to provide even more context to my system.
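A small sketch of that neighbor-expansion idea, assuming the chunks were stored in document order (the chunk list here stands in for whatever your vector database returns):

```python
# A sketch of neighbor expansion: after finding the best-matching chunk,
# also pull in the chunk before and after it for extra context.
chunks = ["para 0 ...", "para 1 ...", "para 2 ...", "para 3 ..."]  # in document order

def retrieve_with_neighbors(best_index: int, window: int = 1) -> list[str]:
    start = max(0, best_index - window)
    end = min(len(chunks), best_index + window + 1)
    return chunks[start:end]

# If the vector search says chunk 2 is the best match, also return chunks 1 and 3.
print(retrieve_with_neighbors(best_index=2))
```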
One way to tackle the relevance of your data is to improve the embedding model you use. You can use a larger embedding model, or you can use an embedding model that's more specialized to your domain; these are often fine-tuned for that specific domain. For example, there are embedding models that can better differentiate different types of medical information, and there are embedding models that better understand the language of law. These will most often represent your data better in vector form and hence get you more relevant information back. Now, the final way to improve your retrieval system
is to change the architecture of your RAG
by including more components. One proven way to improve your system is to introduce a reranking layer: once you retrieve, say, the top five results from the vector database, you put them through a reranker that reorders them more accurately based on what you need.
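As a sketch, a reranking step might look like this; it assumes the sentence-transformers library, and the model name is just one commonly used example rather than a recommendation.

```python
# A sketch of a reranking layer using a cross-encoder from sentence-transformers.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, retrieved_chunks: list[str], keep: int = 3) -> list[str]:
    # Score each (query, chunk) pair, then keep only the highest-scoring chunks.
    scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
    ranked = sorted(zip(retrieved_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```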
You can also include something called hypothetical document embeddings (HyDE), where a model first guesses what a relevant document might look like, and you then use that hypothetical document to search for data instead. Both of these have been shown to improve the performance of RAG systems, and there are a variety of other ways to do so. But the most important thing here is to first understand what is wrong with your system. Is it the retrieval, or is it the model? And then, what's wrong with the data? Is it insufficient, or is it irrelevant? Now let's talk
about how to improve your fine-tuning. There are two main components
of fine-tuning: you have your data, and you have the training itself. Now, high-quality training data is extremely important to making a fine-tuned model perform well, and there are a few things you should know about it. First, it should be diverse across the set of tasks you need the model to accomplish. And it's important to understand
the distribution of your training data. If you do some analysis of your data and realize that it's extremely skewed, with data missing from a few categories, you would expect your model to also do poorly in those categories, and you can go deeper into this whenever you evaluate your model. For example, if I see that my model is failing on categories one and two, I would go into my data and take a look at categories one and two, to understand whether the data there is insufficient, or whether it is simply wrong and that is what's causing my evaluations to fail.
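A quick sketch of that kind of check; the category field and file name are hypothetical placeholders for however your training data is stored.

```python
# A sketch of checking how fine-tuning data is distributed across categories.
# The 'category' field and file name are hypothetical; the goal is to spot
# skew and missing categories before you blame the model.
import json
from collections import Counter

with open("training_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

counts = Counter(example["category"] for example in examples)
total = len(examples)
for category, count in counts.most_common():
    print(f"{category:20s} {count:6d}  ({count / total:.1%})")
# Categories with very few examples are where you would expect the model to
# struggle, and where evaluation failures deserve a closer look.
```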
Another thing to note about the data you're using for fine-tuning is how you're processing it. Ideally, your data should be very clean, and the way you process your data also includes how you add special tokens to it. That includes things like your end-of-sequence token, which determines how the model stops, and there are a couple of other tokens depending on how you fine-tune. So that will also influence how your model performs. On the other hand,
you also have the training aspect of it. The main factor here is your choice of base model. Again, following the start-simple philosophy, I would start with a really small model and each time try to get more and more performance out of that fine-tuned model. So I'll start with a 3B model, for example, then go to 7B, then 15B, and eventually maybe something larger like a 33B model. And each time, you may add more and more data to the model as well. Another thing
that you can potentially vary would be the training hyperparameters. Now unfortunately, each training
run is going to take up some time and some money,
and you don't have the liberty of really doing a lot of hyperparameter
tuning. As such, what I would do is follow best practices from papers published by research institutions, or from tutorials out there. Generally, if you follow those guides, most of your problems will not come from hyperparameters; they will probably come from your choice of base model or your data. There is also some research showing that, as long as your hyperparameters are within a reasonably wide range, they don't affect model performance that much. That said, there are some tradeoffs here. For example, for LoRA training, if you choose a larger rank (r), training is going to take longer because you're training more parameters; on the other hand, the result could potentially be better.
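For reference, here's a LoRA configuration sketch assuming the Hugging Face peft library; the specific values are illustrative starting points of the kind you'd find in common guides, not tuned recommendations.

```python
# A sketch of a LoRA configuration using the Hugging Face peft library.
# The rank, alpha, dropout, and target modules are illustrative starting points.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                # LoRA rank: a larger r trains more parameters and takes longer
    lora_alpha=32,       # scaling factor, often set to around 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers receive LoRA adapters
    task_type="CAUSAL_LM",
)
# The config is then applied to a base model with peft.get_peft_model(model, lora_config).
```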
So these are all things to consider when looking at improving your fine-tuned model. But again, I would always look at your evaluations and start with improving your data before thinking about more specific items. To summarize this video: if there are only three takeaways you remember from it, I want them to be these. Firstly, start simple. Don't try anything fancy or too experimental; do something really simple and then slowly improve on it. The next thing to remember is to do your error exploration. This loop of evaluating your system and then identifying improvements is extremely important, and is probably the fastest way
to improve your model. Lastly, focus on your use case. Stay close to the ground. Talk to domain experts
and talk to your users. Remember, you're optimizing to achieve your use case, not to apply GenAI to everything out there.
from watching this video, and I'm excited
to see what you can build with GenAI. Thank you.