Hi, my name is Alvin Ryanputra, and I work on GenAI and vector search at InterSystems. Today I'll be talking to you about how to get started on your GenAI use case, as well as some best practices on how to structure your project. This video will be more conceptual than technical, and also more practical than theoretical. So let's get started.
Now, most of us will think that this is how an AI project is developed. You first try out AI with some API calls and some experimentation. You put a front end on it to make it a neat proof of concept, and then you improve it further to get to your production code. Now, in reality, unfortunately, this line is going to be much, much longer. And if there is a big mismatch between the expectations for the proof of concept and its actual performance, it's often going to get shelved indefinitely. But with proper scoping and structuring, hopefully we can avoid this and build in iterations to get you to something that's truly effective and useful for your use case.
To effectively scope out a GenAI use case, we first need to develop a good intuition of what AI can do. Now, most AI projects will use one of the following methods. First, you have prompt engineering, where you get the large language model to do what you want it to do. If you add in the ability to search and retrieve data, you have retrieval-augmented generation (RAG), and if you're retraining your model on your own dataset, that would be fine-tuning. Most AI projects will definitely have some level of prompt engineering, so let's talk about the differences between RAG and fine-tuning.
RAG is great for tasks that involve retrieval of information. If all your use case requires is retrieving relevant information and getting a general-purpose AI to understand it and give a response, then RAG is your best bet.
However, if you need to shape a model's behavior, you need to fine-tune on some data. Some common tasks that require this would be coding-related tasks or something that requires very specialized expertise, like drafting a legal contract or understanding some deep medical jargon.
The next thing to know is that RAG is typically going to be generic and much more flexible, as you're dealing with a general-purpose large language model. You can easily put together different systems and different data sources. On the other hand, fine-tuning is going to be fixed and specialized, based on what data you train it on.
And pretty naturally, RAG is almost always going to be much, much easier. For RAG, you're putting together different systems, which means that it's much easier to build in iterations. For fine-tuning, on the other hand, you need a high-quality dataset, and the path to improving a model isn't always as straightforward as in RAG.
That being said, fine-tuning has the potential to use fewer tokens per request. In RAG, you typically have to give the AI a good amount of information before it can deliver a response, whereas in fine-tuning, all of that is already done when you train the model. Fewer tokens means lower latency, which may be important for user-facing applications, and potentially some cost savings in the long run. Now, some more mature GenAI systems will require the best of both worlds, and that's when they combine fine-tuning with RAG.
So for your GenAI project, you should always start, no matter what, with prompt engineering, get to a point where you require something better, and only then start to choose between RAG and fine-tuning. Now, RAG is almost always going to cover your use case, and most AI projects will only need RAG. I also highly recommend pushing RAG to the fullest before considering fine-tuning, because fine-tuning will require a lot more work.
Now that we understand the various methods of using AI, the next step in scoping out your GenAI use case is to constrain the problem as much as possible. If this represents your use case, where you need to get from here to here, then instead of applying a GenAI system to handle the entire end-to-end use case, you typically want to break the use case down as much as possible by identifying sub-problems within it, and then understanding which sub-problem can benefit the most from a GenAI system. And so your GenAI system may solve only this portion. This is most often going to be more effective than applying GenAI to the entire use case.
To illustrate with an example, let's say you have a chatbot, and this chatbot has RAG implemented under the hood. Your system has to handle a wide, wide variety of inputs, and a user could ask about anything under the sun in this interface. One way to constrain this problem would be to include a dropdown menu, for example, where a user can pick the data source they want this interface to do RAG on. You could also put in different options that select a different prompt in your RAG system and further constrain the problem. Overall, constraining the problem and really understanding where GenAI comes in will help you effectively solve your use case. Because after all, you are optimizing for your use case, not for the use of GenAI.
Now that you have scoped out your GenAI use case, let's talk about how to build your project. The simplest and most effective way to structure your project is just like this. You start by building your GenAI project. You immediately go on to evaluate it, and then you conduct some error exploration and identify how best to improve the system. Then you go back to introduce the next component or improvement to the system.
Now, most people are going to spend way too much time introducing complexities and new components at this step of the process, and not enough time evaluating the system and understanding why it is not performing. This loop is a systematic way to ensure that you're putting effort into things that really matter and that will get you to what you want to achieve.
Now, to go a little bit deeper, I'll talk about your evaluations. Your evaluations typically look like a dataset that you can test your GenAI system on. And note that this evaluation dataset can actually change over time. If you were to plot your AI's performance over time, it would look something like this. It's okay to use different evaluation datasets over time, especially at the start of your development, because over here the improvement in your system is going to be very significant. Hence, it's okay if you don't have exact numbers to compare your system against. At the later stages of your project, however, it's important to use the same evaluation dataset in order to squeeze out that extra 3 to 5% of performance. And so you can also iteratively build your evaluation dataset over time.
Now, a few things about your evaluation dataset. First, you want it to be diverse. It should be sufficiently diverse across the set of tasks that you need your GenAI system to achieve. It can be generated either by humans, who will be your domain experts, or by AI. Most of the time, you'll use both. A common strategy is to get a domain expert to generate perhaps 10% or 20% of the evaluation dataset, and then use an AI to extrapolate from that, generalize, and introduce some noise to build a larger evaluation dataset.
Now, when it comes to understanding your metrics, if your GenAI system has a fixed answer, you would use traditional machine learning metrics such as F1 score, precision, recall, and so on.
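For the fixed-answer case, here is a minimal sketch in Python of how precision, recall, and F1 could be computed over an evaluation dataset. The function name `precision_recall_f1`, the labels, and the sample data are all assumptions made up for illustration; in practice you might rely on an existing metrics library instead.

```python
def precision_recall_f1(expected, predicted, positive):
    """Compute precision, recall and F1, treating `positive` as the positive label."""
    pairs = list(zip(expected, predicted))
    # True positives: the system said `positive` and that was correct.
    tp = sum(1 for e, p in pairs if e == positive and p == positive)
    # False positives: the system said `positive` but the answer was something else.
    fp = sum(1 for e, p in pairs if e != positive and p == positive)
    # False negatives: the answer was `positive` but the system said something else.
    fn = sum(1 for e, p in pairs if e == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical evaluation dataset: expected labels vs. what the GenAI system returned.
expected = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "spam"]

precision, recall, f1 = precision_recall_f1(expected, predicted, positive="spam")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Re-running the same function on the same evaluation dataset after each change to the system gives you comparable numbers from one iteration to the next.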
But most of the time you will be evaluating based on quality, right? What you can do here is evaluate based on similarity to your evaluation answer. Some methods of evaluating similarity would be BLEU or METEOR scores. You could also again employ a human or an AI to grade the response against the correct answer. And for your RAG systems, you may sometimes also want to evaluate the extent of hallucinations in the response.
Now that we've covered evaluations, let's take a look at how to improve your GenAI system. RAG has two main components: the retrieval and the large language model itself. It's important to understand where your issues are coming from, and a simple way to do that is to examine the data in your RAG. Over here I have a diagram, a very simplified one, of how RAG works. You have your chat user interface, where a user asks a question. You go to a vector database to retrieve relevant information and pass that data back to your AI.
I would look at this data and imagine that instead of an AI, I have just a human. If a human with this retrieved data can easily answer the user's question, that means the problem lies with the model: the data was sufficient, but the AI was the issue. And so you would look at how to improve the model. On the other hand, if you find that even with this data a human could not answer the question, because the data was irrelevant or insufficient, then your problem is with your retrieval. So this is a really simple way to identify where the problems in your RAG are.
When it comes to improving your model, it's reasonably simple. Just use a bigger model, or you may want to use a fine-tuned model. When it comes to retrieval, there are a variety of ways to improve your performance, and the simplest place to start is your chunking, that is, how you're storing your data in your vector database.
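To make chunking concrete, here is a minimal sketch in Python that splits a document into fixed-size chunks that share some words with the previous chunk. The function name `chunk_text`, the choice to split on whitespace-separated words, and the sample settings are assumptions for illustration only; real systems often chunk by tokens, sentences, or document structure instead.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of `chunk_size` words, where each chunk
    repeats the last `overlap` words of the previous chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


# Hypothetical experiment: try a few (chunk_size, overlap) settings on one document.
document = "one two three four five six seven eight nine ten"
for chunk_size, overlap in [(4, 0), (4, 2), (6, 3)]:
    chunks = chunk_text(document, chunk_size, overlap)
    print(chunk_size, overlap, len(chunks), chunks)
```

Each setting would then be stored in the vector database and scored against the same evaluation dataset, so that the comparison between settings is fair.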
So you can conduct some chunking experiments. This is where you vary your chunk size, your overlap, and how you're storing the chunks in your vector database. You can also vary your retrieval method. For example, if I have paragraphs stored in my vector database, then instead of retrieving only the matching paragraph, I could also retrieve the paragraph before it and the paragraph after it to provide even more context to my system.
One way to tackle the relevance of your data is to improve the embedding model that you use. You can use a larger embedding model, or one that's more specialized to your domain. These are often fine-tuned on that specific domain. For example, there are embedding models that can better differentiate between types of medical information, and embedding models that better understand the language of law. These will most often represent the data better in vector form and hence get you more relevant information.
Now, the final way to improve your retrieval system is to change the architecture of your RAG by including more components. One proven way to improve your system is to introduce a reranking layer. Once you retrieve the top five from the vector database, for example, you put them through a reranker that will rank them more accurately based on what you need. You can also include something called hypothetical document embeddings, where a model guesses what a relevant document could look like, and you then use that guess to search for data instead. These have been shown to improve the performance of RAG systems, and there are a variety of other ways to do so.
But the most important thing here is to first understand what is wrong with your system. Is it the retrieval, or is it the model? And then, what's wrong with the data? Is it not relevant, or is it just insufficient?
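The idea of also retrieving the paragraph before and after a match can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `expand_hits` is hypothetical, the paragraphs are assumed to be kept in document order, and the vector search is assumed to return the positions of the matching paragraphs.

```python
def expand_hits(paragraphs, hit_indices, window=1):
    """Return each retrieved paragraph plus `window` neighbors on each side,
    in document order and without duplicates."""
    selected = set()
    for i in hit_indices:
        if not 0 <= i < len(paragraphs):
            raise IndexError(f"hit index {i} is outside the document")
        first = max(0, i - window)                    # clamp at the start
        last = min(len(paragraphs), i + window + 1)   # clamp at the end
        selected.update(range(first, last))
    return [paragraphs[i] for i in sorted(selected)]


# Hypothetical document and search result: the vector search matched paragraph 2.
paragraphs = ["intro", "background", "dosage", "side effects", "summary"]
context = expand_hits(paragraphs, hit_indices=[2], window=1)
print(context)
```

The expanded context is what would be passed to the model instead of the single matching paragraph, at the cost of more tokens per request.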
Now let's talk about how to improve your fine-tuning. There are two main components of fine-tuning: your data and your training. High-quality training data is extremely important to making a fine-tuned model perform well, and there are a few things that you should know about it. First, it should be diverse across the set of tasks that you need the model to accomplish. It's also important to understand the distribution of your training data. If you do some analysis and realize that your data is extremely skewed, and you're missing data from a few categories, you would expect your model to also do poorly in those categories. You can go deeper and analyze this whenever you evaluate your model. For example, if you see that your model is failing on categories one and two, I would go into my data and check whether I have enough data in those categories. Is it insufficient data, or is my data in categories one and two simply wrong and causing my evaluations to fail?
Another thing to note about the data that you're using for fine-tuning is how you're processing it. Ideally, your data should be very clean. The way you process your data also includes how you add tokens to it. That includes things like your end-of-sentence tokens, which determine how the model stops, and there are a couple of other tokens depending on how you fine-tune. So that will also influence how your model performs.
On the other hand, you also have the training aspect. The main factor here is your choice of base model. Again following this philosophy, I would start with a really small model and each time try to milk more and more performance out of that fine-tuned model. So I'll start with a 3B model, for example, then go to 7B, go to 15B, and eventually maybe something larger like a 33B. And each time you may add more and more data as well.
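Coming back to the data side for a moment, the distribution check described above could look roughly like the following Python sketch. The function names, the category labels, and the 15% threshold are all assumptions chosen for illustration; what counts as "too little data" depends on your use case.

```python
from collections import Counter


def category_shares(labels):
    """Return the fraction of training examples that falls in each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def underrepresented(labels, expected_categories, min_share):
    """List expected categories whose share of the data is below `min_share`,
    including categories that are missing entirely."""
    shares = category_shares(labels)
    return sorted(c for c in expected_categories if shares.get(c, 0.0) < min_share)


# Hypothetical training set: one category label per training example.
labels = ["cat1"] * 1 + ["cat2"] * 2 + ["cat3"] * 17
expected_categories = ["cat1", "cat2", "cat3", "cat4"]

weak = underrepresented(labels, expected_categories, min_share=0.15)
print(weak)
```

If the categories flagged here are the same ones failing in your evaluation, insufficient data is a likely cause; if a failing category has plenty of data, it is worth checking whether that data is simply wrong.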
Another thing that you can potentially vary is the training hyperparameters. Unfortunately, each training run is going to take some time and some money, so you don't have the liberty of doing a lot of hyperparameter tuning. As such, what I would do is follow best practices from papers published by research institutions or from tutorials out there. Generally, if you follow those guides, most of your problems will not come from hyperparameters, but from your choice of base model or your data. And there is some research showing that, so long as your hyperparameters are within a pretty large range, they don't really affect model performance that much. Also note that there are some tradeoffs here. For example, for LoRA training, if you choose a bigger rank r, it's going to train more parameters and take much longer. On the other hand, it could potentially be better. So these are all things to consider when looking at improving your fine-tuned model. But again, I would always look at your evaluation and start with improving your data before thinking about more specific items.
To summarize, if there are only three takeaways you have from this video, I want you to remember these. Firstly, start simple. Don't try anything fancy. Don't try anything too experimental. Do something really simple and then slowly improve on it. The next thing to remember is to do your error exploration. These steps of evaluating your system and then identifying improvements are extremely important, and this is probably the fastest way to improve your model. Lastly, focus on your use case. Stay close to the ground. Talk to domain experts and talk to your users. Remember, you're optimizing to achieve your use case, not to apply GenAI to anything out there.
I hope you have learned something from watching this video, and I'm excited to see what you can build with GenAI. Thank you.
