Master LLMs: Top Strategies to Evaluate LLM Performance

Video Statistics and Information

Captions
There are hundreds of large language models (LLMs) available for you to use, from closed-source LLMs like GPT-4 and Claude, to open-source LLMs like LLaMA 2 and Falcon, or even your own fine-tuned LLM, as we teach in our course. But how do you ensure that you are using the best model for the task at hand? After this video, you will know how to choose the right LLM for your use case and what to look out for if you decide to fine-tune an LLM yourself.

If you are time- and resource-limited, the best choice is to use closed-source models like GPT-4 or Claude, since the teams behind them make sure to give you the best performance they can out of those models from the get-go. Still, you may want to evaluate and compare their performance, especially as your application becomes more and more specific and you would like to switch to your own LLM.

A straightforward way to evaluate an LLM is with the perplexity metric. LLMs predict word probabilities in sentences, which lets them create human-like phrases. Perplexity measures how well a language model predicts a given sequence of words, such as a sentence. Essentially, it gauges the level of uncertainty the model has in assigning probabilities to a sequence of words: it tells you how confident the model is in its generation.

Here is an example of computing the perplexity when generating four tokens. We have four probability values, one for each token conditioned on the previous tokens, which together amounts to generating the rest of the sentence. The probability of the sentence made of those four tokens is the product of their probabilities. We then normalize the sentence probability with respect to the length of the sentence, so we divide by four; since the sentence probability is a product of token probabilities, we normalize it by taking its geometric mean. Finally, the perplexity of the sentence is simply the inverse of the normalized sentence probability. To get the probability of each generated token and then compute the perplexity, you can use the compute_transition_scores method when generating text with the Transformers library.
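As a minimal sketch of that computation, here is how the token log-probabilities returned by compute_transition_scores can be turned into a perplexity. The model (gpt2), the prompt, and the four-token generation length are arbitrary choices for illustration, not anything prescribed in the video.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works the same way; gpt2 is used only because it is small.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=4,               # four generated tokens, as in the example above
    do_sample=False,
    return_dict_in_generate=True,
    output_scores=True,
)

# Log-probability of each generated token, conditioned on everything before it.
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

log_probs = transition_scores[0]       # one log-probability per generated token
avg_log_prob = log_probs.mean()        # geometric mean, taken in log space
perplexity = torch.exp(-avg_log_prob)  # inverse of the normalized sentence probability
print(f"Perplexity of the generated continuation: {perplexity.item():.2f}")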
However, perplexity isn't always sufficient. Most closed-source models do not give us the probabilities associated with each generated token, making it impossible to compute and compare perplexity across models. This metric also does not help us evaluate a model on more complicated, subjective language tasks, such as how well it can write a summary or how interesting the script it produces for a video about benchmarking LLMs is. And even when you can compute it, looking only at perplexity may favor models that are simply memorizing data rather than truly understanding it.

Luckily, there are other evaluation metrics available on curated benchmarks, allowing you to compare your model with many other LLMs and approaches to see how well you are doing. You cannot really know how good you are until you compare yourself with others, which is why benchmarks are so important: how can you know if your approach is the best if you can't compare it with others? There are many benchmarks, depending on the task you want to achieve. They can test LLMs on world knowledge, following complex instructions, arithmetic, programming, and more, and they give you a clear picture of where you stand relative to all the other LLMs out there. Several leaderboards track the progress of LLMs on the most important benchmarks, including the Open LLM Leaderboard by Hugging Face and the InstructEval leaderboard. Fortunately, you don't need to run all the benchmarks individually yourself. There are scripts, such as the one we cover in our course, that will run the standard benchmarks on your LLM and return the metrics, which you can then compare using the leaderboards, so you can jump into our free course and follow along with our practical example to do that.

But the question remains: what are those benchmarks, and which one should you use? Well, it depends on your goal. If your goal is to build an LLM for coding, like Copilot, you may want to look at the HumanEval benchmark created by OpenAI. This benchmark measures code-generation accuracy. The dataset includes 164 original Python programming problems with roughly eight tests each. The tests assess language comprehension, algorithms, and simple mathematics, and some resemble software interview questions. Here, the model generates k different solutions to a single problem; if any of the k solutions passes the unit tests, that counts as a win. So when we see a pass@1 result, we are evaluating a model that generates just one solution, which must be correct on the first try.

Here is an interesting result reported in the Code Llama paper. If we look at the results table, the Unnatural Code Llama version gets 62.2% on this benchmark for pass@1, i.e. when generating a single answer. The model is trained on synthetic, self-instructed data obtained by prompting a different LLM, and the generated results are used to train their own model. They got close to the result obtained by GPT-4, which scores 67% on the same benchmark, so you can see there is still plenty of room for improvement. This result highlights the gains a small, high-quality dataset can bring.
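To make the metric concrete, here is a minimal sketch of the unbiased pass@k estimator popularized alongside HumanEval, assuming you have already run the unit tests and know, for each problem, how many of the n sampled solutions passed. The function name and the example numbers are hypothetical, purely for illustration.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total number of solutions sampled for the problem
    c: number of sampled solutions that passed all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical results: (samples drawn, samples that passed) for three problems.
results = [(20, 13), (20, 0), (20, 4)]

for k in (1, 10):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k}: {score:.3f}")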
Now, if you are trying to build an LLM for common-sense or trivia-style reasoning tasks, which basically means tasks that involve understanding basic facts and logic that humans intuitively know and drawing simple conclusions from given information, there are two benchmarks to look at. The first is HellaSwag, a challenge to measure common-sense inference: given an event description such as "a woman sits at a piano," the LLM must select the most likely follow-up, "she sets her fingers on the keys." Humans score 95.6%, GPT-4 scores 95.3%, and the best open-source model currently on the Open LLM Leaderboard scores only 87.51%. The second one to look into is the AI2 Reasoning Challenge, or ARC. It's a dataset of over 7,000 genuine grade-school-level multiple-choice science questions, assembled to encourage research in advanced question answering.

If your goal is to build a more truthful and exact LLM, something that hopefully never hallucinates, you may want to look at two more benchmarks. The first is called Measuring Massive Multitask Language Understanding, or MMLU. It evaluates LLMs on multitask accuracy, covering 57 subjects across STEM, the humanities, the social sciences, and more. High accuracy on this benchmark requires extensive world knowledge and problem-solving ability, which makes it ideal for identifying a model's blind spots. The second interesting one is TruthfulQA, a truthfulness benchmark designed to assess the accuracy of language models in generating answers to questions. It consists of over 800 questions across 38 categories, encompassing topics such as health, law, finance, and politics. State-of-the-art models like GPT-4 only achieve around 60%, so there is still a lot of room for improvement and for you to work on.

Having identified the benchmarks most relevant to your needs, I encourage you to check out the free course we built in collaboration with Towards AI, Activeloop, and the Intel Disruptor Initiative. The course provides a comprehensive guide on evaluating LLMs on a set of benchmarks, with a coding example using EleutherAI's Language Model Evaluation Harness library. It's a script that automatically runs a set of benchmarks of your choosing on any language model, whether it's proprietary and accessible via an API or an open-source model you are running yourself (a minimal usage sketch follows at the end of this transcript).

So there you have it. Now you have a much better idea of which benchmark to use and why. Simply use the first link below and follow the notebook we have for running the evaluation script. I hope you've enjoyed this video, and I will see you next time with more exciting AI techniques.
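As referenced above, here is a minimal sketch of invoking EleutherAI's lm-evaluation-harness on a handful of the benchmarks discussed in the video. The model identifier, task names, and few-shot setting are placeholder choices, and the harness's API has changed across versions, so the exact arguments may differ from what is shown here.

# Requires EleutherAI's lm-evaluation-harness (the lm_eval package).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                         # Hugging Face causal-LM backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",   # placeholder model id
    tasks=["hellaswag", "arc_challenge", "mmlu", "truthfulqa_mc2"],
    num_fewshot=0,            # leaderboards often use task-specific few-shot settings
    batch_size=8,
)

# Per-task metrics (accuracy, normalized accuracy, etc.) live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)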
Info
Channel: What's AI by Louis-François Bouchard
Views: 3,637
Keywords: ai, artificial intelligence, machine learning, deep learning, ml, data science, whats ai, whatsai, louis, louis bouchard, bouchard, what's ai, gen ai 360, activeloop course, intel course, cohere llm, cohere course, w&b course, lambda labs course, towards ai course, towards ai, llm course, large language models course, fine-tuning llms, llm, llms, build your own llm, train llm, train llm from scratch, fine tune llms, llm certification, llmops, foundational model certification
Id: iWlTCBUoru8
Length: 8min 41sec (521 seconds)
Published: Sun Oct 29 2023