LangSmith Tutorial - LLM Evaluation for Beginners

Video Statistics and Information

Captions
If you are building applications with large language models, then LangSmith is definitely a platform you cannot ignore. In this video I'm going to introduce you to LangSmith, a platform for building production-grade large language model applications. In today's tutorial we will go over what LangSmith is, starting from scratch at the beginner level, and then we will cover concepts like datasets and evaluation to really understand how to use this platform. As always, I will walk you through every step within VS Code, showing all the code examples, and all of the code will be made available in the GitHub repository linked in the description. A quick note: LangSmith is currently still in private beta, so if you don't have access yet, make sure to sign up and get on the waitlist. For those of you who are new here, my name is Dave Ebbelaar and I run a company called Datalumina, which is a data and artificial intelligence consulting business. We also have two training programs: Data Alchemy, which is completely free to join and covers the technical side of artificial intelligence and data science, and a paid coaching program for data professionals who want to learn how to monetize their data and AI skills, similar to how I am doing that. So what is LangSmith? It's a platform to debug, test, evaluate, and monitor your large language model chains and outputs. Why would you want that? For example, if we have a question-and-answer application that searches over data, then once you put it into production you of course want some way to evaluate whether the application is coming up with the right answers. LangSmith is the platform for that. What can you do with it? You can log runs, meaning every time you interact with a large language model you leave a trace you can look into; you can organize your work and visualize your runs. I will show you what that looks like inside the application, which I have access to: it's very visually oriented, so you can see exactly the chains, each of the steps, and the inputs and outputs. There is also an opportunity to run prompts in the playground to debug and tweak prompt templates and inputs you might be working with, you can share your work, and you can create datasets for testing and evaluation. That last part is what we will mainly look into in today's tutorial, because I believe it really is the key strength of LangSmith right now, and it's not something you can do easily with other platforms or frameworks out there. This is going to be a hands-on tutorial showing how the platform actually works; if you want a more high-level overview of what LangSmith can do, I would refer you to the previous video I already made. Now some quick context on where this fits into the whole pipeline of building large language model applications: it's really about putting them into production, taking them to the next level. You build something locally, you evaluate it, and now it's ready for actual customers, actual people, to use. In my work with Datalumina I've been working on chatbots for various clients, and what you typically see when you get to the next level is that you want to introduce institutional knowledge: company data that you can use to ground your answers.
If you're just interacting with ChatGPT or the OpenAI API directly, the answers will be very generic, so typically you upload frequently-asked-questions documents and retrieve information from them through vector stores and similarity searches; if you want to learn more about that, check out my other tutorials. The point is, you want a way to verify, as you scale your vector stores, add more data to them, and integrate more logic into your application, that the answers still remain valid. That's why this is important and why it is the next logical step in the realm of LLM applications. It's also nothing new: with regular machine learning this is already very common. When you put a model into production you can, for example, use something like MLflow to track all your experiments and monitor model drift over time, to make sure that as you introduce new training data or new unseen data the results remain accurate. And by the way, MLflow is also putting features in place to track large language model outputs, so that could be an interesting comparison in the future. So now let's get into the actual code. To get started we first have to set up four environment variables, which I will be loading from static variables as well as a .env file. The only thing I'm getting from the .env file is the LangSmith API key, which you can get on the platform by going to the API keys section in the lower left corner and exporting or saving it. That should be the setup. I'm now going to run all of this in an interactive session, and we're going to log our first run to show you what this looks like. We set up a client, which we import from LangSmith, and then it's as easy as interacting with LangChain like you normally would: we specify a large language model and call llm.predict with "Hello, world!". This is all with the default settings, just a simple one-off prediction, and we get a response back: "Hello, how can I assist you today?" Now the cool thing is that if we come over to the LangSmith platform and select the project we are running this under, the LangSmith tutorial project, you can see the runs showing up there. This is the one we just did: here is the human input, "Hello, world!", and here the AI output. So now we have an actual trace of it, with start time, end time, status, tokens, and latency: all kinds of metadata that is very interesting to log if you want to monitor this application. You can also see which model we were using, which version, which system we ran this on, and how the ChatOpenAI object was constructed. Very interesting stuff. If I come back to the runs, you can also see how fast each one is. Let's run one more prompt, "What can you do?", wait for it, and we get the result: "As an AI I can perform..." et cetera. Back in the UI, boom, there is the other one, and you can see this one took pretty long, which is already interesting: why was this taking 10 seconds? So that is how you get started with LangSmith, pretty easy, right? I love how easy and convenient they made this: all you have to do, basically, is specify the project that you want to run it under.
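As a reference, here is a minimal sketch of that setup in Python. The environment variable names follow LangSmith's standard configuration, the project name is a hypothetical placeholder, and only the LangSmith API key (and your OpenAI key) come from the .env file.

```python
import os

from dotenv import load_dotenv
from langchain.chat_models import ChatOpenAI
from langsmith import Client

load_dotenv()  # expects LANGCHAIN_API_KEY and OPENAI_API_KEY in a .env file

# Static variables: turn tracing on and point every run at one project
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"] = "langsmith-tutorial"  # hypothetical project name

client = Client()  # LangSmith client, used later for datasets and evaluation

# Any regular LangChain call is now traced to the project above
llm = ChatOpenAI()
print(llm.predict("Hello, world!"))  # e.g. "Hello! How can I assist you today?"
```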
Another cool thing is that you can use various organizations within LangSmith. You can do that from the UI, you can add members to different organizations, and you can have projects at the organization level and get an API key for every organization. So if you're working with various clients and you want to keep everything separate, you can, which is ideal for the projects I'm working on with multiple clients. All right, now let's check out how evaluation works, which I believe is, as I've said, the most important feature of LangSmith right now. We're first going to cover the quick start, and then we'll dive into more detail on how datasets actually work, how you can create them, and the various evaluators they have in place. For now, let's continue with the code example. We're going to create a quick dataset: we create some example inputs and give the dataset a name; we call it the rap battle dataset. We can register this dataset by running the code over here: we have the client, we create a dataset, and we give it a name and a description. After running this piece of code and going back to our project and then to Datasets, we can see the rap battle dataset, which has no examples yet. So let's put in some examples. We loop over the list we've just created, which is just four entries, and you can see each entry is a single input, not a key-value pair; like I've said, we will dive into that in a bit, but for now we're just adding some examples. For each entry we say the input is a question, pass in the input prompt, set no outputs, and pass in the dataset ID. If we come back to the UI, you can see we have inputs and no outputs, just inputs. All right, on to the next step: quickly evaluating this dataset. The way this evaluation works is that we first configure a RunEvalConfig. This is a bit tricky in the beginning, at least I found it quite tricky, but you can look up more information on how it works in the LangChain evaluator documentation, and I will walk you through it. The documentation starts out with QA evaluation, for when you have question-and-answer pairs, key and value. Right now, for the simple start like I've explained, we just have single inputs. This could be the case when you are running your application in production and users just ask a question: there is no right or wrong answer, but you still want to do some evaluation. For that, they have come up with some clever approaches; if we scroll down a bit we find criteria evaluation without labels. To configure the RunEvalConfig we can pass in criteria, and that is exactly what we're doing here: we use criteria as the evaluator. How do we evaluate outputs if we don't have a right or wrong answer, meaning we just have the inputs? Well, we can evaluate them against a certain set of criteria, and out of the box they already provide some, things like conciseness, relevance, correctness, coherence, harmfulness, et cetera. These you can use straight out of the box.
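To make the dataset step concrete, here is a minimal sketch. The dataset name and the Barbie/Oppenheimer prompt come from the walkthrough; the other example prompts are made up for illustration.

```python
from langsmith import Client

client = Client()

example_inputs = [
    "a rap battle between Atticus Finch and Cicero",  # hypothetical example
    "a rap battle between Barbie and Oppenheimer",
    "a rap battle between the past and the future",   # hypothetical example
    "a rap battle between a cat and a dog",           # hypothetical example
]

dataset = client.create_dataset(
    dataset_name="Rap Battle Dataset",
    description="Rap battle prompts without reference outputs.",
)

# Single inputs only: no outputs yet, so later we can only evaluate by criteria
for input_prompt in example_inputs:
    client.create_example(
        inputs={"question": input_prompt},
        outputs=None,
        dataset_id=dataset.id,
    )
```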
So how does it work? You select criteria and plug in any of the strings from that list, but you can also specify a custom criterion. In the example they use "cliche": since we're creating rap battles, we are going to evaluate whether the lyrics are cliché. How do we do that? With a dictionary: the key is the criterion, "cliche", and the value describes how to respond. That is the structure for setting it up. Now we're going to run all of this: first we configure the eval config, and then we call run on dataset, which takes the client we previously defined, the dataset name, the large language model, and the eval config. Again, they made some pretty nice wrappers for all of this. As you'll find out, this now runs in the background and takes some time, because it loops over all the inputs and all the criteria, and within that it issues multiple large language model prompts to do the evaluation. Now that it's finished, let's come back to the dataset and click on the last run we just did, and we get this nice dashboard: for every example in the dataset there is a run, and we can click on it. Let's check out the one with Barbie and Oppenheimer. At the top level we have the input and the output, the whole rap battle with some verses. That's the top level, but we also get feedback, and this is where it gets interesting: for each of the criteria we've defined there is an additional run we can look into. Why are there four? Well, if you use criteria, the default criterion is helpfulness, and we've added harmfulness, misogyny, and the custom cliche one, so four in total. Let's have a quick look, starting with helpfulness. We can drill down into it, and you land in the middle of a tree, basically, where you can see the input, the output, and the final output: the verdict of the evaluation. The input and output are what we've already seen, but further down you can see how the large language model actually assesses helpfulness. Based on the criterion "helpfulness", this is the prompt instruction that LangChain already put in place in LangSmith to do the assessment: you can see helpfulness, insightfulness, and appropriateness, and it concludes with a yes plus the why: based on the criteria, the submission can be considered helpful, insightful, and appropriate. Even more interesting, if we drill down to the bottom layer we can look at the prompt engineering itself: "You are assessing a submitted answer on a given task or input based on a set of criteria." We have the data, the input, and the submission, and if we scroll down we have the criterion: "helpfulness: Is the submission helpful, insightful, and appropriate? If so, respond Y. If not, respond N," followed by the end of the data. So here you can see how the prompt was actually engineered and sent to the large language model to reach the final conclusion. I know it's quite complex and takes some time to cover, but I think it's really important to understand how this works at a top level.
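Putting that together, a sketch of the criteria evaluation might look like the following. It assumes LangChain's RunEvalConfig.Criteria helper and a ChatOpenAI model, and the exact wording of the custom "cliche" criterion is paraphrased from the walkthrough.

```python
from langchain.chat_models import ChatOpenAI
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

client = Client()

eval_config = RunEvalConfig(
    evaluators=[
        "criteria",  # bare "criteria" defaults to the built-in helpfulness criterion
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria("misogyny"),
        RunEvalConfig.Criteria(
            {"cliche": "Are the lyrics cliche? Respond Y if they are, N if they are entirely unique."}
        ),
    ]
)

run_on_dataset(
    client=client,
    dataset_name="Rap Battle Dataset",
    llm_or_chain_factory=ChatOpenAI(temperature=0.9),  # temperature is an assumption
    evaluation=eval_config,
)
```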
If we now look at the other ones, we get the same kind of feedback for harmfulness, misogyny, and cliche, just phrased differently. We can open one up, and again there is the tree structure where we can go all the way to the bottom and see how the prompt was built: again "BEGIN DATA", input, submission, and then the criterion we created, the custom one. Here you can also see how it relates to the input structure, which is a dictionary with just one key and one value: "cliche", and "Are the lyrics cliche? Respond Y if they are, N if they are entirely unique." This shows how easily you can create custom criteria based on your data and the use case you're working with, and also how to trace all the way back how LangSmith uses your custom criterion to do the evaluation. Finally, if we look at the top level of this trace, we get all the top-level metadata and the final output that is also logged in the UI: the feedback, with a key, a score, and a value, plus a comment. Now, if we go back to the dashboard, we can start to understand how this feedback is produced. You can see we have four total runs, along with the tokens and the time it took, and each criterion gets a score, which is an average. If we open it up a little: for cliche we have three zeros and one output identified as cliché, so that averages out to a cliche score of 0.25, where 1 would mean all of the outputs are cliché, and here it's just one out of four. Misogyny is at zero, and there is no harmful content either, which could be very interesting for monitoring a business application for harmful responses. Helpfulness has a score of 1, a 100% score, meaning all of the outputs were judged helpful. I think it's quite clever how they've set this up; you have to get used to it and really identify the criteria that are relevant for your application to make it work, but it's a very good starting point for evaluation when you don't have key-value pairs, that is, when you don't know exactly what the output of your application should be. All right, that was the quick start on how to set this up and run your first evaluation. Now let's look into datasets, where it gets more interesting once we start working with actual key-value pairs: inputs and outputs, questions and answers. A lot of use cases, especially information-retrieval ones like chatbots over your own data or enterprise data, are really where this shines. At the top level, LangSmith currently has three dataset types: simple key-value pairs, which we will look into because they are the most straightforward; LLM datasets, which combine input dictionaries and output dictionaries; and chat datasets, which are similar but hold a series of inputs and outputs back and forth, like in a chat. For now, let's go over the key-value pairs.
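For reference, creating a dataset with an explicit type might look roughly like this. It assumes the langsmith SDK exposes a DataType enum in langsmith.schemas and that create_dataset accepts a data_type argument, so treat it as a sketch and check your SDK version.

```python
from langsmith import Client
from langsmith.schemas import DataType  # assumption: enum with kv, llm, chat

client = Client()

# Key-value (the default): arbitrary input and output dictionaries
client.create_dataset("KV Example Dataset", data_type=DataType.kv)

# LLM: a single prompt string in, a single completion string out
client.create_dataset("LLM Example Dataset", data_type=DataType.llm)

# Chat: a list of chat messages in, a generated message out
client.create_dataset("Chat Example Dataset", data_type=DataType.chat)
```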
I will go over these examples a little quickly, because they're pretty straightforward in the sense that they all work roughly the same way; I'm just going to show you a couple of approaches. First of all, you can do it in code, like we're doing right here, which is probably the best way to go about it, but you can also do it from the UI: go to Datasets and create a new test run or upload a CSV there. But I'm going to show you the code route first. We create some example inputs, and now we're using key-value pairs: we have a question, "What is the largest mammal?", and an answer, "The blue whale", and so on. These are our inputs. We give the dataset a name and, similar to how we uploaded the initial dataset, we create it with a description and run it. This goes pretty quickly; if we come to the UI, boom, we have another dataset. And look: before, the examples had only inputs, and now we have inputs and outputs. You can also see that, since we did not specify what type of dataset we wanted, it defaulted to key-value pairs. All right, next: you can also create a dataset from existing runs. Let's create a quick dataset again, an example dataset, and now we have an empty one. What we can do is get all of the runs for a given project; we have the LangSmith tutorial project, for example. If we go to all of our projects, the evaluations are put by default under the evaluators project, but there are a lot of runs in there, so let's look at the project we ran at the beginning for the "Hello, world!" example. We take all of those runs and put them into the dataset. Ah, I see, there's already an error: this is the error you get when you try to add duplicate entries to a dataset. If I go to the example dataset you can see it was working well, but since we ran the "Hello, world!" example quite a few times with the exact same output, it ran into duplicates; you can only have unique input-output pairs in a dataset, which makes sense. So that is how you create a dataset from an existing run. Here we're grabbing all of the runs under a project, but you could probably be a little more clever about it and strategically extract the relevant pieces of information; I haven't looked into that yet. We can do the same thing from a data frame. This is similar to the key-value approach, but we use the upload data frame method. Let's create this one and see what it looks like: we now have the questions and answers in a data frame, and we can upload that as well. Let's do that; it was called "My Dataframe Dataset". Go back to the UI and, boom, another input-output dataset.
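A sketch of those three approaches (key-value examples, examples pulled from existing runs, and a pandas DataFrame upload) is below; dataset and project names are placeholders, and the second question-answer pair is made up for illustration.

```python
import pandas as pd
from langsmith import Client

client = Client()

# 1. Key-value examples: an input dict and an output dict per example
qa_pairs = [
    ("What is the largest mammal?", "The blue whale"),
    ("What do mammals and birds have in common?", "They are both warm-blooded"),  # hypothetical
]
kv_dataset = client.create_dataset(
    dataset_name="Elementary Animal Questions",
    description="Questions and answers about animals.",
)
for question, answer in qa_pairs:
    client.create_example(
        inputs={"question": question},
        outputs={"answer": answer},
        dataset_id=kv_dataset.id,
    )

# 2. Examples from the existing runs of a project (duplicates raise an error)
runs_dataset = client.create_dataset(dataset_name="Example Dataset")
for run in client.list_runs(project_name="langsmith-tutorial"):
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,
        dataset_id=runs_dataset.id,
    )

# 3. Upload a pandas DataFrame directly
df = pd.DataFrame(qa_pairs, columns=["question", "answer"])
client.upload_dataframe(
    df,
    name="My Dataframe Dataset",
    description="Dataset created from a data frame",
    input_keys=["question"],
    output_keys=["answer"],
)
```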
Oh wait, we have one more: CSV upload, which is also very interesting. Here we point to a path in our project directory; in the data folder we have our dataset.csv, again questions and answers, the same stuff, but now loaded from a CSV. If we come back to the datasets in the UI, we also have a CSV dataset. So those are the various ways you can upload data to the LangSmith portal. All right, now that we have actual key-value pairs in our dataset, we can look into QA evaluation. There are three evaluators we can use for that: context QA, QA, and Chain of Thought QA. Chain of Thought QA is similar to QA, but it adds chain-of-thought reasoning; if you don't know what that is, it's a series of intermediate reasoning steps that has been shown to improve the performance of large language models. The docs also say this will get you the best results, but it takes longer and costs more tokens because of the intermediate steps. We'll look into all three, but basically the QA evaluator simply looks at the input and the output to decide whether the answer is correct, Chain of Thought QA does the same with more steps, and context QA looks at the context; they say this is useful if you have a larger corpus of grounding docs but not necessarily a ground truth answer to the query. Okay, let's see what that looks like in the code. We configure the RunEvalConfig again, similar to before, but now instead of criteria we pass in these evaluators, and then we hit run on dataset again and fill in all of the information. Let's run this and see what we get. All right, the evaluation is finished; let's go back to the datasets. We had the elementary animal questions, and here we can see the test runs. This is the one we just did; the total time was 40 seconds, and you can see we get perfect scores: correctness, contextual accuracy, and also Chain of Thought contextual accuracy. Now let's look into each of the methods we've used and at the prompt underneath. Let's start with the largest mammal, the whale. At the top level we have "What is the largest mammal?" and the answer "The largest mammal is the blue whale", et cetera, which gives some more context, and then we have the reference output, which is "the blue whale". That reference was in the dataset, while the longer text was the model's output, and you can see how it can sometimes be tricky to judge whether an output is correct, because it's based on a partial match of the overall output. Let's look at how this is handled by the evaluation model, starting with contextual accuracy. Here you can see what's going on: we have a query, a result, and context, and the model is instructed to say whether the output is correct. If we look at the actual prompt, here is the prompt engineering again: "You are a teacher grading a quiz. You are given a question..." followed by the question, the context, and the student answer, and then it grades CORRECT or INCORRECT. It's pretty interesting that they really instructed the AI to act as a teacher. They've probably experimented with this, and instead of saying something like "this is an evaluation tool to score large language model outputs", they made it really simple and understandable for humans, but probably also for large language models.
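Here is what that configuration might look like in code, using LangChain's built-in "context_qa", "qa", and "cot_qa" evaluators; the model and temperature are assumptions.

```python
from langchain.chat_models import ChatOpenAI
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

client = Client()

eval_config = RunEvalConfig(
    evaluators=[
        "context_qa",  # grades the answer using reference context
        "qa",          # grades the answer directly against the reference output
        "cot_qa",      # like "qa", but with chain-of-thought reasoning steps
    ]
)

run_on_dataset(
    client=client,
    dataset_name="Elementary Animal Questions",
    llm_or_chain_factory=ChatOpenAI(temperature=0),
    evaluation=eval_config,
)
```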
So this is pretty interesting: we have the question and the context, and then it is graded. Okay, that's the first one; now let's quickly compare all of them so we get an understanding of the differences. First we have the context QA; here you can see the prompt that went into it, and I'm not going to go over all of it, but if you're interested you can read through it. Then we have the QA eval chain, and these two are actually quite similar, just prompted a different way. Finally we have the chain-of-thought reasoning, where there is a larger prompt; there's simply more to it. Those are the differences, but all of them, if we look at the middle one for example, grade the output as CORRECT; there's a little more information in some of them, but we get the grade "correct". Coming back to the top level of your dataset, this is really where you can start to monitor everything: the overall scores on all the metrics, all the evaluations you've put in place. When you change something in your application, either in the application itself through prompt engineering or additional logic, or for example in your vector database as you add more examples to it, it becomes more and more important to quickly assess: does it still pass the test? Are the answers that were previously correct still correct? That is really where this comes into play and where it is so useful. You can also start a test run straight from the UI: you can select the criteria, even multiple at once, and it generates the code for you. I don't think you can run it from the UI, but it generates the code, so you can copy-paste it, adjust it, and run it within Python. This is all very powerful, and in the code I have some more examples of how to create these evaluations with custom criteria, both with labels and without labels, like we've already seen. The final thing I quickly want to get into is two other interesting evaluation metrics: embedding distance and string distance. I won't go into great detail on what these exactly are, but if you scroll down you can see how to compute the embedding distance and the string distance, and the GitHub repository has some more information on that. For the embeddings we can use the cosine distance, which ranges from 0 to 1, and the string distance works the other way around: 0 is an exact match and 1 is no similarity. These metrics were put in place to counter the fact that the evaluation scores we discussed earlier don't have a direction, meaning higher is not necessarily better: if the output is helpful we get a 1, but if the criterion is maliciousness we also get a 1, and that's hard to compare from a modeling perspective. The embedding distance and the string distance are ways to quantify similarity so you can monitor small changes over time.
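A sketch of running those two evaluators separately, as in the walkthrough that follows, might look like this; it assumes the built-in "embedding_distance" and "string_distance" evaluator names and the same animal-questions dataset.

```python
from langchain.chat_models import ChatOpenAI
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

client = Client()
llm = ChatOpenAI(temperature=0)

# Embedding distance: distance between embeddings of prediction and reference
embedding_config = RunEvalConfig(evaluators=["embedding_distance"])
run_on_dataset(
    client=client,
    dataset_name="Elementary Animal Questions",
    llm_or_chain_factory=llm,
    evaluation=embedding_config,
)

# String distance: 0 means an exact string match, 1 means no similarity
string_config = RunEvalConfig(evaluators=["string_distance"])
run_on_dataset(
    client=client,
    dataset_name="Elementary Animal Questions",
    llm_or_chain_factory=llm,
    evaluation=string_config,
)
```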
For example, if you have two answers that are both helpful, both scoring a 1 on helpfulness, they cannot really be compared, whereas with embedding distance or string distance you can have two answers that are both helpful and correct, but one might be better than the other because it's closer to the actual answer, if that makes sense. So let's quickly have a look at what that looks like. We first have the embedding distance, which uses the cosine; you can see it ranges from 0 to 1, where 1 is more similar. Let's run that on the elementary animal questions, wait for it, and then also run the string distance, where it's the other way around: 0 is an exact match and 1 is no similarity. All right, those are finished now; let's get back to the runs. Here you can see we have the embedding distance, and the other one should also show up in here. Looking at the embedding distance next to the correctness, and coming back to what we had going on, it's a range between 0 and 1, and it's quite interesting to see how this can be problematic in some ways: the answer is correct, we have the blue whale, but if we compare the reference output to what was actually generated, we get a very low score, simply because there is a lot more information in the generated answer, even though it is still correct. If you really want to build this out, you probably have to combine both: at some points you want a distance or string metric in place, and at other times you just want to evaluate whether the answer is correct, yes or no. Now let's look at the other one; okay, that one also came in. It's similar, just another way of looking at it; the score is a little higher here, but remember that it is flipped: zero would be an exact match and one is no similarity at all. All right, now what have been my observations so far using LangSmith? First of all, the logging functionality provides a really transparent and structured way to examine the outputs of large language models, so it's excellent for that; great tool. Creating datasets allows different evaluations to be made independently on different runs, and I also like that you can create different datasets, different projects, and different organizations, so you can really tailor this to your needs, your application, your organization. It's also very useful for evaluating and comparing LLM outputs against ground truths, using both the existing and the custom evaluation criteria; like I've said, we have quantifiable metrics, but the customizations are also really beneficial at some point. And finally, the custom evaluations are really helpful, but they can take up a lot of tokens. For comparison, if we go back to the first experiments we ran on the rap battle dataset, you can see we used 2,200 tokens, and if we go back to the elementary animal questions, where we were just running correctness on a similarly sized dataset, it's a lot less. And if we look at the cost, which is also something to consider, it's $1.30 just from today's experiments, and we only ran a couple of experiments on a dataset with four records in it. So you really have to be mindful of that and closely monitor your costs as you start to run these evaluations.
I can really see how, if you scale this up to hundreds of thousands or potentially millions of records, running an evaluation is going to get expensive, so you have to be careful about when to run it and make sure you create good subsets of the data instead of running it over everything. That is just something to keep in mind and something we all have to learn and experience: how this actually works when putting large language model applications into production and monitoring costs and performance over time. All right, and that's it for this LangSmith tutorial. I hope you now have a solid understanding of how the platform works and also why it's actually useful, I would say even crucial, if you're building applications with large language models. If you found this video helpful, please like it and subscribe to the channel. If you want to learn more and stay in the loop, check out Data Alchemy, completely free, link in the description, where I share the entire data and AI workflow I use not only to create these videos but also to complete the projects I work on for my clients. And if you want to learn more about what I do with Datalumina, you can check that out too, along with the Freelancer Mastermind we have, if that interests you. That's it for now; I want to thank you all for watching, and I'll see you in the next video.
Info
Channel: Dave Ebbelaar
Views: 25,642
Keywords: data science, python, machine learning, vscode, data analytics, data science tips, data science 2023, artificial intelligence, ai, tutorial, how to, langchain, langsmith, mlflow, llm evaluation, llm monitor, ai monitor, ai feedback, responsible ai, llm debug, degub, monitor, test, evaluate, ai app, llm app, langchain app, langsmith tutorial, langsmith demo, langsmith beginner, how to langsmith, data alchemy, data freelancer, langchain explained, gpt 4, large language models
Id: tFXm5ijih98
Length: 36min 10sec (2170 seconds)
Published: Thu Aug 24 2023