Monitoring LLMs in Production using LangChain and WhyLabs

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

hi everyone uh we're going to get started in about 20 second here today we'll be talking about um monitoring llms and production using Lang chain and ylabs I'm Bernice Herman I'm a senior data scientist at ylabs and um if you are interested in following along either with the slides or the Jupiter notebook um highly suggest the notebook um at the bottom you can see the links um there then we'll try to get the links pasted as well um and in the meantime uh shout out kind of where you're coming from what um what city or country that you're in um and what the weather is like the weather is beauti beautiful here in Seattle today and the last couple of days all right um we'll get started um so some introductions uh so who am I I am a senior data scientist at wabs um and do lots of fun stuff I guess here in the right um the other role that I have is a a um research scientists at the University of Washington where I do research on exactly what uh I like to think about across both jobs and that's really um evaluating machine learning and um llm models um in real life situations and use cases uh so we're going to talk today about I think a important version of evaluation really monitoring um llms in particular okay so what do I do at ylabs um in particular um so ylabs is this um well has a number of different products including two open-source packages one called wogs one called Lan Kit we're going to focus on linkit today um but kind of within this larger observability system um that that I like to view in this way so um you have lots of different data coming from different sources um and we would like to understand what's happening in that data in a way that is uh kind of privacy first um and so the the way that we do this is we use y logs which is a u kind of L or AI Telemetry um agent uh so we can do something like statistics on top of that data in your environment so we actually don't see a raw data we capture highle statistics and only those highle statistics gets passed onto um our observability platform um and from there we can do things like anomaly detection um visualization um machine learning over time to understand what kind of issues are going on um to do monitoring and alerts um and from there you can do lots of things so you can um alert the engineers the data scientists the the business team members um of what happening in your model potential issues that are happening in your model or U directly trigger pipelines within your system maybe um automatically retrain your system at a certain point or um send your data to do some sort of um human in the loop process or do some sort of Guard railing that allows you to take a little a deeper look at the data or prevent the user from seeing um some information that might be harmful or um not in line with your goals so where do I fit in in this well I work largely on anything that uh touches the statistics and machine learning Parts um and that's actually three parts uh so um this slide needs some updating here um so one is y logs so this is this AI Telemetry um bit so it's our open-source um python package that um has a lot of kind of heavy weights to statistics in it um in capturing those metrics and statistics on your data in a way that's privacy first um and then Langan kit which is another open source package uh that we'll talk a lot more about including the next slide um and then finally the side of things where we do the anomaly detection and the um monitoring um because we have a lot of machine learning that we use and statistics that we use to do so okay okay so now a little bit about linkit uh so in addition to just capturing statistics uh which is very important um these things get really hard when you're using text um and and using llms I and the reason for that is that the sorts of metrics and the sorts of things you need to think about it about are just much more difficult to capture uh so for example if I have numerical data if I have a regression model or something like this uh the outputs are numerical often um and so things like the average and the median and um you know maybe more complex statistics on top of those numbers uh can really capture and summarize what's happening in the system to the point that uh we might be able to look at those statistics and uh generate helpful alerts anomalies and um even kind of you looking at that and coming up with actionable next steps uh that differs quite a bit for text uh just because text is hard to summarize and uh describe and aggregate um and so what we need to do here is come up with new metrics um some metrics that are more kind of semantically meaningful that tell us something about the data uh that can be helpful and for us to use so if you look here at this example um of what we kind of expect to see for an llm which are a number of prompts so text coming in and then responses and those responses often um are texts themselves but could be other sorts of generative um output like music or code or images or so on um what we want to do here then is capture different metrics more advanced metrics um about the inputs and the outputs um that pertain to a number of different categories so in in linkid we think of these categories as quality sentiment security and governance that you hear see here on the bottom so what sort of questions does this answer um we kind of want to answer questions like how were my prompts and responses written um does this are these kind of very Advanced levels is this with an academic or a medical tone that sort of thing um are my prompts and responses readable and accurate to their design intent so something about reading level or um kind of how interpretable is the text um and then we might want to know the relationship between the prompts and the responses themselves so if you have prompts that are about one topic and responses that are about a completely different topic that um an aggregate is going to be certainly a concern um because this might tell you that our llm isn't response responding on topic in a way that we would expect it to um and then so that's quality but then we can look at things like sentiment so um many of you may be familiar with sentiment from um natural language processing right so this often is some measure of positivity or negativity in the language um and this can be helpful for a number of reasons um and applications so certainly if you have kind of a customer service chatbot or anything related to this um the question is is is my llm responding in a way that's in line with our expectations for uh the type of tone that we expect um our llm to be using or our users to be um using toward us um does my llm talk about or summarize things that I explicitly don't want or use tones that I don't want uh so this is something maybe more related to toxicity um so are we using toxic language um is the llm using tox toxic language and there's a number of questions related to security so is my llm receiving jailbreak attempts so jailbreak attempts are um often prompts that are sent to the llm um attempting to get the llm to respond in a way that the designers of either the llm or the application surrounding it which often uh we are um in ways that we did not intend so um if I have an llm that isn't intended to give kind of advice for doing criminal activity um I want to know are customers or users of my product or tool um asking questions that that are attempting to get around that I and maybe if they ask directly the llm is correctly saying no I won't give you a response to that uh but there's lots of many sneaky jailbreak um attempts um and approaches um to get around that and we we certainly want to know that that's happening and uh with what frequency and if that's increasing over time or decreasing or specific instances of it um is my llm leaking sensitive information so we know that um any machine learning model um that gets trained on trained on massive amounts of data um takes in and often memorizes little bits of that data uh just because there are so many parameters in such a model um and so that can include things like people's Social Security numbers or phone numbers or um Med medical diagnosis tied to a person's name or something like this um and what we don't want to see is that our llm is outputting a phone number or outputting a social security number or any of this stuff um as a response to a prompt certainly to a prompt that didn't ask for such a thing but maybe not at all even if you do ask um and then onto governance so I want to know um is the information reaching my llm in line with policy right um I want to know how much of this data is leaking either in the prompts so customers passing on maybe private information that they they shouldn't have um or certainly information that's coming from the llm to the customer um so yeah so now um if you haven't already I'd love to hear a little bit about you so that was all about me all about the products that I work on at yab I'd love to hear a little bit about you um pass in your name your company and rooll um or if you're job hunting um and then importantly how many models have you deployed in production um has your organization deployed in production and if so how does your team find out about issues in your model um and this can be either an llm um or a machine learning model in general uh we'll talk a little bit today about kind of the differences between those two I mentioned it already a little um but I think any of that experience is really helpful um and getting some context about where you are on that will be really helpful right I'm going to take a sip of water [Music] here okay thank you yanic um let's go forward to llms and generative Ai and feel free to join in the chat later if you'd like okay so at this point I don't think I have to do the uh sort of why uh generative AI or llms are important um I think we see lots and lots and lots of news about them and probably the impacts of them in our personal lives in addition to our kind of business lives um but one thing that is worth pointing out is uh the difference between generative Ai and um llms what do we mean when we say some of these terms um so the thing to point out here is that um while what we talk about most often these days are llms and language models um that is not the you know full extent of generative AI right we have code we have images we have speech um generation we have video 3D and other um so um this can be really helpful so we're going to focus today on text um but many of the things that we're going to talk about today apply much more broadly um not specifically Lang chain although um I think it's worth noting that even in the production of many of these other types of outputs um the input is often still text so often understanding text and using Lane kit the Y the ylabs tool um will still be helpful um for the input of the model even when it's something different but generative AI um has similar challenges okay so as we mentioned before a little bit um when we're talking about llms we're going to assume that we're interacting with prompts and resp responses so a prompt being some text input that asks a question or you know um starts a conversation in some way and then a response to that question um keeping that prompt in context right so um you often have to look at these two together and it's also helpful to compare the prompt and response to understand um the quality of the response so the challenge for language as I already mentioned is that there's just really complex tasks and goals um so for example um this might be a response I think this in particular happens to be a news a news article but let's say it's a response from an llm um it's really difficult to understand this maybe we can do some simple things like count how many words there are count how many characters there are so on and so forth um but we what we might want to do in understanding whether or not this LM is hallucinating or giving factual information is understand in this example how many different specific entities are in this text so um named entities are things kind of like proper nouns so people dates organizations uh that sort of thing um and having tools that track this having some Metric that tells us okay these are the named entities or these are the number of named entities can be really helpful um so then we get to other challenges um so named entities is something that in natural language processing people have worked on before llms um that's not quite the case in some of these other challenges so this is an example of a hallucination uh so this is a prompt that you see at the top I'm asking to summarize this article and then they pass over an and URL um and then the llm gives a response to this um and uh we don't quite know if this response is a good response does this describe the article or not well in this particular example the reason that it made its rounds in the internet is um that this article doesn't exist this was just a URL that's completely made up if you copied it over into um your browser it won't work there was no New York Times article with that title that day um and so what this tells us is that the llm looked at the words in the title right in the URL so chat GPT prompts to avoid content filters and took that um and generated a hallucination hallucinated a um a whole description for the article right and so the question is is how do we think about these things statistically how do we find them how do be monitored for them um because llms are certainly capable of doing something like this okay so now that we've talked a little bit about llms um and generative AI broadly let's think about evaluation monitoring and observability so evaluation is a term that you're um most likely to have seen um it's something we all think about um and learn kind of as we're training as data scientists um an evaluation is about understanding the quality of a model um so for this model here uh well for a model um often you might answer questions like how well does this model perform on test data that I've held out from um my um data set um and what you need for this is often labels so you need to understand well what is the right answer um what is a ground truth label that we have and then we can use that to understand well what percentage of the heldout data does my model get correct this is what we often think of as accuracy or many other questions about um for example recall or Precision or many other metrics that can be helpful to evaluate but the key for many of these is that evaluation is about understanding is there a problem and to what extent is there um a problem with the quality of our model and it requires us having some ground truth some right answer to compare our model's performance to this differs when we get to the world of monitoring so um evaluation works really well in experimental settings um and in settings where we can get ground truth um but often in a production use case we have an evolving data set so we see this here um on the left we see an experimental Mach machine learning setup we have our train data set we have our test data set um but for production machine learning uh we often have let's say a daily data set or every five minutes we're collecting new data um or every week we're collecting new data um and so we might not necessarily have labels or ground truth um instead what we have is change over time I and monitoring is about kind of analyzing changes in quality over time um and what this is helpful for is one noting when these issues take place right so if I know that our model has been pretty similar in um the responses or the distribution of data that we've seen until three days ago um where things changed drastically this might tell me a number of things maybe there's been a change to the model or changed the outside world or some sort of seasonal pattern or sale that's gone on um but this now starts helping us to get down to what could the issues be I me those issues could extend beyond the model um so um I have a question from um Marsha um hopefully I'm pronouncing your name right um so this is a really good question it's um once you use heldout data to evaluate is it then unusable going forward and you need to get more so uh this is true theoretically uh so this is what we called um what we call adaptive data analysis uh so this study of well if we're holding out a data set and we've now evaluated using this model or sorry using this data um won't we leak some of the knowledge that we've gotten for this data into the decisions that we've made maybe into the model itself so on and so forth so theoretically you would like held out data that like hasn't been seen or evaluated on S uh before um because what could happen um is a number of different issues um so I think the main concern is that you've looked at the results of this and any changes that we've made to the model are dependent on those results so the worst case scenario is that we you know test on or sorry we train on test data that would be bad um then we're not really testing our model and being able to generalize on new data um but assuming that we're not doing that um it can be okay to let's say look at the same test cases every week or every you know every time they do a release uh that can be okay um but what we want to make sure is that we're not making changes to the model including kind of individual personal changes um based off of those results ads um so for example if we have a basket of data that we're using over and over and over and over again um and we're taking those in that information and tweaking the architecture or hyperparameters for our model um we're going to start to get away from generalizing because there's information that's feeding back so all of that said that's kind of the theoretical um speech about this uh there are many many many cases in the real world where people have done some sort of data leakage of this form I andan it tends to generally be okay is it the best thing to do not not necessarily the case um but I think there are some interesting kind of studies about well you know we do this in practice a lot kaggle competitions um any sort of these challenges these sort of kind of regular benchmarking tests um and they don't seem to infect affect the models um as much as we theoretically think they would so um if that's all you have I say absolutely go for it um but there's a lot of theoretical work that would say that you probably shouldn't use that held out data all right so um going on from monitoring which where we're looking at quality over time um let's go over to observability so now instead of just looking at our whole to set across many uh rows of data over time um we might also think about observability which is pinpointing issues within your system so now we're looking at um maybe multiple points within the same data point that we're trying to run our model on um maybe before we do some feature engineering or um after we've done some Transformations and after we've done predictions we want to capture um statistics about our data so not only can we understand um overall in time when there's been changes to our data but where in the actual process of our system do there seem to be changes injected um and this I think is the kind of Holy Grail of um really understanding your system pinpointing where it may be within the system um in addition to just time and that's observability and that's what we strive for at ylabs okay um so the problem is that capturing signals in your model quality is very difficult uh so I talked a bit about statistics um so even capturing the right statistics can be really hard um so you know certainly we can do things like a mean and a standard deviation or median um and other kind of ranked um things um but this isn't quite enough um and there's many reasons for this um but a lot of this comes down to um taale events and understanding kind of rare cases so it's not quite enough to understand what the average case is uh when we're monitoring what we're trying to understand more is um what are these rare cases and when are they happening why are they happening so that we can find the issues that are causing these more rare cases and predict kind of issues um before they come up in the future so statistics gets us one uh one place um but doing more advanced work uh so we use data sketches in our case which is um a kind of advanced technique for capturing um a mix between statistics and a sample um with kind of accurate error bars uh this is really important to us um but going Beyond statistics um especially as we kind of talked about for text uh we need we need to start thinking about other measures so one measure are things like data quality so all of the stuff that we talked about for text things like capturing um the sentiment capturing the presence of toxicity or many other kind of text specific metrics um there's also performance metrics so if you are lucky enough to have some ground truth data for your data over time um maybe because you invest in getting uh some labels and it doesn't have to be all all of your data labeled but maybe some small proportion of your data labeled or you find some other kind of um proxy metric or you just happen to be really lucky and work on a problem that naturally produces um labeled data over time um then you absolutely want to use that um and that's another way to start understanding your model quality um but finally sometimes we don't have any of that um might have some statistics about the inputs and outputs um but we need to augment this with business kpis or things like this uh so it's not directly tied to the model itself but it's tied to the impact it may have on customers right so um maybe I'm using a particular model to interact with customers and I find that customer engagement with my chatbot is much lower um this isn't directly tied to any responses of the model but we um with enough patterns we may start to rely on this to better understand our model quality okay so finally uh we're going to talk about how we combine with Lang chain um and Lang kit the the tool that we use um how that works so we'll get to you some code here but I just want to give some context up front uh so first um we are as I mentioned a big fan of Open Source software um and we think open s Source software is crucial uh kind of in the space of um AI in general um many many many of the tools that we use um in the AI space are open source um or rely on open source software um and for fun um I have some pictures down here of different open source um or open initiatives um and tools that we use in the AI space so um if you can name any of those uh feel free to do so in the chat um but the commonality that we have between um well we have many commonalities but uh the commonality that we have between um Lang kit at ylabs and Lang chain are that we're both open source um and that we can work together easily okay so let's learn a little bit about Lang chain uh and the I I would personally describe the benefit of L chain is to reduce the internet or sorry the engineering boiler plate that's needed to interact with llms so if um if you've made an application using llms there's a lot of um interaction that you have to do through API um and there's a lot of boiler plate that starts to grow as you make more advanced applications that rely on llms so for example um it's one thing to pass in directly a string of texts from a customer to your llm I would not suggest that customers say crazy things llm say crazy things um so then what you're often going to do is you're going to build some templating about the prompt um both to kind of filter through what the customer says but also to generate the outputs that you want um it's our job as creators of applications that use llms to form kind of what the customer is saying or what the user is saying into a question that's better suited to get the response that we know they're looking for so often what you see are things like templates and um retry logic and lots of this sorts of stuff and lying makes it really great to build applications uh through this sort of composability so that you write templates you can use those templates to call llm you may be able to um you may need to call multiple llms um that sort of stuff um I won't go very far into the Lang chain framework um there's many um tutorials and things online on Lang chain specifically um but there's a lot of different concepts and structure that's really helpful here so um these are some major components in Lang chain uh so models uh we've talked quite a bit about llms and other um other models such as Tex and embedding and so on um prompts so different ways to structure a prompt different um styles of prompt there's user prompts and system prompts and templates for those prompts etc etc um indices memory um so certainly when we think about um llms one example that comes to mind are things like chats um so in those cases we have a message history we have not just one prompt and a one response but we have series of prompts and response that all kind of live in the same context and require some memory um then we have the chains themselves right so um again we're calling multiple um calls to the llm we also could be calling multiple llms um and then finally the agents so um anything that's acting on um well one acting to the llm helping to connect to the llm or acting on the response at the lln okay so now how does this relate to Lan Kit that we talked about earlier well it turns out that uh we're already integrated into the Lang chain package so uh you've already downloaded uh ylabs link kit um in your download of link chain so we see here that there's a ylabs callback um and this WAP call back is in Lang chain which allows you to very simply set up um using Lang kit um while you're using Lang chain and we're going to see code um in a little bit but first let's look at some on the slide to kind of understand so um so what do you do so when you import Lang chain you can also import the ylabs call back Handler um and then when we're calling um our open a AI method here or class here I believe um we can pass in all of the stuff that we're used to so temperature um later we would pass on a prompt so on and so forth um but then we're passing in a list of callbacks including the yabs call back um and this is all you would need to do uh to do so um okay let's uh let's go ahead and well I'll do a quick summary and then we'll jump in so again y logs is the um the open source package that we use to do that kind of statistical profiling um and to create the Telemetry that we pass on um yeah let's let's jump in uh to the demo here and then we're going to jump back into that okay so um again there are links in the description about how to get to this demo yourself if you want to run it in parallel with me um but let's go ahead and get started so the first thing that I'd ask you to do is um go to um open a free ylabs account account um and so there is a link I believe also passed around um here so yabs airfree um and when you go to that link you can click right here on get started for free and sign up um I'm already signed up so I'm going to log in here um but but uh sign up is very fast um typical of internet thing so um your email address and then they'll get an email um to authenticate that um and then you'll be able to log into your account uh so I'll walk you through the account a little bit here oops too big okay so uh what we see here is our organization you you won't quite have any resources here yet so what you can do is actually click on the global demo organization um and see some examples of models that are already up and running so for example we're going to look here at our demo llm chatbot so when I click on this um I see um a number of things I can see that there are 47 data profiles that were uploaded for this um and that it's an hourly model um many different information about the health um the um different metrics that are available in our data set and so on so what I'd like to do is actually start here with explore profiles um so that I can click on the right range of data here so this is an older model that we're going to look at here um and we are in our inside Explorer uh so this um both tells us a little bit about the profiles of data ourselves so the statistics that are being captured um and so we can see prompts we can see responses and we can see many many um different metrics that are automatically captured using Lan Kit on top of our prompts and responses um and if you click into these we can see our distribution here um box spots some example data if you've chosen to set that up um and so on and so forth okay um the last thing I'll show really quickly before we get into our uh creation of this data by hand um is that if we go into dashboards um when we're specifically on an llm um model um project we can see different things so um we can see this dashboard here for security and as well as performance um so in our security dashboard we see things like um has patterns which is going to tell us about the data leakage in our model jailbreak similarity so this is uh for example the 99th percentile um of the similarity score the cosine similarity score of um the text that is being passed in for prompts so we see the same thing for response below um but for a prompts uh two known jailbreaks so what we see here is that even the 99th percentiles even the uh top 1% of most concerning um prompts are not terribly similar to um a jailbreak attempt until this last day where we see um a higher percentage similarity to jailbreak attempts so this is something that's worth noting um and something that we caught you know within that hour um similar for different other things so a sentiment for example uh we see fairly neutral but positive sentiment uh throughout this time and this is for the mean um and then the mean sentiment drops precipitously so this is saying um that you know the the sentiment really has dropped in this last hour somehow um and so on and so forth um and toxicity unsurprisingly has gone up right as the sentiment has gone down okay so these sorts of insights uh can be really helpful um and kind of the importance of this um monitor during process for your llms okay so let's um hopefully you are able to have gotten into your uh account by now um and what we'll need to do is go to um our settings here so click on this hamburger menu settings and you can go to um access tokens down here um and if you don't have one I do um if you don't have one uh you can create an access token here uh so give it a name if you would like give it an expiration date um and hit create access token and you can save this um because we're going to use this to um do our demo um and I think you will want to make sure that you're on your organization um that you have control over not the demo so I'm going to go here then go to settings oops not user management access tokens and then you can create your access token okay so um once you've done that um we will get back to to our notebook here and uh just a test to make sure everything's working well we have our hello world of course um so um for those who aren't familiar with Note uh Jupiter notebooks or collab notebooks um a shift enter will run the cell that you're on and move your focus to the next cell so the next thing that we want to do is setup so as I mentioned um The Lang kit um call back is actually already included in Lang chain but we're going to log into these separately because I want to show in a little more detail how these things work um the very last couple of cells of the notebook allow us to do it the fast way but I want to do it the slower way so that way can understand actually what's happening and show how to exert some more control over the that you select so if you run the cell um to pip install linkit with the all extra here um and then pip install Lang chain and this is going to take probably a minute or so um so I will give you some time to do that okay as it's running um as I mentioned um you should grab your keys uh that you just received from the wab ylabs platform uh so I like to store them in a text file on my machine so I'm going to go ahead and grab those as um you are installing Imports here okay so uh once you have your key and um you have it copied here uh there's a couple of things that we need to do uh we just need to set environment variables so hopefully this is run um if not give it a little bit of time um but then what we're going to do is we're going to set a couple of environment variables that we need uh to run through this demo so there are four uh the first one is our default or ID so this is the name of the organization ID um the way that we can find this is if you go back to our ylabs platform here on Project dashboard again make sure that you're in the my organization and not the demo org right now um there's a couple places you can see it so it's actually up inside of the URL if you want to copy it from there but I find that the easiest way to see it is actually to go to settings um and then I believe in Access tokens right here we will see our organization ID highlighted in a little gray um box rounded box so uh what I'm going to do is I'm going to copy this and paste it in uh when we run this the next thing that we need is our default data set ID um so let me actually show you where to get that as well so if I click on yabs um I can create a new resource um for our new model so I'm going to hit create resource I'm going to give our resource a name here I'm going to delete one of my models first just so I have some space um so we're going to give this a name so let's call this Lang chain plus Lang kit no plus let's just do L chain L kit demo um and we are going to make this a large language model resource type um we'll only add one daily is fine um since we are all are all using the free plan here um and then we're going to add models or data sets Okay so what we see here below is that um it's been given a an ID of model 31 in my case in your case probably model zero um if you've created this new um so that's what we're going to pass in down here and then so actually let's go ahead and run this we'll do this one at a time here so I'm going to copy in my organization ID I'm going to copy in my my model number so my case is model 31 yours is probably different um I'm going to copy now my API key so this is what um I copied what I created when I created my access key um and it doesn't show up again so if you didn't get a chance to save it um you can go ahead and create a new one um so in Access tokens um and just create a new one if you'd like um but hopefully you were able to save it I was able to save mine so I'm going to copy this over as well and then finally um any open API or open AI API key you have so this will allow us to actually call um call open AI uh to run that particular demo if you don't have an open AI API key that's totally okay okay U we can leave this empty for now but I do okay um so I have a question here I'm happy to take it uh the question is um so Lang chain can combine some models does that mean you're taking the final result um and that is up to you so uh there's different ways that you can think about combining models uh one is is um taking the responses so often um in Lane chain you will either call the models one at a time so I might call a model um and have some sort of function that determines whether or not I'm happy with that model and then I may call a backup model or so on and so forth um so in that case right I'm taking the the response of the last Model um the one that I was happy with um but there are other situations in which you may want to call multiple model models at the same time and then it's up to you on what sort of logic you want to do with those different responses maybe you want the shortest response or maybe you want um to combine them somehow yourself um but you would do that in a kind of custom way okay um hopefully this is um works for everyone here so we have our environmental variables U let's go ahead and get started so the first thing that we're going to do is we are going to import um Lan Kit um but particularly the llm metrics module um so we have a number of packages of metrics uh that can be helpful to users llm metrics I think is the more standard one um if you have a maybe slow machine or just want to get through this quickly um light metrics is much faster um so it will get rid of some of the more expensive metrics that we have that rely on language models themselves um and but still be quite helpful in determining things like data leakage and um other other readability scores and other things like this and and finally we're going to import y logs so y logs again is this um open source package that does this Statistics collection for us this Telemetry uh so once we import these two maybe I should have separated them in lines uh we can initialize our llm metrics here um and this will take a little bit of time um but what this is doing is downloading all of the models and um information we need to just prepare to collect metrics um and so this is kind of a um a setup and getting everything spun up so that uh things are hot when we're actually calling with Tri okay we not go with the custom snippet here or the custom widget here okay so once we've imported um this uh you'll see that we've stored our um result of our initialization in schema so this will um be passed into calls to Y logs uh so that we can know which metrics and which schema to use um for collecting of these statistics particularly for llms in our case okay so now we can call y. nit and what this is going to do is help you to find the right session uh so if you've inserted your um your session information correctly uh this should work well it should initialize a session type ylabs um with the information that we passed in if this for some reason didn't work um you're allow you're welcome to um it might ask you a question of what of one or two to use ylabs info or to use an anonymous session in ylabs uh feel free to type to if if that's the case Okay so let's get further into our demo here um what we're able to do now we're able to um download some data so we um before we get to the open AI example I just want to show some data that we've already loaded into the package so that you can play around with it um and this is a lot of stuff in one cell so I'm actually going to delete this one first just because I want to show you the data itself um you're welcome to run this all at once but um so what we can see is that we've um loaded some chat data um and there were 50 records in our data and then we're showing the first chat uh this is actually right here let's push this up a little bit this is actually just a data frame of um two different text Fields so one being the prompts and one being the responses so um these are real responses uh that were received um I believe from GPT 3.5 although my memory is not serving me well right now um and so you can see many many different um chats here and we have 50 um so these are prompts that uh were collected from the Internet or generated by the team and then responses from the chat GPT or GPT 3.5 particularly okay so how do we use wogs um so again uh wogs is what's going to capture these me for us and Link kit is what defined some of these metrics specifically for llms so what we can do is y. log pass in our chats data set um pass in our schema that we initialize our llm metrics with um and then just give it a name oops so the name right here is linkit sample chats all that's totally fine um if I run this it's going to go through all of these lines and log um all of the metrics that we talked about and thought about for our model okay so we see that we aggregated 50 rows into this profile and now we can click on this link that was given to see that profile so we see here uh we're launched right into that insights page apologize this is a little small now um and we can see many things so we can see all of the metrics that were passed in so despite the fact that we only had prompt and response uh we see the many metrics that were calculated on top of that including character count difficult words reading scores um things like finding Social Security numbers and mailing address and phone numbers and credit card numbers inside of the responses uh which can be a concern um and we have many insights here so if you see this insights um we can see many um things that were flagged as possible issues um into our data or just insights into our data so things like our prompts including some mailing addresses and two examples um a High um jailbreak similarity score of 0.31 which isn't too bad um a reading ease score of 58 out of 100 um which implies that some of the responses from the llm are difficult to understand refusal similarities that sort of thing okay so we just used linkit uh for the first time this is awesome uh now let's start thinking about using it with Lang chain so um for those who are familiar with Lang chain you've probably seen something like this before again there's many many different ways to use Lang chain um but what we have here I'm going to import a number of different Tools in Lang chain chat prompt template um system me message prompt template and human message prompt template um and a couple of schemas for messages so um what are we creating um so in this example we have um we're trying to use an llm to um rewrite user text into more happy and upbeat text uh so what we do is we're passing uh we're writing a template um using a system message here that says exactly that telling the llm what the contacts here and then we're going to pass in um the text from the user so we're just going to in our case just pass that directly um passing on our text but in many other production cases you know there's a lot of complexity to how we're going to generate this text okay and then from there uh we can use our link chain. chat models chat open AI um and assign that to our llm variable okay so now um we can do um we can create a function here to take our prompt take our template and we're going to pass uh the new prompt from the customer for example um into our LM function um and print the response and format the response a little bit and print the response so let's define that function here um and let's just give an example so one example is I Don't Like Mondays or Tuesdays for that matter today is Tuesday um and although I'm no fan of Tuesdays but we'll see what the llm says um and then we're going to pass in our template which we named template a Beat up here um and run this so I think I might be calling this in a slightly old way um but this still works for us here and so we can see our prompt and response is um slightly more a beat I thought a fan a big fan of T Mondays or Tuesdays either that's slightly more a beat um Okay so um so that's what we that's what we can do with Lang chain this was just an example of how to use Lang chain individually before we think about combining with link kit so one option that we can use is actually to use y logs to um take our link kit schema kind of like we've done in the past and profile in the same way that we've done in the past um I won't do that here um I'm I'm going to kind of jump into the combination of the these sorts of things um one thing it might be worth noting is that um if we run multiple problems here this is easy to do so we're just going to make a loop here um where we track oops that's what happens when I don't run everything so um we can take our prompt and response we can create a new example here um and then we can both look at uh the sort of data that we see inside of our profile um but then also add more data to that so I'm going to just do that here um okay and then we see this that this profile has extended so now we have four responses and four and three prompts here um and many many statistics on top of that you could look further into this so it's worth noting that um these profiles themselves contain many different metrics that are accessible individually so if I wanted to go and get the response. agregate reading level column and grab the maximum from the distribution statistical distribution of the data I can capture that so this is the max response reading level um and we talked quite a bit about ylabs um but let's go ahead and look right here at uh the code again we saw them in the slides but the code to do this all much more simply with the laying chain callback integration so again what we can do here is the ylabs Callback Handler we're going to import that we're going to import open AI um and now we can um set up our callback Handler here with the just default parameters and then we're going to create an open AI um class here with a temperature of zero and then our call back for ylabs and now we can do the same sort of thing but much faster so we're going to go lm. generate which is a lang chain function um but that is going to be using this call back and um we print the result here and then what we use is um ys. flush so this happens every 20 minutes but we don't have 20 minutes to spare right now um and this has upload uploaded a um data set cool um so this is the whole demo um feel free to check this out here so uh we now have many many different profiles um over here at yabs um please do check out the tool um and feel free to um hopefully use Lang chain and Lang kit together much more easily um and if you have any other questions about how to monitor llms how to think about um monitoring complex llm setups happy to chat about that thank you

Info

Channel: WhyLabs

Views: 268

Rating: undefined out of 5

Keywords:

Id: W8OvQUwdBD4

Channel Id: undefined

Length: 62min 20sec (3740 seconds)

Published: Wed Apr 03 2024