Monitoring LLMs in Production using LangChain and WhyLabs

Captions
Hi everyone, excited to get started. As we get going, please put your location in the chat, and tell us what brings you here. Are you interested in LangChain? Have you used LangChain before? Are you new to LLMs, or are you pretty experienced with them, maybe even in production? I'll give it maybe 30 more seconds and then we'll get started. I see someone in sunny Florida, which is nice; I wish I was in Florida.

All right, we'll get started, and hopefully folks will join in on the chat as we start a conversation about monitoring LLMs in production, particularly using two open-source tools, or tool sets at this point, since it's a wide ecosystem: LangChain and its related framework and tools, and WhyLabs and its related framework and tools, including LangKit.

The first thing to note on this slide is that there are a couple of links here. There's a WhyLabs platform sign-up, which is completely free; we'll use it in our demo, so if you want to get that started now, go ahead. There's a link to our Slack group, and then for today's workshop there are the slides I'll be going through and the Colab notebook with the code for our demo.

Introductions: I'll start with myself. I'm a senior data scientist at WhyLabs, and I do all sorts of things related to machine learning: thinking about what metrics we need to measure when we're monitoring and understanding models in production, working on monitoring algorithms, and contributing to our open-source tools such as whylogs and LangKit. I also do research in this space at the University of Washington here in Seattle, which is the stuff I do for fun on the side, and I think it's really fun to see both sides of things.

Diving further into my role at WhyLabs, the whylogs side of things looks like this. You have a production environment, and there are many different ways data can arrive. We think about data science, machine learning, and AI in a fairly data-centric way, so we're often asking: what are the types and forms of data that go into and come out of this process? It might be batch data, streaming data, online data, or data in a feature store; lots of different formats, shown in the blue boxes on the left. whylogs is a tool that packages that data, or rather metadata about that data, into a small profile we can use as telemetry: store it, save it, and analyze it over time to give us observability into our machine learning and AI models. whylogs is our open-source tool for that, and it doesn't store your actual data at all if you don't want it to. What it does store is distributional information, so you can understand what's happening in your model, find outliers, that sort of thing. Those profiles can be passed on to the WhyLabs platform, and I highly suggest you do that, because that's where we can think about these profiles not just one at a time, but en masse: across all of our systems, all of our days, hours, or five-minute buckets.
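As a minimal sketch of that profiling flow (the DataFrame and column names here are made up for illustration; the whylogs calls are the standard ones):

```python
# Profile a batch of data with whylogs: it computes distributional statistics
# (counts, cardinality estimates, quantiles, ...) without keeping the raw rows.
import pandas as pd
import whylogs as why

df = pd.DataFrame({
    "feature_a": [1.0, 2.5, 3.1],
    "feature_b": ["x", "y", "x"],
})

results = why.log(df)

# Inspect the telemetry locally as a DataFrame of metrics per column.
print(results.view().to_pandas())

# Optionally upload the profile to the WhyLabs platform
# (requires org/dataset/API-key configuration, e.g. via why.init()).
# results.writer("whylabs").write()
```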
From there we can start to debug what's actually happening and what's actually changing. Are we seeing some sort of drift? Are we seeing anomalies compared to what we've seen in the past? How does this model in production compare to that other model in production, or to our training data? We can visualize it, and build team workflows to get alerted and react to these changes. These are the two parts I mostly work on as a data scientist: whylogs, the fairly intense statistics needed to take all of that data, pull out the important metrics, and compress them in a way that's genuinely useful; and the monitoring side of things, the anomaly detection and trend finding in your data once you've uploaded it.

That's where we started, and the next thing we came out with is much more directly related to LLMs: LangKit. For those who are maybe not familiar with LLMs, a quick reminder. LLM stands for large language model, a term that's actually been around for quite a while, but what we typically mean today is a generative AI model that takes in some input, often text, usually in the form of prompts like we see on the left, and outputs a response. Most often that response is also text, though it could be an image or many other things; we're going to focus mostly on the case where the output is text, since that's typically what people mean when they say LLM.

So the question is: once you use these models, how do we think about monitoring them? How do we understand what's happening with these systems? We'll talk about some interesting differences in this world, but we think about a number of things, and we've created the WhyLabs language toolkit, or LangKit for short, to dive into some of the most important questions when you have one of these LLMs, or an application around an LLM, in production: the quality of the text, the sentiment, security (toxic content and things like this), and governance, understanding where this data came from.

Here are some of the questions you might ask if you have one of these systems in production; they certainly come up when I'm working on one. For a particular day's traffic: how were the prompts and responses to the application written? Maybe there have been longer responses or longer prompts over time. Longer prompts might signify that users of your application have changed the way they interact with your system; longer responses, or differences in responses, may say something about a change in the underlying LLM you're using, among many other things. Are your prompts and responses readable and accurate to their designated intents? So not just the length of responses: we actually want some sense of how readable the text is. Is it using really complex language or really simple language? Is it very stern and direct, or mild? What you want there certainly depends on the application.
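LangKit computes reading-level and length metrics like these out of the box; purely to illustrate the underlying idea, here's a standalone sketch using the third-party textstat package (my own assumption for the example — LangKit's actual implementation differs):

```python
# Rough per-text statistics of the kind discussed above.
# pip install textstat
import textstat

prompt = "Explain how gradient descent works in simple terms."

print(len(prompt))                            # character count
print(textstat.flesch_kincaid_grade(prompt))  # approximate U.S. reading grade
print(textstat.difficult_words(prompt))       # count of "difficult" words
```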
I won't go through all of these, but a couple of interesting ones might surprise you by how much they matter. One is understanding the sentiment in your text, both in and out. Depending on your application, it might not be directly important to you, but what we find is that differences in sentiment are another important signal of changes in either your overall system or the users of your system.

Then there's a big one that comes up, one we have many separate workshops on and which will be just a sliver of today: security. One class of security issue is something like a jailbreak. If you've used an LLM, which hopefully you have (we'll get to use one today), you may ask a question and get a response that looks something like "sorry, I can't give a response to this" or "sorry, I can't do that." This is called a refusal from the LLM. But users may be clever and try to get some response that isn't a refusal, even for things like toxic language; the example I often use is asking how to hotwire a car. Many of the LLMs on the market, though not all, will decline to answer that, or decline medical questions, but you may be able to change the way you've asked to get a response anyway. This is really important even as someone who builds an application around that LLM, because you want to understand how your users are using the system, and you're going to want to track where the application has broken down along the path. Sometimes that break actually happens because the response from the LLM, even though it's plain text, is actually one of these refusals. So finding out what percentage of people are attempting to break the system, and what percentage of the time these refusals occur, is super important.

So is finding sensitive information. Maybe people put in their phone numbers or social security numbers, and you want to catch that before you even send it to the LLM. Imagine you're calling an LLM you have no control over, say a proprietary one like OpenAI's: you would be responsible for taking that customer information and passing it directly on to another company, and you really want to prevent that. How might you do that? You might use tools like LangKit to find these sorts of sensitive information. And then I'll skip over to governance: questions about the policies I have in place and the path through which these kinds of decisions are being made.
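To make the refusal and sensitive-data checks concrete, here is a deliberately simplified sketch. The patterns below are illustrative inventions of mine, not LangKit's actual rules; LangKit ships curated pattern groups and refusal checks for this:

```python
# Simplified pre/post-checks around an LLM call, using only the stdlib.
import re

REFUSAL_MARKERS = ("sorry, i can't", "i cannot help with", "i can't do that")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def looks_like_refusal(response: str) -> bool:
    # Track what fraction of responses are refusals, even though
    # they come back as plain text like any other answer.
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def contains_sensitive_info(prompt: str) -> bool:
    # Run this BEFORE sending the prompt on to a third-party LLM.
    return bool(SSN_RE.search(prompt) or PHONE_RE.search(prompt))

print(looks_like_refusal("Sorry, I can't do that."))      # True
print(contains_sensitive_info("My SSN is 123-45-6789."))  # True
```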
So I'd love to hear a little bit about you in the chat, on YouTube or LinkedIn, wherever you are: share your name if you want, your company, and your role. I think this is important, because some of us are going to be data scientists like myself, maybe a little more oriented toward the statistics side; some of us are machine learning engineers, or engineers more broadly; and some of us are on the business side, or in many other roles; maybe you're job hunting right now. It's really nice to get a sense of the group we have and where people are coming from. Along with that, share whether you have any models deployed in production: in addition to playing with LLMs, have you built some application around a model, or deployed an LLM yourself, which is a lot of work? How does that go, and how do you find out about issues in that system? That's the large challenge we're always trying to solve. I see some aspiring data scientists, nice. Please do keep filling that out, but I'm going to keep going here with a little introduction to LLMs and generative AI before we jump into the open-source side of things with LangChain and LangKit.

Luckily this gets easier and easier every time; I'm sure many of you have heard of LLMs. We used to talk about many generative AI systems, like Stable Diffusion; it's crazy that February 13, 2023 was such a different world. I don't think we even talk as much about Stable Diffusion today, because many of the language models have jumped up so much. But I think it's worth mentioning, because even though we frame this conversation around LLMs, or text specifically, there's a strong relationship between language and what we often mean when we're talking about many of these issues and this kind of tracking, and that is generative AI more broadly. Generative AI doesn't have to be text, like we see on the left of this graph, although that's what we're going to focus on today, and frankly what lots of people focus on. It all extrapolates out to things like code generation, image generation, speech generation, video, 3D models and scenes, and other applications. Lots of the issues we run into are general to these generative approaches, and we'll talk a little about what makes them hard, but we'll also talk about text-specific issues. Again, we're going to focus on LLMs, which usually means text prompts as inputs and text as output, although an LLM technically could have many different types of outputs.

I want to give one example of a complex task, a complex goal, and why this might look so different. For anyone in chat, feel free to share what task you think this picture is trying to represent; I'll describe the task and then give it a name in a little bit. In this task, we pass in some text, a couple of paragraphs here, and what I'd like to know is a label for certain phrases within that text, so that I can understand: is it a person? Is it an organization, or a date? What sort of entity is this? That might be one clue to the name, which is entity recognition. So what makes entity recognition so hard? For one, it's not simply classification of each of the different words in the phrase, although it looks very similar, very close, to that. It's actually a phrase-finding task: we're looking at multiple words, combining them together, and choosing an entity for the phrase from a list that we have.
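For a quick taste of what that task looks like in practice, here is a sketch using spaCy (an assumption on my part for illustration; it presumes `pip install spacy` and `python -m spacy download en_core_web_sm`):

```python
# Named entity recognition: labeling *spans* of text, not single words.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Satya Nadella announced a partnership with OpenAI on January 10, 2023.")

for ent in doc.ents:
    # Each entity is a phrase of one or more tokens with a label such as
    # PERSON, ORG, or DATE -- exactly the phrase-finding problem above.
    print(ent.text, ent.label_)
```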
The funny thing is, that isn't even at the level of the generative AI challenge. For generative AI we often have an even harder task: not just going through and labeling flexible spans of data, but generating that data. Here's a, let's say maybe-true, example of generative AI: we have a request to ChatGPT to summarize an article, passing a URL, and ChatGPT provides a response, a paragraph describing the article. One thing this shows is that the task is quite difficult: there's not a lot of information in the request, and your system may or may not be able to go look up that URL, take the text from it, and come up with this summary.

The reason I highlight this, and why it shows the complexity so well, is that this response is a hallucination. The URL it points to does not exist, or at least didn't at the time of the search; it's a completely made-up URL, and the LLM gives a response anyway, even though there's nothing at that web page, because the page doesn't exist. Why does that happen, and how do we detect things like this? This is a huge challenge, one that's really difficult to detect, and not something I can promise we do perfectly. But as we collect more and more metrics, we'll start to get hints at whether these issues are feeding into our system more or less over time. I'm just going to pause and check for questions; I see some introductions, awesome.

So that's my brief overview of LLMs and generative AI. I have many more slides if folks want to talk about definitions and the basics, and I'm happy to share that content out, but I want to move on to what we do when we're building on top of LLMs and have to support a system in production. I want to talk about three concepts: one is evaluation, one is monitoring, and the third is observability.

First, evaluation. If you've done data science work, you've probably seen this term the most, and I would summarize evaluation as understanding the quality of your model. Here are a couple of examples where we might be evaluating two different models: on the left you see a model that draws essentially a straight line to distinguish between two types of data, and on the right we see a much more high-dimensional boundary distinguishing between the same two types. Evaluation is really difficult for lots of reasons. You often want something like testing data, so you need labels, so that you can look at the percentage of instances you got correct or incorrect; but those labels aren't 100% accurate, so we need to think about evaluation more broadly than that. Going further, perfect accuracy isn't always the best thing, especially in a world where we worry about overfitting, so we need to think about how representative our test data is of what we're going to see in the real world. As a data scientist, maybe on the more academic side (although we have to think about this issue too), I might treat the test data as something very similar to the training data and not worry about production, where there's much more complexity and much more realism about how the system will be used, how that will change over time, how that will affect our system, and how well we can evaluate how well it works.

More importantly, in production settings, and actually in many academic settings as well, we may have training data but no further testing data, no more realistic testing data or labels, which means evaluation takes a completely different path and we have to think quite creatively about how to evaluate. Maybe we use different KPIs that relate to our system. Often we move to the world of change: we're not quite asking "what's the percentage accuracy," we're asking how the characteristics of the data I saw yesterday differ from the data I saw today, and whether that's a consistent trend or not.
And that brings us to monitoring. Monitoring is something you'll see much more often in production. On the left side here we see experimental machine learning: train and test, a static setup where we collect data once and focus on this one dataset and how we've split it. But in production machine learning we're collecting data over time; we'll see new data tomorrow, and the day after that, and so on, and we can use that time dimension to understand our system. We can't quite evaluate with the same metrics we used to, but we can work toward evaluation. It takes a completely different skill set, something I've talked about in the past: the transition between the experimental and the production ways of looking at things, and learning how to monitor your system, is the big jump between the two.

The third concept, which I'm going to go into a little more detail about, is observability. Monitoring tells us that something in our system may have changed, which tells us something, but it doesn't give us any hints about what to do in response; maybe you decide to retrain your model, or something like that. What we would love is observability: something that helps pinpoint issues within the system, because we're looking not only at the overall system inputs and outputs but at different points across the system. Think of machine learning as a pipeline with lots of different steps in the process. We start by collecting input data and translating it into features; whether that happens inside the model, as in deep learning, or outside it, as in classical machine learning, depends on your setup, and in production cases there's often a little bit of both. Even if you're largely doing deep learning, there's normally some sort of storage of your data; maybe you have information about the users of your system that gets feature-transformed at least somewhat, or data cleaning, which is a similar thing. So we want to collect telemetry through multiple parts of that beginning, feature-side portion, but we also collect data in the model itself. You might have a system where a certain layer outputs statistics about the data as it comes through, or different checkpoints within the model itself, or, if you have an ensemble of multiple models whose information is brought together into some averaged output, we definitely want to keep the outputs of all of those different pieces. Then we move on to predictions: we certainly want to store the predictions the model made, just like the inputs.
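Pulling the monitoring idea together, here is a minimal sketch of the "compare yesterday to today" signal using whylogs profiles (column names and data invented for illustration; in practice the WhyLabs platform runs drift and anomaly detection over many such profiles):

```python
# Day-over-day comparison of the same distributional statistic.
import pandas as pd
import whylogs as why

yesterday = pd.DataFrame({"response_length": [40, 55, 62, 48]})
today = pd.DataFrame({"response_length": [120, 140, 133, 150]})

view_then = why.log(yesterday).view().to_pandas()
view_now = why.log(today).view().to_pandas()

# A large jump in the mean is the kind of raw signal a monitor would flag.
print(view_then.loc["response_length", "distribution/mean"])
print(view_now.loc["response_length", "distribution/mean"])
```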
Then, often with a lot of time in between, we may get some hints at the ground truth, the true label, and this certainly depends on the application. Some applications I've worked on, related to inventory and demand, are actually a little easier, because we can see how many items were actually bought the next day, and that gives us ground truth. Another example I like to give is ride sharing: if you have a model that predicts how long it might take to get your Uber from this point of the city to that point, you will get ground truth after that time elapses, if you've collected it in the system. That differs from other applications where ground truth is much, much rarer, where you may never see data about what eventually happened or whether your predictions were correct. In that case, you'll certainly lean on KPIs: different metrics related to the business, maybe not directly related to your model but ones you suspect are somewhat correlated; for example, customer feedback, whether users came back to your site, how many minutes they spent on your website, that sort of thing. So these are some examples of how we move from the harder statistics on data we have in hand, to things we've created: performance metrics and data quality, where we have to do some measuring and may need ground truth, and business KPIs, which take effort of their own but are a little less directly related to the model.

I'm going to pause there a second to see if there are questions I can clear up, and then we'll go on to LangChain plus WhyLabs. I see one question, maybe or maybe not directed at me, about courses on building LLMs from scratch. There are some really great courses on that, though not any that we give at WhyLabs. Learning how an LLM works and building one from scratch will really help you on the data science side of things; this is why I've gone through it myself. But I would say that for many, many people it isn't something that's super necessary. I'm happy to share resources on building LLMs from scratch that I've found and enjoyed, but for folks who aren't interested, no worries; I don't think it's necessary. You learn a lot about the math, but we're in a paradigm where the LLMs we may use, such as OpenAI's, take millions of dollars to build, hundreds of millions of dollars, maybe more than that. So there's a real difference between what we'll be able to build ourselves or not, especially depending on what you mean by "from scratch." There are some really great open-source LLMs, things like Llama 2 and others, that I've actually worked with and fine-tuned, but I still wouldn't say I've built those from scratch; mine are even simpler than that. That's the only question I see on YouTube, so let's keep going.

Now let's talk about the code side of things, the practical side: how do we combine LangChain and WhyLabs, and maybe first, what are these things and why do we care about them? The first thing I want to mention is that we're really focused on open-source software here at WhyLabs. You see logos here for both WhyLabs and LangChain, and a fun challenge for the comments: can you name all five of these logos for other open-source machine learning initiatives? And why is open source important?
Well, for one, there are a lot of open-source tools that machine learning is built on: PyTorch, many other tools in the past, and the numerical tools underneath them. These have been here for a long time, and without these tools machine learning wouldn't even exist as it is today. Going further, the tools that allow us to share new research, open-source models, and open-source platforms for sharing data are all so important and so integrated into our ecosystem that it's really hard to extricate yourself from open source; and I also think we really wouldn't want to. Being able to see the source, interact with it, and contribute to it is a major benefit for understanding where your system may be failing, getting feedback about how to fix it, and all of these sorts of things. We're no different at WhyLabs.

So let's keep going: first we're going to talk about LangChain, and then about how that blends with what we have at WhyLabs, and LangKit specifically. LangChain, and I'm very curious whether folks have heard of it before, is a very popular tool for composing different LLMs, and different processes related to LLMs, together. I'd say it reduces the engineering boilerplate around LLMs. The things it can do are shown here in the components diagram for LangChain, and these change all the time, so this is a fairly old snapshot, but these are the basic components. There are different LLM models, things like OpenAI's, or open-source LLMs you might have access to, and we want to be able to communicate with the APIs of these different models. It would take a lot of effort to write the code, often boilerplate code, to interact with all these systems, each with slightly different formats and conventions.

There's also a lot of work on the prompt side of things. I think when we all first start working with LLMs, we type things directly in, but eventually, when we start to think about how a system might work on top of an LLM, prompts acquire a lot of structure: maybe we want some template for the prompt, maybe we think about how we're sending in metadata for the prompt, maybe we send it in different ways, as a system prompt versus a user prompt. Lots of structure needs to be added to prompts as we move toward production. I won't go through all of these, but there are other components LangChain does really well: agents to work with these things; chains between systems, where maybe you use an LLM and then pass its result into some other system that does some sort of guardrailing; and memory. Many times when we're thinking about LLMs we're in a chat context, where you've asked a question, gotten a response, and then asked another question that presumes some history is there; keeping track of all of that history can be really complex, something we grapple with a lot over at WhyLabs as well. LangChain is really great for developing best practices around all of these things, in a package that is admittedly evolving a lot but includes ways to combine these things together in a reasonable way.
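As a small sketch of the memory component described above, in the LangChain style current around the time of this talk (the API has since evolved, so treat this as illustrative; it assumes `OPENAI_API_KEY` is set):

```python
# Chat memory: the buffer replays earlier turns so the model sees history.
from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(temperature=0)
chain = ConversationChain(llm=llm, memory=ConversationBufferMemory())

chain.predict(input="Hi, my name is Bernease.")
# Because the memory is replayed, the model can answer this follow-up:
print(chain.predict(input="What is my name?"))
```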
All right, so we are integrated into LangChain, this very popular package for doing this sort of thing, and we'll see how to use the integration in a little bit, but this is just a callout that when you install LangChain, you're able to import a WhyLabs callback handler and use it directly; you're downloading the tools you need to do that. How do we use it? We'll see it both on this slide and in the demo in a little bit, but it's very simple: really just lines one and four involve WhyLabs, and the rest is the bare basics of what you need for LangChain. Once this is installed, we import the WhyLabs callback handler, and then when we use LangChain, as you'll see on line four at the bottom, all we need to do is pass in a list of the callbacks we're using. What happens is that we use LangChain for our interactions with LLMs, maybe one, maybe multiple, maybe other systems around those LLMs, and that information gets passed on to the WhyLabs open-source tools, in our case LangKit and whylogs, to be processed from there.

So what does it mean when we pass things on to whylogs? What do whylogs and LangKit actually do? whylogs is the tool, the little orange square I talked about in the intro, where we profile the data: we compute a number of metrics, taking the raw data, which contains lots of private information, and reducing it down to statistics and metadata that are important for understanding trends in our data, specifically with machine learning in mind. Data profiling is the general concept of analyzing a dataset to find information and structure; you often collect things like descriptive statistics, types, patterns, keywords, data quality checks, that sort of thing.

LangKit is a tool built on top of whylogs specifically for text and LLMs, and what we do there is curate a set of metrics that are important for text and LLMs. The problem with naively collecting information from text is that you might get basic things like string lengths, which don't quite tell you whether this text looks like a hallucination, or whether this text seems simpler than the text we've seen in the past, that sort of thing. We've done a lot of work curating a list of metrics that are important, especially for production systems, for understanding trends and potential issues in your system related to security, data quality, governance, sentiment, and, at this point, many more.
So you might ask: what's the point? Why are metrics so important, and why have a separate tool to do this? I'm going to give an example with images, partly because I'd like to belabor the point that generative AI goes beyond text, even though we're focusing on text today, and partly because I think it's more fun to see images than text on slides. On the left is an example from DCGAN, a generative model that produced images of faces from faces; you can see the quality, and it's a little painful to look at. And this other example, even in 2017, which is quite a while ago now for us here in 2024, is a different version of a GAN, producing much better images of faces. So here's the question: if we were trying to detect differences in quality between the two, how would we do so algorithmically? How would you come up with metrics for it? If folks have ideas for metrics, feel free to drop them in the chat; you certainly don't have to write Python in the chat, just how you would approach writing an algorithm that says "this image looks more realistic than that image." I'll give you some time to come up with ideas, and I'll share a few of mine. The number of eyes might be something you think of; in some of these cases there are fewer than two or more than two, but that already presents a problem: how do we measure an eye? How do we know what an eye is? Things like the color gradient, the clarity of gradients and colors, might be another. You might measure similarity to some composite face, different things using models. In all of these there's a huge challenge in coming up with the metrics, and it doesn't make a lot of sense for all of us to come up with them individually. A lot of our work in LangKit is starting this process of coming up with and aggregating these metrics for text, as well as gathering metrics for other modalities like images.

Okay, that's what I have for slides, and now I want to jump into the demo, which is perhaps the more fun part. What I'd like you to do is go to the WhyLabs platform; let me share my screen. So this is the WhyLabs platform: go to hub.whylabsapp.com, and I'll start on the start page here. Actually, let me just sign out so we're on the same page. You'll see a page like this; it may have something at the top (mine doesn't since I've signed in before). Go to "log in" if you've been here before, welcome back, and if you haven't, feel free to click "sign up" right here. It's a super fast process, as things often are on the internet these days: just your email address, you'll get an email with a link to click to confirm it's yours, and you'll be all signed up. Once you've done that, you'll be able to log in; I've connected my Google account, so I'll just use that. I've cleared my account out for you so I'm as close to fresh as possible, but once you get in you'll probably have some more interesting tools to help you out.

What we're going to do is create a new model so that we can pass in some text. I'll hit "create resource," give it a model name, let's call it "LangKit demo," make just one, keep it at daily, change the type to "large language model" right here, and add the model. We get a couple of things here. First, a model ID, so be sure to note yours; mine is model-25, because I've made many models before. There are a couple of other pieces of information we'll want: after the model ID, we want our organization ID, which you can find in a couple of places, one being up here in the header under your org; and then you'll need to create an access token. I have many access tokens already.
To make one, click under "Settings," then "Access Tokens": give your token a name, optionally set an expiration date, and create it by hitting the orange button. Once you've done that and stored those three things, organization ID, model ID, and access token, we're ready to get into the demo, and I'll show you how this works. Let me reconnect and switch over to the Colab notebook.

Some instructions for Colab: because this notebook is view-only for you, go to File, then "Save a copy in Drive," and you'll be able to run and edit the notebook as you want. There's a link to get your free WhyLabs account, and the third thing you need is an OpenAI API key, which will allow you to use the particular LLM we'll be using in our example. There are some other useful links there that may help as well. The first thing we do is make sure the runtime is working with a hello-world cell. The next thing is to install two packages. First we install LangKit, the WhyLabs open-source project specifically for text and LLMs, which installs whylogs as well; then we install LangChain. As I mentioned before, LangChain actually has LangKit available as a dependency, but we're going to install and use them separately to understand what's working underneath the hood. This will take a little time, so I'll check chat really quickly; thank you for sharing those links.

All right, we're downloaded. The next cell does some setup for the organization IDs and all the info we just grabbed. When I run it, I get a few prompts: first my organization ID; second my dataset ID (in my case model-25; yours will have a different number); third my WhyLabs API key, the one we just made on the platform; and finally my OpenAI key, the one you get from OpenAI, or from whichever LLM provider you're going to use (here we're using OpenAI). Once those are set up, we're ready to rock and roll.

I just want to show you how to get started with LangKit, and it's super easy. All you need to do is import the particular module, or set of modules, you want to use. One example is `from langkit import llm_metrics`, which is a nice curated set of metrics that are important for LLMs; there's a light-metrics module if you want a few less, and there are ways to get many more if you want them. I'm also importing whylogs, because I'm going to use it directly here. When I run this, it takes a little time on the first run, including for me here on Colab, because LangKit uses not only simple methods to calculate metrics: some of those methods involve machine learning models themselves, for example to detect something like sentiment or a reading score, so there's a little bit of time to download those models the first time.
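Roughly, the notebook's setup cells do something like the following (a sketch; see the linked Colab for the exact code, and note the key value is a placeholder):

```python
# Session + schema setup for the demo.
import os

import whylogs as why
from langkit import llm_metrics

# why.init() collects (or reads) the WhyLabs org ID, dataset/model ID,
# and API key so that later profiles can be uploaded to the platform.
why.init()

os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder: your OpenAI key

# Build a whylogs schema that computes LangKit's curated LLM metrics
# (downloads a few small models, e.g. for sentiment, on first run).
schema = llm_metrics.init()
```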
What happens after this is that I initialize this into a schema: a schema is just an object that holds all of the metric definitions we want to compute about our data, along with configuration settings for them. So let's get started. The first thing we'll do is use some LangKit sample data. We have some sample chats, so I'll call LangKit's samples helper, load the chats, look at the first chat in our list, and then log it; let's do all of those things at once. We see that there are 50 examples, and the first one has a prompt of "hello" and a response of "world."

Then, for the first time, we log the data, and... "no session found." Right, I should have initialized a session first, so let's run `why.init()`; apologies, I think I deleted that cell a little earlier in my cleanup. It says it found WhyLabs information, because we passed in all of that organization information, so we're all set to go. Now let's run the logging cell again; it should be a little faster, and we get results from logging our 50 records of data, under the name "langkit sample chats." What do we see here? We see that those rows have been aggregated into a data profile, and we can click this link which, hopefully, yes, loads, and shows a number of metrics related to our text.

Before we get to the metrics, let's look at the text we passed in. In this first example I'm just passing in the raw text, a DataFrame of prompts and responses, and this goes back to the fact that LangKit, and WhyLabs in general, is very data-centric. We're not forcing you to run the LLM through us or anything like that (that's something LangChain does really well); all we're asking for is the raw text of the prompts and, later, the responses from your messages. That's one way to interact with this; I'll close these windows since that's a lot of scrolling.

So what gets stored inside this profile? When we profiled, we saved the result as an object called `results`. We can take a view of those results, because the results contain complex data structures we don't really care about in this case, and then look at them in a pandas format, which might be more familiar. What do we see? Different columns here, for the prompt and so on, with lots of information; let me resize so the lines are visible. For example, cardinality: roughly, how many unique values we have across the data. The first thing you might notice is that these are estimates; this is a statistics-based process, so we have a lower bound and an upper bound for our cardinality estimate, as with lots of the other metrics here.
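In code, that logging-and-inspection step looks roughly like this (a sketch paraphrasing the notebook: the sample chats here are typed out by hand, `schema` comes from `llm_metrics.init()` above, and `why.log`'s `name` argument assumes an initialized session):

```python
# LangKit/whylogs only need the raw prompt and response text.
import pandas as pd
import whylogs as why

chats = pd.DataFrame({
    "prompt": ["hello", "what's the weather like?"],
    "response": ["world", "I can't check live weather, sorry!"],
})

results = why.log(chats, name="langkit sample chats", schema=schema)

# One row per tracked column (prompt, response, prompt.char_count, ...),
# with statistics such as cardinality estimates and distribution summaries.
view = results.view().to_pandas()
print(view[["cardinality/est", "distribution/mean", "distribution/max"]])
```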
One thing that's immediately helpful: if this were a production setting and we knew we'd sent 50 prompts, 50 examples, to our data, we've already learned something about our dataset, which is that there are at least, or roughly, three repeats, because there are 47 estimated unique examples. That means we have about three examples where maybe our system accidentally re-ran the same prompt or something like that, and this is worth knowing. There's lots of other information: the aggregate reading level of the prompts we've sent, the average character count, the number of difficult words, and so on, and for each of these we have different statistics, not just the cardinality. For example, it's a little hard to see here, but the mean character count is 81, the median character count is 59 characters, and so on and so forth; our smallest prompt is a six-character prompt, probably our "hello."

Let's go forward, then, and briefly break this down to understand what's going on. First I'm going to go through using LangKit, OpenAI, and LangChain the longer way, and then we'll wrap it up very quickly to see how to do it with the integration. So first, how does LangChain work, before we get to LangKit? LangChain, again, has many of these components that are really helpful for making templates, calling LLMs, and running different processes related to LLMs. For example, I may want to start a chat with a couple of different types of prompts, so I'm going to create a template that says the following. I want a system message; when you call at least the OpenAI API, you can pass in a message to the system, so I'm telling the LLM: "You are a helpful assistant; rewrite the user's text to sound more upbeat and happy." And then we have a human message prompt, the prompt that comes from the user, and in this case I'm just going to pass through whatever text they've provided.
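A reconstruction of the pattern being described, in the pre-0.1-style LangChain imports matching the notebook's era (the exact code lives in the linked Colab; this is a hedged sketch):

```python
# A system + human prompt template, wired to OpenAI through LangChain.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that rewrites the user's "
               "text to sound more upbeat and happy."),
    ("human", "{text}"),
])

chat = ChatOpenAI(temperature=0.7)  # assumes OPENAI_API_KEY is set

def rewrite_upbeat(text: str) -> dict:
    # Return the prompt alongside the content of the model's response,
    # in the flat shape that's convenient to hand to LangKit/whylogs.
    response = chat(template.format_messages(text=text))
    return {"prompt": text, "response": response.content}

print(rewrite_upbeat("I don't like Mondays, or Tuesdays for that matter."))
```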
Then, to use LangChain with OpenAI, I can import the chat model from langchain.chat_models and use it to call OpenAI with this prompt, much like the sketch above. First we set up a variable pointing to the model; then we create a new function that packages things together: it takes our prompt, applies our template, calls our LLM, and outputs the result in a specific shape that's helpful to me, because LangChain can give the response in lots of different forms. We output the prompt as what we already passed in, since we don't need to pull it back out of LangChain, and we dig into the response object to pull out the content of the response we got from our LLM.

Now that we have this function, let's try an example. I'll say something like "I don't like Mondays, or Tuesdays for that matter," which is already pretty negative, and we'll see what the upbeat LLM tells us. What we get back is: "I'm not particularly fond of Mondays or Tuesdays, for that matter, but hey, the week is full of endless possibilities!" So, some upbeatness to that. That's just one example of how we use LangChain, and this idea of creating templates can be really helpful in applications, especially when we're giving some context to the LLM that we want to keep separate from user inputs and prompts and so on.

So now we can pass that info into LangKit. We've already initialized everything, so we can just profile using the schema we had before. We've created one row, because this was just one prompt, and we can see the profile there. It's often nice to give the profile a name; that's especially important for something like training data, where I'd always say something like "positive examples." I'll give this one a name and look at it. Not much of a histogram going on, because we only have one data point, or perhaps I clicked through a little too quickly, but our data has come up, and we can of course look at the same pandas view we saw before; we could obviously do this with multiple data points as well.

But I want to move down to looking a little further into our examples. We created a profile up here, and if I run this, we can see that the profile object is not just a pandas DataFrame of data: underneath, it's data structures that contain all of the important metrics and information we might need about our data. For example, here's one way to search for a specific statistic: the distribution max of a particular metric, the aggregate reading level of the response. The answer here is 28, which tells us a little about the scale of the reading levels; you can look into the details in our docs.
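That lookup can be sketched like this, using the `results` object from the sample-chats logging earlier (the metric name comes from LangKit's reading-level module; treat the exact accessors as illustrative whylogs usage rather than the notebook's verbatim code):

```python
# Dig a single statistic out of a profile.
dist = (
    results.view()
    .get_column("response.aggregate_reading_level")
    .get_metric("distribution")
)
print(dist.to_summary_dict()["max"])
```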
Now I want to show how we might use this in an application. In our example, we could make another one of these templates, but this time we'll use a different kind of pattern. Instead of calling why.log directly and passing in the schema, we can also create a writer, and to the writer we can give settings for how this information will be passed from whylogs and LangKit up to the WhyLabs platform. I could show this with a second model; for time I'm going to skip running it, but I made a grumpier version of the same system, and you would see that you can keep multiple models' information separate and analyze them separately, because, as you'd imagine, there are probably multiple streams in your application.

And finally we get to the code we saw in the slides: the official LangChain integration. It's very simple, just a couple of lines of code in this case. We create the callback handler for WhyLabs, create our LLM with some settings, and do the same sort of thing; this one won't use our grumpy template, for this example it just passes the prompts straight through. When we run it, we've called the LLM for all of these prompts, we see their outputs down here in the response, and they're also passed on to WhyLabs. So this is a much faster way of doing it, but both approaches are really viable, and both are fairly quick.
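The integration path looks roughly like this (modeled on the LangChain WhyLabs callback docs of that era; check the current docs for up-to-date import paths, and note it reads WhyLabs credentials from the environment):

```python
# The "couple of lines" version: LangChain calls the LLM, the callback
# handler profiles every prompt/response and ships telemetry to WhyLabs.
from langchain.callbacks import WhyLabsCallbackHandler
from langchain.llms import OpenAI

whylabs = WhyLabsCallbackHandler.from_params()  # org/dataset/API key from env
llm = OpenAI(temperature=0, callbacks=[whylabs])

result = llm.generate([
    "What's the best way to spend a sunny day?",
    "Summarize the plot of Hamlet in one sentence.",
])
print(result)

whylabs.close()  # flush any remaining telemetry
```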
Now I'll jump back to the slides; we'll skip through this, but just as a reminder, we looked at three parts. There's monitoring and actionable production insights in the WhyLabs platform, which we didn't actually get a chance to look at too deeply, but there are lots of really great insights there. If we look at our demo organization, for example, we'll see really great demos of LLM chatbots, with dashboards for things like security issues. I'll pick the LLM chatbot (I probably picked the wrong date at first; let's go back), look at the profiles, look at a particular batch in time, and we can see, for some historical data, what's going on and dig into issues. I'll leave that more as an exercise for the user, but there are lots of interesting, actionable insights that come out of it, giving us an end-to-end monitoring and observability approach for our data over time. Then there's whylogs, the tool that does the data profiling, and LangKit, which has the LLM-specific tools and is integrated into LangChain, which is what we saw before. Thank you; I'm going to check for questions, and here are some instructions for swag and things as we wait.

I see one question: how is LangSmith different from LangChain, or sorry, from LangKit? LangSmith is a more recent addition to the LangChain ecosystem; there are too many "Langs" in the world today. LangKit, our tool at WhyLabs, is really meant for production settings. There's a lot of statistics and data science behind which metrics should be there and what they mean for the observability of your system: not only the standard metrics we talk about in research, or when evaluating a machine learning model generally, but metrics calibrated to production use cases, computed in a way that preserves privacy, so we're not collecting the raw data. It also works for distributed systems; it would take a whole different presentation to explain why whylogs works so well for distributed systems, but you can merge profiles from multiple systems together and still maintain the integrity of those metrics, which is very rare, and something LangSmith doesn't do. So it's really oriented toward production use cases and observability.

Another question: "So the prompt/response pairs are sent to the WhyLabs API rather than having the metrics computed locally, correct?" No, actually the opposite. whylogs is the open-source tool that does all of the metric calculation locally, and the profile object is the only thing that gets passed on. The prompt/response pairs exist only on your local machine; on that machine we calculate a bunch of metrics and send those over. There is an option, if you'd like, to send some subset of the raw data, for example the top 30 prompts and responses, but that's opt-in; the model is to do all of the calculation locally and send only the metadata and distributional information. Great question, thank you for allowing me to clarify that.

I don't see any other questions here, and I can't quite see LinkedIn questions the same way, so I'll try to respond to those in text right afterward. Have a great day, everyone, and again, feel free to contact me by email or via lots of social media, just at my first name, Bernease. Here are the links to the slides and the Colab, and if you'd like some swag or other objects, feel free to go to that link and join our community Slack. Okay, have a great day!
Info
Channel: WhyLabs
Views: 483
Id: Gxs6VpP3Sww
Length: 66min 38sec (3998 seconds)
Published: Wed Jan 24 2024