Using Vector Databases: Practical Advice for Production // Sam Partee // LLMs in Prod Conference

Captions
[Music] Excuse me, do you guys know who Sam Partee is? Anybody? Anybody? No. Dude, they don't know you. What is going on, Mr. Partee? They don't know you. Oh, I can't hear you, uh-oh, I think, Mr. Partee, I'm not sure if it's my side. It's impossible to go after you in anything, what am I supposed to do here? The best part is that I'm not sharing the top-quality vector database content like you're about to bring, so I think people came more for the vector store talks than for the random LLM improvised songs, but both of them make a nice little sandwich. Hey, well look, it's going to be some good content, a little more advanced than the last one, so if you didn't see the last one, definitely go check it out; it's going to fill in some blanks. Before I share your screen I just want to make sure that you saw this: "Excuse me, do you guys know who Sam Partee is? Anybody? Anybody? No." Why do they not know you, Sam? I don't know, it's because I'm irrelevant. That's going to change right now, because we are about to make you famous; everybody's going to be saying your name after this talk. All right, I'm sharing these slides, let's hope it doesn't crash. Let's go, baby.

All right, so as I said, this talk is a continuation of my previous talk, so if you didn't see part one, it lays things out in an easier way than this one will. I'll start a little further along than where I started last time. I show this slide in almost every talk I do, but I feel it's important to start here so everybody gets on the same page. I'm going to rush through it, so once again, definitely go watch my previous talk for more detail.

So what are vector embeddings? Vector embeddings are essentially lists of numbers. I've used this analogy a lot recently: think of a grocery list, where every single item is something you need to go get. In this case it's a list of numbers where each number means something about some piece of unstructured data: audio, text, images. It's highly dimensional, sometimes sparse but usually dense, in the sense that each of those items means something. If you think about how convolutional neural networks pick up filters, or about MNIST and the curve of every digit, each dimension in the list actually means something about that input data. It has never been easier to create these embeddings, and there are APIs to do so from our friends at Hugging Face, OpenAI, and Cohere (congrats to Nils and co on the round). Again, I'm blitzing through this, but if you didn't watch the last talk or need more background, go look at the MLOps Community blog post I wrote.

How do we search through vectors? Each of those vectors captures the important properties of the data, and you can essentially subtract them, or use something like cosine similarity, to compare the distance between them; if you take one minus the distance, you get their similarity. In this case we can see an example of three sentences and a query sentence, each turned into a semantic vector, meaning semantic, not lexical like a BM25 search, which implies we're looking for the meaning of the sentence rather than the words of the sentence. We calculate the difference in meaning between all of these sentences and see that "that is a very happy person" is the most similar to "that is a happy person," which makes sense: semantically, those are the most similar.
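A minimal sketch of reproducing that comparison, assuming the open-source sentence-transformers library; the model name and the two non-matching sentences are illustrative stand-ins, since the talk only quotes the query and its closest match:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "That is a happy person",
    "That is a happy dog",       # illustrative filler sentence
    "Today is a sunny day",      # illustrative filler sentence
]
query = "That is a very happy person"

# Embed everything into dense vectors.
embeddings = model.encode(sentences)
query_vec = model.encode(query)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine similarity == 1 - cosine distance
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for sentence, vec in zip(sentences, embeddings):
    print(f"{sentence!r}: {cosine_similarity(query_vec, vec):.3f}")
# "That is a happy person" should score highest: the meaning matches, not the words.
```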
Vector databases, as I explained in my last talk, are used to perform similarity searches. You create a search space within a database that has broad operations, then use those embedding models supplied by our friends to take unstructured data, index it inside that database, and perform searches against it. I work at Redis, and Redis does this with RediSearch; when you combine those two things you get a pretty powerful vector database. I'm going to skip through this, but if you didn't know, Redis is a vector database. We have a bunch of integrations (shout-out to Tyler Hutcherson for doing a lot of them), two indexing methods, our ANN index, HNSW (which should read HNSW, not HSNW), and FLAT, plus distance metrics L2, cosine, and inner product, support for hybrid queries, and support for JSON. I explained a lot of this in my last talk, but we're going to get to the fun stuff now.

OK, so here's where I'll slow down: design patterns. As I've said to a bunch of people, if you saw Chip's talk from the ML learners group, I covered some of this, but I've expanded upon it. These are design patterns I've seen and deployed, for customers, in practice, or in demos, for LLM usage with vector databases. Not all of them apply to every case, but they're things I've seen out in the field, so it's important that everybody gets on the same page with them.

One interesting place to start is the Sequoia LLM survey; I'm sure most people have heard of this piece. 88% of the surveyed group believed the retrieval mechanism was a key part of the architecture of their LLM stack. Everybody knows I'm not going to argue for "LLMOps" or "VectorOps," but I do think that in this case, the LLM stack, a vector database together with a large language model, there's clearly a synergy, and you'll see that throughout the design patterns. Again, 88% believe some retrieval mechanism is necessary for their large language models.

The first pattern, and the one you see the most, is context retrieval. I showed this in my last talk, but it is the most important one, and it's an overarching group as well as a design pattern. The whole point is that you have some question-and-answer loop, chatbot, or recommendation system, and your goal is to retrieve contextually relevant information from within your knowledge base to supply the LLM, so that it has contextually relevant information at the time of generation. This is cheaper and faster than fine-tuning, and it allows real-time updates: very importantly, if you have a constantly or rapidly changing knowledge base or source of information, you are not able to change the model and redeploy with enough velocity for those changes to be represented in the end result. The only way you can do this is with a vector database, injecting that context into prompts whenever you find semantically or otherwise relevant information to include.
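As a rough sketch of the indexing and retrieval step underneath this pattern, here is what it can look like with Redis and redis-py, assuming a local Redis Stack instance and 384-dimensional embeddings; the index name, key prefix, field names, and the random stand-in vectors are illustrative, not from the talk:

```python
import numpy as np
from redis import Redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = Redis(host="localhost", port=6379)

# HNSW (ANN) index over a FLOAT32 vector field, using cosine distance.
r.ft("docs").create_index(
    fields=[
        TextField("content"),
        VectorField("embedding", "HNSW",
                    {"TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE"}),
    ],
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

# Index a document: store the text alongside its embedding bytes.
doc_vec = np.random.rand(384).astype(np.float32)   # stand-in for a real embedding
r.hset("doc:1", mapping={"content": "Our return policy lasts 30 days.",
                         "embedding": doc_vec.tobytes()})

# KNN query: top 3 nearest neighbours to the query embedding,
# ready to be dropped into the prompt as context.
query_vec = np.random.rand(384).astype(np.float32)  # stand-in for a real embedding
q = (Query("*=>[KNN 3 @embedding $vec AS score]")
     .sort_by("score")
     .return_fields("content", "score")
     .dialect(2))
results = r.ft("docs").search(q, query_params={"vec": query_vec.tobytes()})
for doc in results.docs:
    print(doc.content, doc.score)
```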
This also protects you in sensitive-data scenarios. Say you're something like a bank, you have multiple sets of analysts, and you want to say that this group of analysts is allowed to access these particular documents and that group is allowed to access those documents. If the model has been trained or fine-tuned on those documents, you can't guarantee it won't surface a particular piece of the knowledge base. With a vector database you can: you can use things like role-based access control (RBAC) with vector databases like Redis and say that this particular user does not have access to these particular parts of the index. That lets you segment your data accordingly, which in turn lets you build different, more interesting application architectures while protecting your sensitive data. This is good for all types of use cases, not just question answering, but Q&A is the one that's most easily recognized. There are a couple of examples, and honestly many of the examples I have are in Redis Ventures, github.com/RedisVentures, our team's repo, so go check that out.

One more advanced thing on top of context retrieval is something I've done a couple of times now called HyDE, hypothetical document embeddings. Essentially you end up using generation on top of context retrieval and then use that output again as the context in a prompt, so there are actually two invocations of a generative model here: first you have the context retrieved from the query, and then you use the generated hypothetical answer to get the context for the LLM that is going to answer the question. Because this takes multiple trips through an LLM, it is slow; it may take seconds. The way we've used it in the past, if you're familiar with asynchronous Python programming, is to do what's called a gather, or even just two asynchronous calls that both go off to complete some action. The first one goes and searches for context without HyDE, and you have a threshold, some similarity score metric, to decide whether that context is good or not. If nothing useful is retrieved, you fall back to the second asynchronous call, the HyDE approach; since it takes a little longer, you wait and ask, did the first one return anything? If not, now wait for the HyDE approach to finish. That's one way we've seen to get around the problem of retrieved context not being perfect for a specific question. I do recommend going and reading the HyDE paper; there's a lot more to talk about here, but I don't have that much time.
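A rough sketch of that fallback pattern in asynchronous Python, assuming two hypothetical placeholder helpers: plain_retrieve would embed the question and run a normal vector search, and hyde_retrieve would first ask an LLM for a hypothetical answer, embed it, and search with that; neither is a real implementation here:

```python
import asyncio

SIMILARITY_THRESHOLD = 0.80  # illustrative cut-off for "good enough" context


async def plain_retrieve(question: str) -> tuple[str, float]:
    # Placeholder: embed the question and run a normal vector search,
    # returning the best context and its similarity score.
    await asyncio.sleep(0.05)
    return "", 0.0


async def hyde_retrieve(question: str) -> str:
    # Placeholder: generate a hypothetical answer with an LLM, embed it,
    # and search with that embedding; slower because of the extra LLM call.
    await asyncio.sleep(1.0)
    return "context found via the hypothetical-answer embedding"


async def retrieve_context(question: str) -> str:
    # Kick off both paths at once.
    plain_task = asyncio.create_task(plain_retrieve(question))
    hyde_task = asyncio.create_task(hyde_retrieve(question))

    context, score = await plain_task
    if context and score >= SIMILARITY_THRESHOLD:
        hyde_task.cancel()   # the fast path was good enough
        return context

    # Otherwise wait for the slower HyDE path, which has been running in parallel.
    return await hyde_task


print(asyncio.run(retrieve_context("I had a problem with my last order")))
```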
OK, so feature injection, a really cool one. Shout-out to Simba and co at Featureform for adding the ability to use Redis as a vector database in addition to a feature store. This is something we've done a lot, since Redis is commonly an online feature store; when you're able to use both in the same context (pun intended), you get the qualities of both at the same time. What do I mean by that? Say you have an e-commerce website, a user logs in, and that user has bought some product from your e-commerce solution. Say the chatbot's whole purpose is to help the user with issues, and the user says, "I had a problem with my last order." You could build semantic buckets and do a lookup that identifies "this is a user asking about ordering," but instead you can set things up so the prompt recognizes when it needs to go retrieve features from the feature store and pulls in that contextually relevant information at the same time. This allows real-time inclusion of entity information, whether it's a user or a product, so the model, which beforehand had no idea that this user had bought that product, can now look up specific things about that user and include them in the context window. Every time the user does something, the context window might include the last 10 items the user bought, or the type of user ("thanks for being a rewards member"); all of those specifics can now be injected into the prompt, because your feature store is in the same loop as your vector database and your LLM. Especially with Redis, you're able to do this on the same infrastructure, which is really nice because you can co-locate metadata with your vectors, your features with your vectors. Go check out Featureform on this; they can do both, it's pretty cool, and they just released it.

Semantic caching, I talked about this one last time too, and it's getting even more popular. The basic concept: everybody, think about a typical cache. The input is hashed and used as a key (you do something like CRC16 or similar to decide which hash slot things go to), and then you do a lookup by asking, is this new input the same as one I've hashed before? Going back to the product scenario: someone says "can you tell me about product X," and the next user says "can you tell me about product X?" with a question mark. Technically those won't hash to the same thing, but semantically their similarity might be 99.9%, so shouldn't they return the same answer? You can set that threshold, say only 98 or 99% similar gets the cached answer returned, and you can even caveat it and say this was a pre-written response. Even better, if you have a fixed set of questions and answers, like an FAQ, and you're having a bot answer all of them, you can just go through, answer all of those questions up front, and cache them. The benefit is that you're not invoking an LLM, so you're not incurring the cost, monetarily or computationally, and it speeds up your application: QPS-wise it gets a lot better if you've pre-embedded all of those answers and queries. Basically, you embed the query, and every time something is answered you store it in the database, so that when a new user asks the same thing you return the same answer back, according to some threshold. This is becoming very popular, and if you want to try out a kind of alpha of this, definitely let me know, because we have some stuff coming.
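A toy sketch of that loop, assuming numpy and two caller-supplied callables, embed for the embedding model and call_llm for generation, rather than any particular provider; a production version would keep the cache in the vector database itself rather than an in-memory list:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.98               # e.g. only reuse answers for near-identical questions
cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def answer(query: str, embed, call_llm) -> str:
    vec = embed(query)
    # 1. Look for a semantically similar previous query instead of an exact hash match.
    for cached_vec, cached_answer in cache:
        if cosine_similarity(vec, cached_vec) >= SIMILARITY_THRESHOLD:
            return cached_answer          # no LLM call: cheaper and much faster
    # 2. Cache miss: pay for the LLM once, then remember the result.
    result = call_llm(query)
    cache.append((vec, result))
    return result
```

Pre-populating the cache with an FAQ, as described above, is just a loop of answer() calls over the known questions before any user traffic arrives.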
Another one that was talked about earlier is guardrails, and vector databases can be used as such. One really cool example of someone doing a lot here, not necessarily with vector databases, although it does integrate with LangChain and can do retrieval, is NeMo Guardrails from NVIDIA. It's a really interesting way to have a Colang definition of what the bot should be able to do, how the bot expresses itself, and what its flows are. These are really important for large language models, because they allow the ones that are prone to hallucinations, especially the very large ones, to be used in ways that are much more contained and strict. A vector database can act in a similar fashion: you can say, if no context is returned, return a default answer, or if no context is returned, choose another path. Think of a downward-facing directed acyclic graph that you can branch through. Vector databases like Redis are so fast that you can do multiple context lookups and chain them all the way down, so you don't have to incur as much cost by going to the LLM every time; you simply have that tree of options available. Getting creative with how you pre-compute all of those embeddings and build that directed acyclic graph of options for your model to explore when no context is retrieved is a really interesting way to put a barrier, guardrails, on your LLM. The way I like to explain it: it's like bowling. Sometimes LLMs get stuck in the gutter; I shoot it into the gutter all the time, I'm awful at bowling. Guardrails, just as they sound, are like bumpers: you can be a ten-year-old, or you can be me, and slam the ball down the lane way too hard, and it will still bounce off the bumpers and most likely hit something at the end of the alley, or lane, or whatever they call them in bowling.

All right, long-term memory. I talked about this one last time as well, but it's a really interesting one now that context lengths are a lot bigger. People immediately assume this kind of thing might be irrelevant, but even as context windows get bigger, from my experience it's not always best to just slam the entire previous history into the context window. If every single thing from the previous conversation is in the window, there may be tons of irrelevant information, so doing a different type of lookup that's actually based on a specific user, a specific topic, or a specific query ends up returning much better results. Think about things like how many tokens go into each embedding, and how many specific pieces of context you retrieve for each prompt; I'll get into this a little more shortly, so I don't want to go too deep here, but thinking through each of those things is really important. It's not just "there are 100K tokens now, let's slam everything into the context window" and get charged a bunch for something really computationally expensive; it's better to work smart, not hard.
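A small sketch of long-term memory as retrieval, again assuming numpy and a caller-supplied embed callable: every turn is stored as an embedding, and only the most relevant turns are recalled into the prompt instead of replaying the whole history:

```python
import numpy as np

history: list[tuple[np.ndarray, str]] = []  # (embedding of the turn, turn text)


def remember(turn: str, embed) -> None:
    history.append((embed(turn), turn))


def recall(query: str, embed, k: int = 5) -> list[str]:
    if not history:
        return []
    q = embed(query)
    scores = [
        float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        for v, _ in history
    ]
    top = np.argsort(scores)[::-1][:k]
    # Only these k turns go into the prompt, not the full conversation.
    return [history[i][1] for i in top]
```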
Common mistakes. I thought this would be an interesting one to go over. These are things I've seen people do or talk about that make me think they don't really care, they just want to do demos and then leave; these are common mistakes specifically for production.

The first one is pretty obvious: laziness. I know it sounds funny given I was just talking about context windows, but a big context window doesn't solve everything. As I said I'd get into: you see a lot of use cases where someone just takes one of these integrations (and there are a lot of them over here), and there are a lot of defaults in them, defaults for things like prompt tokens, LLM tokens, specific chunk sizes, or the way tokens are parsed and taken from documents. A lot of assumptions are made, and people using them without understanding them often don't realize how their data has been chunked up; they've never actually inspected it inside their database, never looked at it after it's chunked, or looked at what's actually in the prompt when a question gets answered. Thinking about all of these things is really important, and you'd be surprised how you can use worse models on the generative side and still improve on these factors. Something I've been doing recently is like k-fold, if everybody remembers that traditional machine-learning exercise: take the variables here, context window size, number of retrieved pieces of context, number of tokens per embedding, and those types of variables, grid-search for a good combination, and then, k-fold style, evaluate each combination on different chunks of your dataset. That's a really interesting approach to make sure you've reached the best values for the factors listed here (and there are obviously more) for your specific problem. For undefined problems this is sometimes really hard: if you have a really open-ended problem where the generation could be any number of things, more creative problems, it's going to be really hard to do. But for something like the FAQ case I was talking about earlier, where there are questions with answers that are supposed to be correct, this is something you can do, and something I've done a lot.
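A rough sketch of that grid search, assuming an FAQ-style evaluation set split into folds and two hypothetical helpers: build_pipeline constructs the retrieval-plus-generation pipeline with the given settings, and evaluate scores its answers against the known-correct ones on a single fold; the candidate values are illustrative:

```python
from itertools import product

chunk_sizes  = [128, 256, 512]   # tokens per embedded chunk
top_ks       = [2, 4, 8]         # retrieved pieces of context per prompt
window_sizes = [2048, 4096]      # prompt/context budget in tokens


def grid_search(folds, build_pipeline, evaluate):
    best, best_score = None, float("-inf")
    for chunk, k, window in product(chunk_sizes, top_ks, window_sizes):
        pipeline = build_pipeline(chunk_size=chunk, top_k=k, context_window=window)
        # Average the score across folds, k-fold style, so one lucky split
        # of the data doesn't pick the settings.
        score = sum(evaluate(pipeline, fold) for fold in folds) / len(folds)
        if score > best_score:
            best, best_score = (chunk, k, window), score
    return best, best_score
```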
Sorry, the chat is distracting me; let me check my time. OK, I've got to get going. Index management: how do I set up an index if I have multiple use cases? This is one people either don't really know about or it just comes up a ton. Imagine you're Shopify and you want a per-shop set of embeddings. There are a couple of different ways to do it. You could have one gigantic index with some type of metadata or filter on each record for the store, so the first thing you do is filter down to just the records that belong to that store and then do a vector search. But how many total embeddings are in that index? What's the size of those embeddings? Does your database charge money per index, and how does it charge you? Does it support hybrid queries in the first place, does it support hybrid efficiently, or is the metadata stored somewhere else (which I'll get into in a minute)? Those specifics are actually super important. I called this index management, but it's really index architecture, which would be a better way of saying it. What I've typically done, and this is specific to Redis because Redis does not charge per index, you just pay for memory, is: there is some overhead to having multiple indices, but if you have under, say, a thousand indices, it's better to have multiple small indices, and as soon as it gets over something like a thousand, maybe ten thousand (it really depends on the size of your indices and a couple of other factors), then you move to one large index, or to specifically grouped indices that you can search at the same time asynchronously, and split it up that way. Thinking through how your indices are grouped and which approach you take is really important not only for things like QPS and recall, but also for how much you get charged and how much you are able to spend as your use case grows. So do that ahead of time: project your cost, test performance at mock scale, make a fake schema, make fake embeddings, and really dive into it, because I've seen people go, "wow, I've reached an untenable cost because I get charged per index, and now I have to change all of my back-end code."

OK, separate metadata. These two plots come from a recommendation-system use case. (How are we doing on time, Demetrios? Way over? And there are so many amazing questions in the chat, but don't worry, let me just get through it, I'm going to blitz.) On the left-hand side you have the before: two network hops, because the metadata is stored separately and it requires a second network hop to go get it. If you can use the KNN search itself to return those specific fields, you get a huge boost, because you are eliminating a network hop. There's a lot more here; go check out the talk I gave about it at GTC.
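A sketch of both of those points with redis-py, assuming the index from earlier gained a TAG field called shop_id plus a couple of metadata fields (all illustrative names): the tag filter narrows the search to one shop's records before the KNN step, and return_fields brings the metadata back with the hit in the same round trip, so there is no second hop to a separate store:

```python
import numpy as np
from redis import Redis
from redis.commands.search.query import Query

r = Redis(host="localhost", port=6379)
query_vec = np.random.rand(384).astype(np.float32)  # stand-in for a real embedding

# Hybrid query: tag filter first, then KNN over only that tenant's vectors.
q = (
    Query("(@shop_id:{shop_123})=>[KNN 5 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("content", "product_name", "price", "score")  # metadata in one trip
    .dialect(2)
)
results = r.ft("docs").search(q, query_params={"vec": query_vec.tobytes()})
for doc in results.docs:
    print(doc.product_name, doc.price, doc.score)
```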
Architecture considerations: I'm going to skip that one. There's a lot more here, and go check out Chip's talk if you want an explanation, but all of these things are very important. Providers: if you want a comparison, go to Erik Bernhardsson's repo; I'm going to skip that too. Example architectures, I wanted to get to this. Example architecture one is for AWS, a question-answering architecture with a vector database for the enterprise: Amazon SageMaker for the LLMs, Cohere multilingual embeddings, Prometheus for monitoring, FastAPI for the API layer, CSV ingestion, AWS chatbots, Q&A, and so on. We have Terraform for it down there, and these slides will be on my website, partee.io. There's an Azure-based document intelligence architecture that's very similar, all in those repos, so go take a look; that one actually deploys in one go with Terraform (shout-out Anton, thank you very much). And then there's a general on-premise one, which is really interesting; this one was actually set up by us, it's all on-prem, and it can be deployed anywhere. There's also a Kubernetes version of it if you're interested.

I'm sorry I didn't get to more, Demetrios; I talked too much, apologies, but hopefully I can do it again sometime and I'll post it again. Dude, why couldn't you talk that fast the whole time? I felt like I was talking fast. That was a classic. There are so many incredible questions in the chat, so anyone asking questions there: Sam will probably go in right now and answer them, but he's also in Slack, and we have that community conferences channel, so tag him directly and we can start threads; more people can chime in and it's a bit more organized. The chat is great here, but if we're really trying to have deep conversations, throw it in Slack. Anyway, Sam, for everyone, the main thing of this talk is that you know what you're talking about; people should know you, and especially the horses and cows around my neck of the woods are going to know your name. Next time I'll come with the questions. Yeah, I'm too passionate, I'll talk about it forever, so I just took too much time, but anybody who has specific questions, do tag me like Demetrios said and I'll get to it, I promise. The other thing people can do is click on the Solutions tab on the left-hand sidebar; there's a whole Redis section where you can go super deep and see that Redis has all kinds of cool stuff on offer, and you can enter the virtual booth and check it out. Are you sharing something else on the screen that I should share? I was just going to say: go follow me on Twitter or LinkedIn; first slides to end the talk, he wants to make sure this gets out, which you never know with Demetrios, obviously, but I'll repost them on my website as I do for a bunch of my talks. Go check out the cookbook; Redis Ventures is all up there. Take a screenshot and hit me up on Twitter or LinkedIn. Luckily this time, you guys are the diamond sponsors, so you're going to be among the first talks that come out, hopefully in a few days if we can get to it. In the meantime I'm going to kick you off, and we're going to see each other in two weeks when I'm in San Francisco for the LLM Avalanche meetup. Love having you.
Info
Channel: MLOps.community
Views: 2,387
Keywords: MLOps, Machine Learning Operations, DevOps for ML
Id: tchI22onVYE
Length: 29min 58sec (1798 seconds)
Published: Mon Jul 03 2023