Vector Database for Large Language Models in Production (Sam Partee)

Captions
I see some familiar faces — Eugene, hey buddy. Could you make me a co-host? You should be a co-host now. Yep, okay — feel free to get started.

All right. Greetings, everybody — I see some familiar faces. Chip asked me to come and give a somewhat more in-depth version of my LLMs in Production talk on vector databases and LLMs, and so I wanted to do that. She asked me specifically to go deeper into scenarios, so if you saw that talk, you'll see I'll cover a little bit different content: some different examples, some design patterns, some trade-offs and considerations in these types of architectures — things I've gleaned from deploying these for people, setting them up, and actually running through these use cases. So without further ado, I'll get started. Oh — "host disabled participants screen sharing," classic. Can we get me permissions there? One second. If Zoom would cooperate — could you please make him co-host? Yeah, I'll do that right away.

And before Sam gets started, let me do a quick intro. Sam is a staff engineer at Redis, and he's been very, very active and involved in the LLM space. As we all know, there is this thing called memory that's usually used — and required — in building large language model applications, and Sam today is going to introduce us to some of the nuances and challenges in doing so. Thank you.

Yeah — great, thanks for the introduction. You already know me, so I'll just go ahead and get started. I've got about 20 slides, so I'll get through them in probably about 30 minutes, and hopefully we'll be able to take some questions afterwards.

Where I always like to start is at the top: what are large language models? Most people at this point are familiar with them; most people have used them, or at least interacted with them through ChatGPT. And it's been this long train from the very beginning — everybody remembers the MNIST networks we used to build back when TensorFlow was actually a great framework for building these types of models. Nowadays you don't even build your own models; you just go and get one of the variants (you see three of them listed there), and then you take them and apply them to your task. It's been interesting, as someone who came from high-performance computing, where a lot of large models were trained, to watch this space where essentially no one's really creating their own models anymore. I worked on Determined AI, if you remember that platform — it was bought by HPE and is now part of the machine learning offering they have. Back then it was much more: I have this problem, whether it was CV or something in NLP, and you were building a model specifically to address that task. Well, nowadays there are many more options, and actually — and this will be a consistent theme throughout this talk — the bar for utilizing these models in production has lowered dramatically. Now, that doesn't mean it's easy, and it doesn't mean there aren't trade-offs and things to consider in these architectures.
But the bar at which these types of models can be used to make something like a product in this space has been lowered — lowered from the point where you said: I have a lot of unlabeled data, I have to go through all of that data work, and all of the model building, model variation, and model tuning processes, to then use that in a product. Now the bar is much lower, and you see some of those use cases listed here.

One interesting aspect of these models is that they're not even large enough. I like to say that because it's kind of funny that they're called large language models, but at the same time they don't always include everything they need to accomplish a given task. That's where this concept of vector similarity search and vector databases comes in, which will be a theme throughout this talk and something I've been focusing on explicitly over the past two years. Some of those use cases you see there — we'll go through a bunch of them, specifically that last one and its variants and how it can be used. We'll also talk about some of what you see there on the left: large language model providers, closed source versus open source, application development, what framework you should use. You saw Jerry just raised $8.5 million for LlamaIndex to become a real product, essentially, beyond the open source. This space is exploding, and there's a reason: it's because of how easy it is to actually create really powerful new products with these off-the-shelf models.

So that's a quick summary, but now I'm going to focus on the vector similarity search side, so I can introduce the vector database topic, then talk about how they're actually used together, and then talk about going to production and some examples.

Okay, so I like this slide — I've shown it in probably 10 different talks now — but it's important to actually understand what a vector embedding is: where it comes from and what it does. They are representations of unstructured data, and the way I like to put it to people who aren't familiar with the space is that they're just a list of numbers. Everybody remembers fourth grade, or whenever you took geometry — SOHCAHTOA. Well, once you have one of these vector embeddings, the most complicated operation you're usually running is something in that space. These vectors are no different from the ones back in your school days, when you had a direction and a magnitude, except now the number of dimensions in that vector is 1,536 in the case of the OpenAI models. These models can take in audio and text and images and video, and they can create vector embeddings that are representative of that unstructured data. What I like to say sometimes is that unstructured data is anything that's not in an Excel spreadsheet — an image, audio, text; anything that's not built for an Excel spreadsheet. And these embeddings are packed with information. They're highly dense most of the time — sometimes there are sparse vectors, but they're usually dense vectors where each number truly means something. And when I say "means something," they mean something in relation to another one of those representations, one of those vector embeddings. So when you compare them, the comparison — despite being such a simple operation you did back in grade school — is actually a really powerful operation, because each of the numbers in the list means something really important.
Everybody has probably at some point gone through a CNN tutorial where you start to see the edges and curves of a number — thinking about MNIST again — get bucketed, successively, again and again, into smaller and more important features (or larger, depending on your model). Well, in this case, those same attributes of what's in an image, what's in text, or what have you, are present in that vector embedding through these dimensions. And as I mentioned earlier, these are really easy to create now. You have Hugging Face Transformers — on the right-hand side, if you've done a tiny bit of Python in your life, you can create vector embeddings: import the library, pick a model, take some sentences, and encode them. Boom — you have your vector embeddings.

So I show this one a lot as well; you've probably seen it. I've overly simplified this example, so it's not accurate, but — in case you didn't realize — you can't visualize 1,536 dimensions, so you always have to do some type of reduction into a space you can actually see. This is completely oversimplified for the purposes of the demonstration, but in this case we have a search space of three semantic vectors, like the ones you saw me create with the Hugging Face model on the last slide, for three sentences. The goal here is: if I have some query, and some query vector that I've also created, I want to compare that vector to the rest of the vectors in my dataset. That's what search algorithms do — they are data structures, essentially, that allow you to efficiently search over a large number of these vectors. You've probably heard of some, like HNSW — Hierarchical Navigable Small Worlds — which is essentially a graph-based structure, in some ways, that allows you to search through a highly dimensional space very efficiently. And when you use something like cosine similarity — calculating the cosine of the angle between those vectors — really, that's the distance that you get, and one minus that is the similarity of those vectors. So you see "That is a very happy person" scores 0.943 against the query sentence "That is a happy person." That might make sense, and you might think that BM25 could actually do the same thing. But as soon as you look at "That is a happy dog" or "Today is a sunny day," you start to realize that even though there are overlapping words in those sentences — words BM25 would pick up on — the semantic meaning of each sentence is significantly different.

That's something that will come back throughout this talk: each of those dimensions is more representative of what's actually in that sentence because of the model, and the vector representation that model creates. It's been trained on all of Wikipedia, all of Reddit, and whatever datasets OpenAI and everybody else are collecting to pre-train these models, and all of that information is used to pack this vector with information about what that text actually says. So just as ChatGPT can understand certain things about what you put in that text box, so too can these vectors represent what you put into that model, and hence the vector it creates.

I always point out this one because I wrote a lot of really simple code, just using numpy and whatnot, to actually calculate the vector similarity. If you want to go check it out, it's on the MLOps Community page; Chip also has some really great stuff on this, and check out our blog.
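As a rough illustration of what's described above — a minimal sketch assuming the sentence-transformers and numpy packages; the model name and scores are illustrative, not the exact demo code:

```python
# Encode a few sentences with a Hugging Face sentence-transformers model,
# then compare them to a query with plain-numpy cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

sentences = [
    "That is a very happy person",
    "That is a happy dog",
    "Today is a sunny day",
]
embeddings = model.encode(sentences)           # shape: (3, 384)
query = model.encode("That is a happy person")

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for sentence, emb in zip(sentences, embeddings):
    print(f"{sentence!r}: {cosine_similarity(query, emb):.3f}")
```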
So what is a vector database, then? A vector database is essentially those same data structures I was talking about for efficient search, plus CRUD operations — in some cases not even full CRUD, but usually fully supported CRUD and database-like features — in addition to some type of secondary index or data structure that lets you efficiently search over the vectors you store. I work at Redis and help run the vector database effort, which essentially uses RediSearch to make Redis into a vector database. So you can store vectors in a hash or in JSON alongside other fields — text, numeric fields, geographic fields, etc. — and you can do interesting combinatorial searches between vectors and all of those other fields.

You can do this, as mentioned, with any of these models that are now off the shelf. I'll get into a little bit of when to use which providers, from what we've seen in the field, but really, if you're doing something like putting together a demo and just trying this out yourself, it does not matter all that much. Now, if you're deploying this in production and going for a really powerful search, we'll get into some of the various things that are really important, and I'll point to some resources we've made in conjunction with OpenAI that will help out. So that is essentially what a vector database is: it organizes all of these vectors into the indexing methods I mentioned earlier, and it also provides the CRUD operations — especially in the case of Redis — to change and index them in real time, so that you can deploy them safely and comfortably, as an engineer, into an application.

I'll talk a little bit about this and move on, because I don't want this to be an ad, but we have a bunch of integrations now, as you probably saw. My team's done LlamaIndex, Relevance AI, Metal, LangChain, and the OpenAI retrieval plugin, and we have more on the slate. One thing that's coming: you'll probably see an open PR for RediSearch to add the RAFT-based indexing from Nvidia — we're working with Nvidia; Jensen announced it at GTC — and that will be a really cool new index type that actually runs on GPU. So if you're interested in that, and you have a use case for a really high-scale, really fast, really high-throughput database with that capability, let me know — I'm interested in doing some pilot projects there. We also have a bunch of different distance metrics. And when I say "hybrid queries" here — for those of you who know what a hybrid query is — what I really mean is filtering: pre- and post-filtering of a recall set. So you can combine vector search with things RediSearch has had for a really long time: text (as in full-text search), numeric fields, geographic ranges, etc. And, new in 2.6, you can do this in JSON.
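To make the "vectors plus metadata in one place" idea concrete, here is a hedged sketch using redis-py's search commands against a local Redis Stack. The index name, field names, and dimensions are made up for illustration, and the exact API may differ between versions:

```python
# Store text, metadata, and the embedding in the same Redis hash, then run
# a hybrid (filter + KNN) query so the text comes back with the search.
import numpy as np
import redis
from redis.commands.search.field import TextField, TagField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

schema = (
    TextField("content"),                    # full-text searchable
    TagField("category"),                    # exact-match filter field
    VectorField("embedding", "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": 384,                          # must match the embedding model
        "DISTANCE_METRIC": "COSINE",
    }),
)
r.ft("docs").create_index(
    schema,
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

# One hash holds the text, the tag, and the raw vector bytes together.
vec = np.random.rand(384).astype(np.float32)  # stand-in for a real embedding
r.hset("doc:1", mapping={
    "content": "How to deploy Redis on Kubernetes",
    "category": "devops",
    "embedding": vec.tobytes(),
})

# Hybrid query: pre-filter by tag, then KNN over only the filtered set.
q = (
    Query("(@category:{devops})=>[KNN 3 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("content", "score")
    .dialect(2)
)
results = r.ft("docs").search(q, query_params={"vec": vec.tobytes()})
```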
All right, so now for the interesting part, which isn't just about Redis: some design patterns. These are things I've seen in the field, or that we've deployed for people — and obviously generalized so we can talk about them — things we've seen people do that we consider interesting for this kind of talk. I'll talk about three specifically: context retrieval; large language model memory, which might seem a little odd; and caching — specifically semantic caching, which is a really interesting new way to use vector databases, especially one like Redis.

So, context retrieval. These are often called Q&A loops, though it might not actually be a chatbot — it might be something like document retrieval — but it all looks like roughly the same loop. The point is: we have a vector database set up ahead of time, where we've taken some number of documents, or images, or what have you — usually some body of text — broken it up into chunks, and created vector embeddings that represent each chunk. We've then indexed those in a vector database and created a framework around that vector database so you can ask a question, embed that question into a vector, use it to look up relevant context in the database, and then construct a prompt with all of the information we've retrieved from the vector database.

One really interesting thing about Redis is that it actually allows you to do this within the vector search call. We'll get to this when I start talking about other vector stores, but a lot of platforms right now keep metadata separate. What's really crucial is to store metadata and vectors in the same place, to avoid network latencies. Think about it: say you have a million documents that chunk up into five million embeddings. You don't want to wait a really long time to search through those five million embeddings just to retrieve the context for a large language model call that's already possibly going to take five or ten seconds over the network, in the case of some of the GPT models. So storing that metadata alongside the vectors is really important. In this case, if we have a hash or a JSON document, we can store the text and the semantic embedding that represents that text in the same data structure, and when we do a vector search, we ask for those text fields back. By getting those text fields with the vector search, we can then put them into the prompt that we pass to GPT — this is something that LangChain, or some other framework like Relevance AI, will facilitate — and construct a prompt that answers the question with context.

There are a couple of really important points here that might not be obvious if you're not familiar with vector databases. A lot of people talk about fine-tuning. There are cases where fine-tuning may still be needed in this kind of loop — if the text is out of corpus, say; there are a couple of other reasons too — but this approach is much cheaper and faster than fine-tuning a model. And what I mean by "faster" here isn't necessarily the offline process but the online one: if you have to go re-tune your model — which would be insane for you to do regularly in production — you're going to incur a lot of time. There were people doing that before, and it was a popular approach. Now, instead, you can just update the vector embeddings and the text they represent, or add to that corpus, and you have new context immediately, which allows for real-time updates.
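A minimal sketch of that Q&A loop, assuming the 2023-era openai client and a hypothetical `vector_search` helper standing in for whatever vector database call you use (for example, the Redis query sketched earlier):

```python
# Embed the question, retrieve the closest text chunks, then build a prompt
# that instructs the model to answer only from the retrieved context.
import openai
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def vector_search(query_vec, k=3):
    """Hypothetical stand-in: replace with your vector DB call
    (e.g., the Redis KNN query sketched earlier)."""
    return ["<retrieved chunk 1>", "<retrieved chunk 2>"]

def answer(question: str) -> str:
    query_vec = embedder.encode(question)
    chunks = vector_search(query_vec, k=3)
    context = "\n".join(chunks)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is empty or irrelevant, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```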
So, in the same milliseconds it takes to stash something in your cache, you can now also update a vector embedding in that index, along with the text that goes with it. If you get a new document, a newly answered question in a community, or any new piece of data coming into your dataset, you can update that index and that piece of text in real time, instead of having to wait for a model change and an A/B rollout to be done.

The next thing here is sensitive data — this is a big one. We work with a lot of enterprise companies, and a lot of them don't feel comfortable training a model when they don't know whether it will output something sensitive. Instead, you can do context retrieval, and in that context retrieval you can put in rules for specific pieces of data. Think about role-based access control — RBAC: you can have roles that determine which contexts various users are privileged to see, while the model itself knows absolutely nothing about that sensitive data (there's a sketch of this after this section). There's no way, if you fine-tune your model, to guarantee it won't say something about that sensitive data; if you do it in the context retrieval loop, you can guarantee it.

The last thing I'll say here is about hallucinations, which is a really big topic right now. Vector databases can be used as a sort of guardrail: you can say, "If I retrieve no context, what do I do?" I actually just did an LLM stack hackathon with some of the people who are probably in this Zoom, and one of our judging criteria — Chip and I were judging — was how well a project dealt with hallucinations. It's something we see in production with a lot of these loops, Q&A or not: models love to make stuff up. They do. And one way to stop them from making something up is to tell them when not to — they're instruction-tuned models. So if you instruct it, when no context is retrieved, not to say anything or not to produce an answer — as you see ChatGPT do all the time — that's another way these vector databases can be used. There are some use cases we'll talk about here, and I'll give a couple of examples, but if you want to go check some out: everything I'll be talking about today is on Redis Ventures — that's the GitHub for my team, the Applied AI unit. We're running out of things to slap on the end of "Redis," so we thought "Ventures" sounded cool.
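As promised, a sketch of those two ideas — role-filtered retrieval and a "don't answer without context" guardrail — on top of the Redis index sketched earlier. The `roles` field, distance threshold, and prompt wording are all assumptions for illustration:

```python
# RBAC-style pre-filter plus a hallucination guardrail on the recall set.
from redis.commands.search.query import Query

def retrieve_for_role(r, query_vec: bytes, role: str, k: int = 3):
    # Only documents tagged with the caller's role are even candidates
    # for the KNN search — the model never sees anything else.
    q = (
        Query(f"(@roles:{{{role}}})=>[KNN {k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("content", "score")
        .dialect(2)
    )
    return r.ft("docs").search(q, query_params={"vec": query_vec}).docs

def build_prompt(question: str, docs, max_distance: float = 0.3) -> str:
    # Guardrail: KNN always returns *something*, so check the distance and
    # refuse to answer when nothing retrieved is actually close enough.
    relevant = [d.content for d in docs if float(d.score) < max_distance]
    if not relevant:
        return (
            "No relevant context was found. Tell the user you cannot "
            f"answer this question: {question}"
        )
    context = "\n".join(relevant)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```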
All right — long-term memory. This was actually a project we interacted with quite a bit. It was outside of our group, but it was really interesting, so we wanted to see what was going on with it. The issue is that there's a finite context window. I know some people might say, "But Anthropic has a 100K-token context window now." Well, even with a hundred thousand tokens, if you're trying to build something for, say, your MLOps Learners Slack group, where every single message in that Slack is a vector embedding and you have the text associated with all of it, you eventually reach the point where you can't just stuff everything in. Not only that — all of it might not be relevant, and it might confuse the model into thinking that it is. So instead of the "just stuff everything inside" approach, the idea is: let's find the most relevant pieces of information to actually put into the prompt.

What this long-term memory approach does is, through conversations, it's able to pick out and decide which pieces of a message history are actually relevant to the current conversation. This allows for things like topic changes, logging off, logging back on, continuing a conversation, and having those previously relevant pieces of information returned. Imagine you're building a bot for TripAdvisor, or for Expedia or something. You could type something about going to the Bahamas and then start talking about a trip to Germany, and it wouldn't use the context of your trip to the Bahamas — like some approaches out there today would, the ones that just use your entire previous message history in the session. Instead, it goes and looks up every vector that's relevant to that particular message and picks those pieces to carry the conversation. So you could go right back, after talking about Germany, to talking about your trip to the Bahamas, and it would not get confused picking out the relevant information. This is a project called Long-Term Memory — since you can hold significantly more vectors in a vector database than in a context window, even Anthropic's — and it's useful for a number of things; think about information retrieval as well, not just chatbots.
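A toy sketch of that pattern, with an in-memory list standing in for the vector database so the mechanics are visible; per-user storage, the model, and top-k are illustrative choices:

```python
# Store every message as (text, embedding) per user; at reply time, fetch
# only the messages relevant to the *current* message instead of replaying
# the whole session history into the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
memory: dict[str, list[tuple[str, np.ndarray]]] = {}  # user -> (text, vector)

def remember(user: str, message: str) -> None:
    memory.setdefault(user, []).append((message, model.encode(message)))

def relevant_history(user: str, message: str, k: int = 3) -> list[str]:
    query = model.encode(message)
    scored = [
        (float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))), t)
        for t, v in memory.get(user, [])
    ]
    # Highest cosine similarity first; only the top-k reach the prompt.
    return [t for _, t in sorted(scored, reverse=True)[:k]]

remember("alice", "I want to plan a trip to the Bahamas in July")
remember("alice", "Actually, tell me about hiking in Germany")
print(relevant_history("alice", "What was I saying about the Bahamas?"))
```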
And now a really cool one that we've been actively working on: semantic caching. Let's talk quickly about caching — what does caching do in general? Say you have some web page or some query — say you're in Salesforce and you're looking up a number of companies or something. If you have some list of companies you've had to compute based on some query, you don't want to immediately go run that computation again. You want to take that query and the result associated with it and hash it, such that you can retrieve it with an O(1) hash lookup in a key-value store — something like Redis. Now, that only returns the value when the query is exactly the same, because a hash is an explicit hash of that text, and it's not meant to have collisions — the whole point is that the input has to be exactly the same. But what if you had a really similar query — say it has a similarity score of 0.99, and you know the answer is going to be the same — but it doesn't hash to the same thing, because one user put in a question mark and the other didn't? That query is not the same text, so it will not hash to the same slot.

What we've been looking at is how to take this concept of semantic search and apply it in a caching sense. What you're going to be able to do — and can do right now, by the way; there are people doing this in production, we're just looking at making it easier — is run a vector query over your cache contents and set a threshold for how similar something can be and still return the same answer. Instead of looking up exactly the same thing, you're asking: how similar is this, and is it close enough?

And what does this do? It saves computational cost — that's one of the big things. Every time you do one of these lookups, you create an embedding — that's $0.0002 or whatever it is with OpenAI; I made that number up — and then it also invokes an LLM, which, depending on the model you're using, may be fairly expensive. So you save on cost, and we've seen people deploy this and immediately start saving money — because the truth is, people say the same things. People ask the same questions. If you're building something that answers questions on a website, you're going to get a very similar question from time to time, and returning the same answer is actually even better. Some people go as far as pre-computing a lot of the questions they expect to get, setting a similarity threshold, and testing for when particular questions are of the same nature. That's something you can do ahead of time: instead of relying on users to populate the cache, you pre-compute a lot of those answers, speed up your service overall, and save costs.

The first part I said there is also really important: when you return something from a cache with a vector query — in the case of Redis, you're looking at milliseconds, or in some cases sub-millisecond latencies — that is significantly faster than a large language model. Almost always; I can practically guarantee it, because they're slow. They're really large; they're lumbering giants. So you're not only saving on cost, you're speeding up your service, which is kind of the whole point of caching in the first place. And what does this apply to? Pretty much every single LLM use case, which is really nice, because there's a huge need right now to reduce the cost and increase the speed of these types of architectures. It's a really interesting space to be in.
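A minimal sketch of semantic caching, again with a plain list standing in for a real vector index and a made-up threshold; tools in this space (GPTCache, for example, or a Redis vector index) implement the same idea for production:

```python
# Before paying for an LLM call, look for a cached answer whose *query
# embedding* is close enough to the new query.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
cache: list[tuple[np.ndarray, str]] = []       # (query embedding, answer)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def call_llm(query: str) -> str:
    """Stand-in for the slow, expensive LLM call."""
    return f"answer to: {query}"

def cached_answer(query: str, threshold: float = 0.97) -> str:
    qvec = model.encode(query)
    for vec, answer in cache:
        if cosine(qvec, vec) >= threshold:     # "similar enough" — tune this
            return answer                      # hit: milliseconds, no LLM cost
    answer = call_llm(query)                   # miss: pay for the LLM once
    cache.append((qvec, answer))
    return answer
```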
Speaking of which, let's talk about some architectures I've seen — these are just some considerations. How do I choose a provider? How do I choose a model? Is it open source or closed source? Do I have the actual model itself, or just an API? Do you use a platform, or do you roll your own? Do you take a small model and fine-tune it? You see a lot of progress right now where, instead of using these really large models, people take much smaller models and fine-tune them for a specific purpose, and you see them outperform even some of the largest models. It's sometimes hard to discern when to do one or the other, but a lot depends on your application and your use case. If you need really high QPS and low latency, you might think about doing something like that; but if you need the utmost flexibility, and your team is really small and you're not machine learning experts, you might rather choose the API.

What's the enterprise readiness? Has this model been given an SLA? Are you using something off the shelf? Are you going to get rate limited? Those are all things to think about when you actually take something to prod — that's what we call enterprise readiness; a little buzzwordy, I know. Cost, too — this is a tough one. If you host a Hugging Face model, put it on Triton, and put that on, say, a G4 instance in EC2, did you chart out the cost of that infrastructure versus just hitting OpenAI? There are a lot of times when it's cost-advantageous to do the former, and times when it's cost-advantageous to just use OpenAI. Cost is actually a really important dial here, because sometimes — especially with semantic search — you don't really need excellent embeddings. The better the embedding, the better the search, obviously, but for some use cases that's not really necessary; sometimes it's used as a backup strategy to BM25 or TF-IDF — you know, full-text search. So that's an interesting thing to consider. Scale and performance as well: how much does it cost to host model replicas? How much does it cost to scale out those replicas? And what is the performance of these models in terms of QPS? Thinking through all of those things when you set up these architectures is what I do on a daily basis, and I can tell you it completely depends on the use case. Everybody wants to do something different, and it's interesting how varied this space is compared to, say, normal database usage, where it's pretty much "I need a fast one, choose this," or "I can afford it being slow and relational, use this." There are a lot of different things to consider in this space, including integration providers: can you trust that the providers being put out aren't going to change so quickly that something breaks in production when you update your version? So strategies that have been around in software engineering for years, like A/B testing, are still really important — even more important — in some of these cases, where the space is moving so fast that releases happen on a daily cadence.

So I talked about infrastructure cost, model time, and dev time. One more interesting point here: the large language model providers themselves are somewhat different. I'll show an example later, but you can go play with it now if you want: it's https://docsearch.redisventures.com. It's an interesting demo because it has three sets of embeddings — 1.2 million in total. It has a Hugging Face model (one of the all-MiniLM models), and then it has Cohere and OpenAI, and each of those does something better than the last. They all fail in different use cases as well. For instance, if you use the Hugging Face model and type in queries, it may do very well; but if you translate that query to Portuguese, you'll see completely different results come up. Now, if you take the OpenAI model, do a query, and then translate that query into Portuguese, you'll see it does fine — the semantic meaning behind the languages is slightly different, but the OpenAI model does pretty well. The Cohere embeddings we used aren't actually their multilingual embeddings, but we have used those as well, and those are probably the best for multilingual use cases — I think it's a hundred-some languages, including non-Latin scripts and things like that. It's really, really powerful for situations where you don't want to build a search service for every single language, because you don't need to put in a translation layer. Use cases like that are important to consider when you're choosing what model you're going to use.
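To see that cross-lingual effect yourself, here's a small sketch comparing an English/Portuguese pair under an English-only model and a multilingual one. Both are real sentence-transformers checkpoints, but the sentences and the gap you'll observe are just illustrative:

```python
# A multilingual model maps a sentence and its translation close together;
# a purely English model generally will not.
import numpy as np
from sentence_transformers import SentenceTransformer

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

english = "machine learning on graphics processors"
portuguese = "aprendizado de máquina em processadores gráficos"

for name in ["all-MiniLM-L6-v2", "paraphrase-multilingual-MiniLM-L12-v2"]:
    m = SentenceTransformer(name)
    print(name, round(cos(m.encode(english), m.encode(portuguese)), 3))
```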
Also really important is the performance of the vector database itself. Erik Bernhardsson created Annoy at Spotify, and it was a driving factor of a lot of the ANN activity that goes on. One thing you'll see on his GitHub — erikbern, that's e-r-i-k-b-e-r-n — is his ann-benchmarks repo, a comparison of a number of these different platforms and algorithms: not just databases, but pure algorithms too. The way these systems are benchmarked for approximate nearest neighbors is queries per second at a given recall. The reasoning is: if you do exact KNN, obviously your recall is perfect. If you use ANN, you have a lot of tuning knobs — in the case of HNSW: what's your M parameter, what's your ef_construction, what's your ef at runtime? All of those knobs you see in the HNSW paper matter for determining what your recall is and what your QPS is. This particular one is a single-CPU benchmark where they test the QPS-for-recall of a number of different platforms. Just to shout out — obviously, I work there — RediSearch is the purple one here. At the far end of recall, the three above us are essentially pure algorithms: things like ScaNN, which you see here on the top, and a couple of others — I think pynndescent is this one. We do fairly well in terms of QPS at recall. Honestly, that's because Redis was built as a cache originally — I mean, it's really a database that was popularized as a cache, a key-value store — but it's very fast, so queries per second is something it has an advantage on in a lot of cases.

But it all depends on your scale. If you have 10 billion embeddings, Redis probably isn't your database of choice — you probably want to use something that's not a database at all, because otherwise you're going to be paying through the nose for it. You probably want something like IVF-PQ with FAISS, where you can condense that vector space into a really small footprint. And what you'd do is simply — I've done this before, back before vector databases were a thing — have a Lambda function that loads the index from S3, and every time you call that Lambda, it's already loaded if it was invoked in the last 300 seconds or so. There's a time and place for everything. So just because QPS-at-recall is important doesn't mean it's the only thing, and you should consider that when you set up these use cases — you shouldn't just go, "Oh, such-and-such is the best, so we're going to use it." All of those things I mentioned about models are important for vector databases — and for which one you choose — in the same way.
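For reference, here's what those HNSW knobs look like in code — a sketch using the hnswlib library with arbitrary numbers. Higher M, ef_construction, and ef generally buy recall at the cost of memory, build time, and QPS:

```python
# Build an HNSW index and expose the QPS-vs-recall tuning knobs.
import numpy as np
import hnswlib

dim, n = 384, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# M = graph connectivity; ef_construction = build-time search breadth.
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))

index.set_ef(50)  # query-time breadth: raise for recall, lower for QPS
labels, distances = index.knn_query(data[:1], k=10)
```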
So I'll go through some of this — and this is a very quickly changing space, so if I left someone out who's like, "But I made my vector database yesterday!" — please don't jump down my throat; I tried my best here, and this was a lot to research. You'll notice that everything is skewed to the right; that's because not a lot of it is proprietary and/or self-managed — there are only two in that category, essentially: Pinecone and GSI Technology. We'll skip over those and go right to the new players, in open source and the kind-of-managed versions. You have a bunch of new players: you have Chroma, which is based on the column store ClickHouse; you have Qdrant, which I think is based on RocksDB; and Weaviate, Bob's company; and Milvus, which has actually been around for a long time; and Vespa, which is probably one of the better pure-search vector databases; Vald; and Supabase — Supabase being based on pgvector.

Actually, one quick thing I want to point out, just so everybody realizes why I sometimes take fun shots at pgvector on Twitter: if you look here, pgvector has some of the lowest QPS-at-recall in the game. As I said, that's not important for every single use case, but it's an important thing to point out — if it matters for your use case, you should know. So do your research and look these things up.

Going back: a bunch of new players, and honestly, a lot of them are great, and a lot of them, just like anything, have trade-offs. Then there are what I would call the incumbent databases. Right now, the first group are essentially vector search platforms — some slightly different, like Supabase — that are becoming better databases. They weren't necessarily building a database; they usually wrap something like ClickHouse or RocksDB, and they've become, or are becoming, better databases over time. Versus this other category, which are databases that are becoming better vector search platforms. There's an interesting dichotomy between the two, and it's something to consider when you actually go to production with these kinds of things. There are a number of trade-offs you'll encounter. For instance, some of these platforms have PQ — product quantization — which makes your vector embeddings much smaller: they maintain a codebook with which they can translate those embeddings into versions that consume much less memory, and that's really important. But then they might not have something like an on-prem version that can also active-active replicate to a cloud version, or they might not have a number of different deployment or hosting options, or other features that may be important.

So I'll go through some of those considerations right now. You have team size, ability, and budget — that's kind of the first and most important one. It matters because if you don't have a large team, specific expertise in the space, or a huge budget, you may not go for something self-managed; you might want an easy-to-use cloud platform where you just click to deploy and have it load up for you, and there are a number of options for that. Then, on budget: some options sit lower on the cost spectrum because they may be newer, and some sit higher because they may be more established — so you have a dichotomy there as well. Performance, like I mentioned on the last slide: there may be many use cases — not to pick on pgvector again — where pgvector is completely fine and its QPS-at-recall is totally okay for your use case. But if your use case needs better QPS-at-recall, or another aspect of performance like pure latency, then you may consider a different option. So it's important to bring that to mind when you choose what platform or search algorithm you're going to use.
Capability. Think about something like FAISS: FAISS doesn't have hybrid search — no filtering of the recall set; it's just a vector search library (there's a minimal sketch of what that looks like after this section). So yes, you might be able to put four billion embeddings in it, but maybe it would be better to have a smaller number of embeddings along with the ability to filter the recall set pre or post, which many of these platforms offer — Elastic has their pipelines, for example. You might need to run some type of inference: if you're building a recommendation or search service, you might need post-processing or some other feature that isn't available if you're just using an off-the-shelf search algorithm, and you may need that capability alongside whatever you're using. You may need more database-like capability, like replication, so you get high availability. Those are all things to consider when choosing what you're going to use — you may want to look under the hood.

You may also want pure CRUD. Some of these, like I mentioned with FAISS — and some of these services, by the way, are wrappers around FAISS, or custom variants of it — when you have something that trains on a group of embeddings, you need to consider that when you add a new embedding, you essentially need to retrain. It's not training exactly — it's more of a fit or a transform — but you have to recreate that vector space. That's what allows a lot of the optimizations, but it isn't really pure CRUD; you're not freely inserting and removing. Whereas with something like HNSW — and the people vendoring it into their libraries — you can essentially prune parts of the graph and perform more CRUD-like operations. And obviously, with a flat index and KNN, it's much easier — think of it like list operations.

Integrations: does that platform integrate with your favorite, excuse me, integrator — something like LangChain? Well, you might need it to, and that might be one of the deciding factors right there. You might need it to have a ChatGPT retrieval plugin; if it doesn't, you might not be able to use it. You may want to build more of the platform yourself and use something like LlamaIndex or LangChain or what have you. Or you may want to build a little less and need more ready-made functionality — something like Relevance AI, or even further up. Metal, for instance, sits somewhere between Relevance and something like LangChain: it's really just a managed version of a vector database, so it's really easy to get started, but you're still developing the application yourself; it's not really performing that integration layer. Going up even further, you have things like the Mantiums of the world that do a lot for you: they have their own retrieval plugins, their own platforms doing specific things. I think what you'll end up seeing is proliferation at the high end — some people doing more and more at the highest levels of abstraction, going up even further in the stack and saying, "This is the best one for this particular use case." And that proliferation will get even stronger, I think, as we go on.
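As mentioned above, here is the "pure search library" end of that spectrum made concrete — a minimal FAISS sketch. Note that everything it doesn't do (metadata, filtering, CRUD-style deletes on this index type, persistence) is on you:

```python
# FAISS gives you very fast nearest-neighbor search and nothing else.
import numpy as np
import faiss

dim = 384
xb = np.random.rand(10_000, dim).astype(np.float32)

index = faiss.IndexFlatL2(dim)   # exact search; IVF/PQ variants compress
index.add(xb)                    # no per-document metadata, no filters here
distances, ids = index.search(xb[:1], 5)  # top-5 neighbors of one query
```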
And the last one I was going to talk about today is deployments. Do they have a K8s operator? Do they have Helm charts? Are you going to be deploying this yourself on GKE or EKS or something like that? Do they have people who have deployed that before? Do they have offerings on every cloud marketplace? Do they have HIPAA compliance? Do they have FINRA, or whatever the other one is that's for financial people? As you can tell, I'm not a compliance person. Those are all things to think about, because what you don't want to do is start your startup, get a ton of buzz, get a ton of funding, and then realize your whole architecture suddenly needs to be compliant on a platform that cannot provide that. We've seen that a couple of times — we've had people who have essentially foot-gunned themselves for that exact reason. They come to Redis and go, "Do you have this?" and we're like, "Yeah, we do. We're 15 years old." So that's another thing to consider. Again, not all of these considerations matter for everyone — and there are more; there's a ton to think about when you're choosing this architecture and this setup. Everything from the model, to the database, to the way they communicate, to the integrating layers — all of it matters, and all of it will change how your application performs.

So now let's go through some use cases. This one's pretty easy — you'll see some examples of it on Redis Ventures — and I talk about it because it's fun. There's another version of this, not quite as cool, that shows you how vector search works; this one is ecommerce.redisventures.com — no dash in "ecommerce." You take a bunch of product reviews, descriptions, what have you — and this is actually really powerful; it almost feels like it shouldn't be this powerful, but it really is. Even if you just go on Amazon and scrape all the product descriptions — which may or may not be legal, so don't quote me on that — those text embeddings are actually really good at providing a user search service. As I mentioned earlier, that's not everything: if you're going to deploy this as an actual search service in production, you should go look at Colin Jarvis's notebook from OpenAI — it's in the OpenAI cookbook, under apps, and it's the enterprise search one. You'll see how he implements things like re-ranking, which actually give you a true, more capable search service. But even off the shelf, it makes for a great demo. You take those reviews and descriptions, create semantic embeddings, and index them in something like Redis — it doesn't have to be Redis. You create an application backend that also communicates with GPT — say we're using 3.5-turbo here. We take those vector embeddings, retrieve context, and chain it all together with something like LangChain, an integration layer. We can then provide that Q&A loop we talked about, with a chat dialog, from the same backend, and this will look a lot like what you see on the website I mentioned.
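A hedged sketch of wiring that loop together with 2023-era LangChain — Redis as the vector store, GPT-3.5-turbo as the LLM. Class paths and signatures move around between LangChain versions, and the product texts here are made up:

```python
# Index product descriptions in Redis, then answer shopping questions with
# a retrieval Q&A chain. Requires OPENAI_API_KEY and a local Redis Stack.
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis
from langchain.chains import RetrievalQA

descriptions = [
    "White collared shirt with a chest pocket, slim fit",
    "Blue denim jacket with brass buttons",
]
store = Redis.from_texts(
    descriptions,
    OpenAIEmbeddings(),
    redis_url="redis://localhost:6379",
    index_name="products",
)
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    retriever=store.as_retriever(search_kwargs={"k": 3}),
)
print(qa.run("I would like a stylish white collared shirt with a pocket"))
```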
This allows users to say something like, "I would like a stylish white collared shirt with a pocket." If you go and look that up on a number of different search services or e-commerce platforms, it usually doesn't return many good results — sometimes search services are a little better — but you can say it in natural language like this, in a way that's more indicative of how you would actually speak. Oh — look at the cute dog; there's a cute dog in the Zoom, if everybody hasn't seen it. So imagine just talking to your phone, like Siri — which I imagine has to get better very soon: "Hey Siri, order me a stylish white collared shirt with a pocket, size 36, yada yada yada." You're not clicking through a GUI anymore to say I want a medium, I want this, I want this — natural language will actually do that filtering. It will pick that out, do the filtering, and then actually select a shirt — in this case a collared shirt — that represents what you said. And in that case, by looking up those embeddings, generating the response, and actually looking up the products, you can provide not only a chatbot-like experience but also send back product numbers and product links — build an affiliate platform if you'd like, return that affiliate link, and make some money on it. So that's just one example. I've actually seen someone build this, and they're going to make it a product — it's obviously more involved than this architecture shows, but the point here is to explain it.

This is another very popular example right now. It was actually made by a guy named Fabrizio, who's a GBB — a Global Black Belt — which I've always found interesting: Microsoft uses it to say someone's really good at programming, even though karate doesn't have anything to do with programming. But that doesn't matter. In this case, you take your knowledge base — PDFs, documents, text, what have you — and instead of using something like LlamaIndex, you use all of these Azure capabilities: you can see Form Recognizer, Translator, and then the Azure OpenAI Service. Satya was smart enough to get 49%, so he's able to host these models in Azure. And Redis is actually a first-class platform on Azure as well, so you can use Azure Cache for Redis Enterprise, a.k.a. ACRE. You can deploy this whole thing either with Redis Stack on your laptop — just put in your Azure credentials — or actually deploy it on ACRE itself. That same loop we talked about earlier with the OpenAI models, you can do completely on Azure. In this case, instead of returning products with a chat experience, we're returning documents, and you can implement the things we talked about — long-term memory, semantic caching — and both apply here. You'll have people looking up the same documents; you'll have people issuing the same query; and you'll have people who, over time — think about how many times you've looked up the same thing in a Slack search or some other service, maybe with a slightly different variant each time. In the same vein, you may be hoping you can chat with this service and have it remember those specific conversations at a later date, without confusing the most recent context with the most relevant context. All of those design patterns I mentioned before apply to both of these examples. So even though these are two relatively lightweight examples of what you can do, it's still an incredibly powerful system to put into production, and that's really possible because of this intersection of large language models and vector databases.
So that's it — thanks. Awesome, great timing; I was just about to give you a time check. Nice. So, yeah, we have a whole lot of questions; we'll see how many we can get answered. Do you want to get started on the Slido? Yeah. So the first question is: is there an example of Redis and hybrid search — for example, BM25 plus semantic search — that we can try?

Yeah, absolutely. I would look at that enterprise notebook that Colin and OpenAI put together — OpenAI cookbook, apps, then "enterprise search" or "enterprise" something — mostly for the full app. But if you just want a Jupyter notebook, you can go look at what my team has been doing in the OpenAI cookbook as well — I think it's under examples, vector databases, redis — and there's Q&A, there's using JSON, there's hybrid. And if you don't like any of those, you should definitely check out Redis Ventures — just the GitHub organization — because everything my team's done is there, including the two apps I mentioned, which I may have included on here — oh yeah, here we go, bang. That's the GitHub repo. There's a Q&A example, plus the product search, which is the e-commerce link, and also the doc search one. Both have examples of that hybrid search: for instance, on the arXiv search one — arXiv papers — we filter by year, or by a category of the paper, and that's filtering — or, excuse me, pre-filtering — the recall set.

Next: do you also have any advice for developing with Redis locally, for speed? Yeah: Redis Stack. docker run Redis Stack — Google "Docker Hub redis-stack." Super easy. And, you know, unless you're AWS and you're going to host Redis Stack as a service, you can get your startup going on just Redis Stack. If you maintain that in the beginning, slap it on an EC2 instance, you can get rolling — some people do that — and it gives you the option to upgrade later. It's great; that's why we do what we do with Redis Stack. So seriously, just docker run that Redis Stack image. If you go look at something like the Q&A example, you'll find docker-compose examples on Redis Ventures that let you just docker compose up: it'll bring up the Jupyter notebooks and you'll be able to just use it. And one thing that's nice about Redis Stack, which you should know: when you do that docker run and open those ports, it starts RedisInsight, which we've been putting a ton of work into — it's a lot better now. You'll be able to see everything — hash, JSON, etc. — in the actual database, click through, and edit values. It's a really good GUI, like peering into your database; you no longer have to just use redis-cli. And thanks to all the people saying thank you — good to see you, Eugene; I appreciate it, Dan. But yeah, that would be the best way to get started locally: just docker run Redis Stack. I think you can also just install it with a package manager — apt or yum — I'm pretty sure you can do that as well. We have examples in Google Colab where you just install it and run it in the background within Colab — I'm not sure you should actually be doing that, but you can.
Awesome. I know we're at time — slightly overtime, actually — so now we have a couple of choices, depending on your preference, Sam. One is we can take some of the rest of the Slido questions offline, or, if you're willing, we can continue. Yeah, sure, let's do a couple more. Okay, a couple more — perfect.

What's the difference between long-term memory and context augmentation? I feel they're both similar. Yeah, they absolutely are. The difference is that one is more static and one is meant for more of an interaction-based experience. In context retrieval, you're usually not updating that contextual database — the one filled with your documents or what have you — every time the user interacts with the service, and it doesn't necessarily have to be dialogue. With long-term memory approaches, every single message is turned into an embedding and put into the database specific to that user, so that user's interactions are actually stored per user. In a contextual retrieval setting, it's really more like one big knowledge base for all of the users of the service. But they are similar — they do essentially the same thing; it's just a different system.

How should we think about LLM caching — save money, speed it up? Sounds good. Yeah, that's what it's meant for. Instead of direct hashing — think, you know, MD5, those kinds of things — say I have "12345" and I hash it with Python's hash function. The only thing that's going to hash to that same value, typically, without a collision — I mean, you may have a collision depending, but let's not get really specific here — is "12345" exactly. But if I have "1234," in some cases I may actually want that cached value, because I think it's semantically — or visually, or what have you — similar enough, based on the embedding representations, to return that same answer. That's really the purpose of semantic caching. And if you have more questions on that — you know what, I'll just put my email in here. Definitely feel free to email me with those questions: if you're like, "I want to pilot the Nvidia stuff," or "I want to check out the caching," just hit me up — happy to answer questions.
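A tiny illustration of that point: two nearly identical queries hash to completely different keys, which is exactly what a classic cache wants and exactly what a semantic cache works around:

```python
# Exact hashing treats near-duplicates as unrelated keys.
import hashlib

a = "What is a vector database?"
b = "What is a vector database"        # same question, missing the '?'

print(hashlib.md5(a.encode()).hexdigest())  # completely different digests,
print(hashlib.md5(b.encode()).hexdigest())  # so a key-value cache misses —
# while the two embeddings would be nearly identical, so a semantic cache hits.
```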
All right, and the last question here: what resources on vector databases and LLMs do you recommend? Yeah, there's a lot. Chip has some great ones, so just check out Chip's blog — Chip's great in general. And follow people like Harrison Chase, Jerry Liu, or Simba Khadder, who's getting into the space. Also check out the Redis Ventures repo and the OpenAI cookbook. And there's this thing called the redis AI resources list, which has a lot of them — specific to Redis, but it's actually a really good place to start sometimes. There are also some courses being put out by OpenAI, so check those out. I can't think of other ones on the spot, but if you're really interested in the space, you should just come to the LLMs in Production part two event. Last time there were like 5,000 people — we were expecting like a thousand, and when I say "we," I mean the MLOps Community; I'm just a part of that as well — and this time it's supposed to be even bigger. So definitely go check that out; it'll be a really informative one. You have people like Matei, who founded Databricks — or helped found it; he's the CTO or CEO or what have you, I don't know — and a lot of other heavyweights in the space coming to talk. And, not to count myself as one, but I'm also talking at it, so you should absolutely come through — it'll be a really good one.

All right, thank you — and thank you, everyone, for joining us, and thank you, Sam, for presenting. Yeah, of course.
Info
Channel: MLOps Learners
Views: 8,279
Keywords: Vector Database, Machine Learning, LLMs
Id: 9VgpXcfJYvw
Length: 62min 21sec (3741 seconds)
Published: Wed Jun 07 2023