Open Source LLM Search Engine with LangChain on Ray

Video Statistics and Information

Captions
Hi, my name is Waleed Kadous. I'm the head of engineering here at Anyscale, and I'm really excited to talk to you today about the combination of LangChain and Ray. What we'll be talking about today is why LangChain is such an awesome library and why it's so popular within the large language model community, and then I'll show you a little bit about how Ray complements LangChain: by letting you build indexes more quickly, by serving LangChain on the web directly, and by letting you put the models and the serving in the same place to keep latency nice and low.

So let's imagine that I'm trying to build a search application. This is part one of three parts. In part one I'm just going to build a little search engine; in part two we'll look at enhancing it to use large language models to summarize the findings. But even doing these basic things with a search engine shows what's possible.

Imagine that I'm trying to make docs.ray.io more accessible. The first thing I can do is use the power of large language models to create a really nice semantic search engine, one that isn't just finding words but is actually finding meaning. We'll talk through how we do this in a moment, but it's really exciting just how easy it is. I remember when building a search engine used to take lots of machines and lots of organizing, and you still wouldn't get great results. It's amazing to me how LangChain makes this easy.

The process we're going to go through is essentially this: we start with the docs.ray.io website, which we've scraped. We load the files into memory, and then we divide each page into little chunks; it turns out this generates about 48,000 of them. They aren't exactly sentences, they might be slightly longer or slightly shorter. Then we use a sentence transformer model that Hugging Face makes available to project every chunk into a 768-dimensional vector. This process of taking a sentence and converting it into a long vector is called embedding. Finally, we store those vectors in an index, and once we have the index we can find the information.

I'm now going to go to our code and walk you through it; it shouldn't take very long. The first thing we'll do is just look at how easy LangChain makes this. We specify where the index is going to go, and that's all I need to specify with LangChain to pull all the docs into memory. Then I define a very simple text splitter that takes all of those pages and cuts them into little pieces. Sure enough, we load the documents and create the chunks from them, making sure to keep our metadata. The final stage is, for each of those individual little chunks, to embed it into the vector space, put it in FAISS, which is a good vector store, and then save the index locally. We used FAISS here, but LangChain actually supports a huge variety of different vector stores.

This works really well. The only issue is that it takes a few minutes. Obviously docs.ray.io is not very large, only about 130 megabytes, but having something this slow operated at scale could be an issue.
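For readers who want to try this themselves, here is a minimal sketch of the index-building step described above. It assumes the 2023-era langchain package layout, the all-mpnet-base-v2 sentence transformer (768-dimensional embeddings), and placeholder local paths; the exact loader, splitter settings, and paths used in the video may differ.

from langchain.document_loaders import ReadTheDocsLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Load the scraped docs.ray.io pages from disk (the path is a placeholder).
loader = ReadTheDocsLoader("docs.ray.io/en/master/")
docs = loader.load()

# Cut each page into small, roughly sentence-sized chunks, keeping the
# source URL in the metadata of every chunk.
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=20)
chunks = splitter.create_documents(
    [d.page_content for d in docs],
    metadatas=[d.metadata for d in docs],
)

# Embed every chunk with a sentence transformer and store the vectors
# in a FAISS index saved locally.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
db = FAISS.from_documents(chunks, embeddings)
db.save_local("faiss_index")

Because LangChain abstracts the vector store, FAISS could be swapped for another supported store with only a small change to this last step.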
This is where the power of Ray and LangChain together becomes really impressive. What I'm going to do is switch over to my JupyterLab, clear the screen, and start a process, and while it's running I'll come back and discuss what that process is doing. The first part of the pipeline is exactly the same, but then we split the work into eight different shards. We run the embedding process, which can be quite GPU intensive, on eight GPUs at once, and finally we combine the results into a single merged FAISS index.

As you can see, it's running very nicely. It has loaded the documents into memory, it has constructed those eight shards, and now each shard is processing its slice of the roughly 48,000 chunks, loading the embedding model once per GPU. We can see that if we look at the cluster: this is a machine with eight GPUs, and all of them are running hot and chunking away at the data. Very quickly the work finishes. Some shards finish sooner and some take slightly longer, but a job that took five or six minutes is now done in a fraction of that time. That's an example of how you can use Ray to speed things up.

All right, let's go back. Now that we have our index, how can we serve it? This is again one of those cases where Ray helps you do something without much effort. A web request comes in, usually just a sentence. We transform that sentence into a vector, just as before, using the power of embeddings, and then we find the four most relevant chunks that help the user understand the answer and return them as the response.

Let's have a look at the source code for that, just to show how easy it is. All we're doing is adding a Serve deployment annotation; in the constructor we load the database and our embeddings, and when a request comes in we simply call our search method. It retrieves results by maximal marginal relevance and then does some nice post-processing. To start the deployment, all we need is a command that names the file containing the vector store logic and the deployment we've created. Sure enough, it's starting up now; it might take a few moments because it's packaging everything up to make it ready for serving.

And now, the moment of truth. We've written a very small program called query.py that simply takes a request at the command line and sends it to the endpoint we've just deployed. Let's see how that works. I'm going to ask the engine, "Does Ray Serve support batching?" and sure enough, it returns a number of pages that show exactly where to find the information about Ray Serve's support for batching. Thank you very much, folks.
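To make the sharded embedding step above concrete, here is a rough sketch: the same pipeline, but with the chunks split into eight pieces, each embedded on its own GPU as a Ray task, and the partial indexes merged into one FAISS index at the end. The function and variable names are illustrative rather than the exact code shown on screen.

import numpy as np
import ray
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

ray.init()

@ray.remote(num_gpus=1)
def embed_shard(shard):
    # Each task loads the embedding model once on its own GPU and
    # builds a partial FAISS index over its share of the chunks.
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
    return FAISS.from_documents(list(shard), embeddings)

# `chunks` is the list of roughly 48,000 document chunks from the splitting step.
shards = np.array_split(chunks, 8)
partial_indexes = ray.get([embed_shard.remote(s) for s in shards])

# Merge the eight partial indexes into one and save it locally.
db = partial_indexes[0]
for other in partial_indexes[1:]:
    db.merge_from(other)
db.save_local("faiss_index")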
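And here is a rough sketch of the kind of Serve deployment described above: a class marked as a Ray Serve deployment that loads the FAISS index and embeddings once at startup and answers queries with maximal marginal relevance search. The file name, route, and response format are assumptions, not the actual code from the video.

from fastapi import FastAPI
from ray import serve
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class VectorSearchDeployment:
    def __init__(self):
        # Load the same embedding model used to build the index, plus the saved index.
        embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
        self.db = FAISS.load_local("faiss_index", embeddings)

    @app.get("/search")
    def search(self, query: str):
        # Maximal marginal relevance keeps the four results relevant but diverse.
        results = self.db.max_marginal_relevance_search(query, k=4)
        return [{"source": r.metadata.get("source"), "text": r.page_content} for r in results]

deployment = VectorSearchDeployment.bind()

If this lived in a file called serve_index.py, you could start it with a command along the lines of "serve run serve_index:deployment", and a small query.py could simply send an HTTP GET to http://localhost:8000/search with the question as a query parameter.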
I hope you can see now why I'm so excited about the combination of LangChain and Ray. If you want to find out more, please come to anyscale.com; there's a lot of information there on how to do this, as well as a blog post covering this material with links to the actual code. I look forward to talking to you next time.
Info
Channel: Anyscale
Views: 12,084
Id: v7a8SR-sZpI
Length: 7min 36sec (456 seconds)
Published: Wed Apr 19 2023