LangChain Retrieval QA Over Multiple Files with ChromaDB

Captions
All right. In this video, we're going to have a look at using LangChain with multiple documents and ChromaDB. In the previous one, we just looked at one PDF file, and we weren't really using any database; we were just using FAISS in memory. In this one, we're going to be actually writing a database to disk, so this can be a Chroma DB. We're going to use multiple files, in this case text files. We're going to get some source info, so that we can give some citation information back when people do a query. And at the end, we're also going to throw in the new GPT-3.5-turbo API.

First off, just set up LangChain like normal. You're only going to need the OpenAI key here. In this one, I'm going to be using OpenAI for the language model and for the embeddings. In the next video, I'll do a version of this with a Hugging Face embedding so you can see how it will turn out with those; it's not a lot different, I just didn't want to over-complicate this particular notebook. So you can see what we're going to be bringing in.

The first bit is just loading the multiple documents, and this is pretty simple: we basically just pass in a folder. I've downloaded a set of news articles here, basically articles from TechCrunch that I quickly scraped this afternoon, a bunch of recent articles that were on TechCrunch. You can put any text file in there; I think I've got over 10, maybe 12 of them. If we have a look in here, we can see, okay, we've got quite a few of them in there that we're actually getting the information from. So first off, we're just going to set the directory we're going to get them from, and we're just going to glob the files, so we're just doing star dot txt. If you were just doing one long text file, you would just do it like this. If you're doing PDF files, rather than use a text loader you would use the PDF loader; you'd just change this here, and that's pretty simple for any of the file types. If you're using Markdown files, you just change this to MD.

All right. So we bring those in. We then split our data up into chunks; we've covered that before. And sure enough, we've got our documents here, where each one is basically giving us a chunk of the info that was in a particular article.

Next up, we want to create a database. So here we're creating the vector store, and we're going to store it in a folder called DB. We need to initialize the embeddings first; like I said before, here we're using OpenAI. We will swap these out for some local embeddings in the near future. And then we're just going to go Chroma from documents: we pass in the texts, we pass in the embedding, and we pass in the directory that we want to persist this in. Once we've done that, it actually saves out to DB, and you'll see that in there we'll have an index and a whole bunch of other things as well. That's now encoded all of the documents that we put in, so we can actually just get rid of this if we persist it out, and then we can bring it back in. I'll show you at the end actually deleting it all and loading it again as well. But the idea here is just to show you that once we've got that on disk, as long as we've saved it somewhere, we can reuse it; we don't have to go and embed all the documents again.
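As a rough sketch, the loading, splitting, and persisting steps described here might look like the following, using the LangChain API from around the time of the video. The folder name `new_articles` and the chunk overlap of 200 are assumptions; the transcript only confirms TechCrunch text files, 1,000-character chunks, and a persist folder called `db`.

```python
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load every .txt file in the folder (swap TextLoader and the glob
# pattern for a PDF or Markdown loader if you have other file types)
loader = DirectoryLoader("./new_articles/", glob="./*.txt", loader_cls=TextLoader)
documents = loader.load()

# Split the articles into roughly 1,000-character chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

# Embed the chunks with OpenAI and write the vector store out to ./db
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(
    documents=texts,
    embedding=embedding,
    persist_directory="db",
)
vectordb.persist()
```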
Now, that might not be a big deal when we're using 10 or 20 text files, but if you had a thousand files that were quite long, you don't want to be doing that every time you launch your app; you want to save it somewhere and then just use it later on.

Okay. Once we've got this vector DB, we're going to make it a retriever. Just to show you: once it's a retriever, we can just say get relevant documents and pass in a query. The way I've come up with the queries is by looking at the titles: they mentioned something about Databricks, something about the CMA and generative AI, something about Hugging Face, and one of them was about Pando or something, so that's where I got the questions from. By default, it's going to return four documents; in this case, I'm just going to use two, but you can play around with the number you want for this. If you are querying a lot of information, we often find that around five is a good number, taking the top five. In the future, we'll look at things like multiple indexes, where you'd bring in different documents from multiple indexes as well, but here we're going to set it back to two. So on the retriever I can just set k=2. The search type I'm using is similarity search, and you can see, if I look at the search arguments, that I've got the k equal to 2 there. So at this stage, my vector DB and my retriever are all set up.

Now, I just want to do the actual language model chain part. So here I'm basically just going to make a RetrievalQA chain, and we're going to pass in our OpenAI LLM. We're going to do "stuff" as the chain type, where we just stuff the context in, because we know that, in this particular case, the two contexts, at a thousand characters each, are going to be fine for length and so on. We then pass in our retriever, and we set return source documents equal to true. Now, I could set verbose equal to true if we wanted to see what's going on in the background as this runs; in this case, I'm not doing that, but you've seen me do that in a lot of the other videos, and that's something you can put into any chain, or any agent, if you want to see more about what's going on while it runs. I'm also going to make a little function here just to take the output of these and print it out nicely, so we can see the result that we're getting back from the query and also what the source documents are.

So here we come along and ask our first query: how much money did Pando raise? Straight away, you can see the two source documents it brought up; one is about supply chain startup Pando landing a $30 million investment, hence me asking this, because it's pretty easy to check. And sure enough, it says Pando raised $30 million in a Series B round, bringing its total raised to $45 million. So that one's clearly done it, and we've got the sources here too. Originally these were just HTML files, so we could actually process this to have a link back to the original document. If we had 10,000 articles and we wanted people to go back and see the original source HTML page, we could put that in here quite easily.
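A sketch of the retriever and chain setup just described, again using the LangChain API of that era. The helper name `process_llm_response` is my own label for the little pretty-printing function; the transcript doesn't name it.

```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Turn the vector store into a retriever that returns the top 2 chunks
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

# You can query the retriever directly to see which chunks come back
docs = retriever.get_relevant_documents("How much money did Pando raise?")

# Build the QA chain: "stuff" simply packs both chunks into the prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

# Hypothetical helper: print the answer, then the files it was drawn from
def process_llm_response(llm_response):
    print(llm_response["result"])
    print("\nSources:")
    for source_doc in llm_response["source_documents"]:
        print(source_doc.metadata["source"])

process_llm_response(qa_chain("How much money did Pando raise?"))
```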
If we look into this a little bit: if I ask "What is the news about Pando?" and don't run it through the function for tidying it up, we can see that, okay, we get our result back. The news is that Pando raised $30 million in a Series B, the money will be used to expand its global sales, and it tells us a bit more information besides. But we can also see that we now get the actual source documents back here: this is the top document in this case, and this is the second top document. Now, this one seems to have the $30 million part, and it also says who led the round, those sorts of things. So if we ask who led the round, sure enough, it's able to answer that quite easily.

Picking some other questions, just to see that it's not always going to return the same thing: what did Databricks acquire? It tells us Databricks acquired Okera, a data governance platform with a focus on AI. What is generative AI? Here we're getting the answer back from two different articles. The reason I came up with that one is that there was an article about the CMA's generative AI review, so I was curious to see whether that would come back, and it didn't. But when I ask "who is the CMA?", sure enough, I'm getting that the CMA stands for the Competition and Markets Authority, and it's giving us back the article about the CMA.

If we look at this chain, we can see the retriever's search type is similarity, which we knew; this is just to show you that everything we set before has gone into this thing. We can also see that the Chroma DB is our vector store there. And if we look at the template, we can see that two things get passed in: the context, which is the two documents that we're querying back, and the actual question, which is the query. It reads: use the following pieces of context to answer the question at the end; if you don't know the answer, just say that you don't know, don't try to make it up.

So that's basically it. To check that this persistence is working, we can come along and zip our DB up, delete the vector store, delete the actual folder, and restart the runtime. You can see that when I restart the runtime and come back in, I unzip it first, and I have to put in my OpenAI key again. This time, though, I've gone for the turbo API for the language model part. We set up the DB by just pointing at the persist folder, which was named DB. We need to set the retriever here; we could actually just put that on the end there to make it easier, but anyway, that's just showing you what we're doing. And then here I'm setting up the turbo LLM so that we can use it if we want to. Setting up our chain again, now with the turbo LLM, we're using the GPT-3.5-turbo API here; everything else is exactly the same. Running it, asking the same question, sure enough, it's getting the answer back.
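A sketch of that reload-and-swap step, under the same assumptions as above: the persisted `db` folder is re-opened without re-embedding anything, and the only change to the chain is the chat model. The final prints are a guess at how you would dig out the chat prompt's system and human messages in that version of LangChain; the exact attribute path may differ between releases.

```python
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Re-open the persisted vector store -- no re-embedding required
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory="db", embedding_function=embedding)
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

# Same chain as before, but with the GPT-3.5-turbo chat model
turbo_llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
qa_chain = RetrievalQA.from_chain_type(
    llm=turbo_llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

# Peek at the chat prompt: message 0 is the system prompt, message 1 is
# the human prompt (this attribute path is an assumption for this era)
chat_prompt = qa_chain.combine_documents_chain.llm_chain.prompt
print(chat_prompt.messages[0].prompt.template)  # system message
print(chat_prompt.messages[1].prompt.template)  # human message
```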
Now, if we look at the prompts for the version where we're using the turbo API, you'll find that just printing out the same prompt as before won't work; we'll run into issues. So here we basically have to look at the system prompt and the human prompt. This is the system prompt here: the first message in there is the system message, and it says use the following pieces of context to answer the question, and then we pass in the context. And then we pass in the question in the human part. So that shows you using the turbo LLM as well.

All right, this sort of gets us up to speed a little bit more on using a proper vector database, not just storing it purely in memory, but actually getting it on disk. In some future videos we will look at using Pinecone, if we wanted to put it up as an API somewhere that we can ping like that, and we'll look at using our own embeddings for the lookup rather than using the OpenAI ones. Anyway, as always, if you have questions, please put them in the comments below. If you found this useful, please click like and subscribe, and I will see you in the next video. Bye for now.
Info
Channel: Sam Witteveen
Views: 60,089
Keywords: OpenAI, LangChain, LangChain Tools, python, machine learning, natural language processing, nlp, langchain, langchain ai, langchain in python, gpt 3 open source, gpt 3.5, gpt 3, gpt 4, openai tutorial, prompt engineering, prompt engineering gpt 3, llm course, large language model, llm, gpt index, gpt 3 chatbot, langchain prompt, gpt 3 tutorial, gpt 3 tutorial python, gpt 3.5 python, gpt 3 explained, LangChain agents, pdf, chat pdf, vector stores
Id: 3yPBVii7Ct0
Length: 11min 49sec (709 seconds)
Published: Mon May 08 2023