All right. In this video, we're going to have a look at using LangChain with multiple documents and Chroma DB. In the previous one, we just looked at one PDF file and we weren't really using any database; we were just using FAISS in memory. In this one, we're going to be actually writing a database to disk, and that database is going to be Chroma. We're going to use multiple files, in this case text files. We're going to get some source info so that we can give some citation information back when people do a query. And at the end, we're also going to throw in the new GPT-3.5-turbo API.

So first off, we just set up LangChain like normal. You're only going to need the OpenAI key here. In this one, I'm going to be using OpenAI for the language model and for the embeddings. In the next video, I'll do a version of this with Hugging Face embeddings so you can see how it turns out with those. It's not a lot different; I just didn't want to over-complicate this particular notebook. So you can see what we're going to be bringing in.
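For reference, the setup amounts to something like this; the install line, package list, and variable handling are my assumptions rather than anything shown explicitly in the video:

```python
# Install the pieces used in this walkthrough (package names are assumptions)
# !pip install langchain openai chromadb tiktoken

import os

# The only credential needed here is the OpenAI key,
# used for both the language model and the embeddings.
os.environ["OPENAI_API_KEY"] = "sk-..."  # paste your own key here
```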
So the first bit is just loading the multiple documents, and this is pretty simple: we basically just pass in a folder. I've downloaded a set of news articles here; these are articles from TechCrunch that I quickly scraped this afternoon, a bunch of recent articles that were on TechCrunch. You can put any text file in there. I think I've got over ten of them, maybe twelve. If we have a look in here, we can see we've got quite a few of them in there that we're actually getting the information from.

So first off, we're just going to set the directory where we're going to get them from, and we're just going to glob the files, so we're just doing star dot txt. If you're just doing one long text file, you would just do it like this. If you're doing PDF files, rather than use a text loader, you would use the PDF loader; you'd just change this here, and that's pretty simple for any of the files. If you're using markdown files, you just change this to MD.

All right. So we bring those in. We then split up our data into chunks; we've covered that before. And sure enough, we've got our documents here, where each one is basically giving us a chunk of the info that was in a particular article.
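That loading-and-splitting step looks roughly like this in code; the folder name and the chunk settings are my assumptions (the transcript does mention roughly thousand-character chunks later on):

```python
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every .txt file in the folder; swap TextLoader and the glob for
# PyPDFLoader and "*.pdf" if you have PDFs, or "*.md" for markdown files.
loader = DirectoryLoader("new_articles/", glob="./*.txt", loader_cls=TextLoader)
documents = loader.load()

# Split the articles into overlapping ~1000-character chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
print(len(texts))
```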
All right, next up, we want to create a database. So here we're creating the vector store, and we're going to store it in a folder called DB. We need to basically initialize the embeddings first; like I said before, here we're using OpenAI, and we will swap these out for some local embeddings in the near future. Then we just go Chroma from documents: we pass in the texts, we pass in the embedding, and we pass in the directory that we want to persist this in. Once we've done that, it actually saves out to DB, and you'll see that in there we'll have an index and a whole bunch of different things as well.
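A sketch of that step, assuming OpenAI embeddings and a persist directory called db (the variable names are mine):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

persist_directory = "db"

# Initialise the embeddings (OpenAI here; swapped for local embeddings later).
embedding = OpenAIEmbeddings()

# Embed the chunks and write the vector store out to the persist directory.
vectordb = Chroma.from_documents(
    documents=texts,
    embedding=embedding,
    persist_directory=persist_directory,
)
vectordb.persist()
```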
That's basically now encoded all of the documents that we put in, so we can actually get rid of the in-memory copy: if we just persist this out, then we can bring it back in. I'll show you at the end actually deleting it all and loading it again as well. The idea here is just to show you that once we've got that on disk, as long as we've saved it somewhere, we can reuse it; we don't have to go and embed all the documents again. Now, that might not be a big deal when we're using 10 or 20 text files, but if you had a thousand files that were quite long, you don't want to be doing that every time you launch your app. You want to save it somewhere and then just use it later on.
Okay. Once we've got this vector DB, we're going to make it a retriever. And just to show you, once it's a retriever, we can just say get relevant documents and pass in a query here. The queries I've got, the way I came up with them, is I just looked at the titles: they mentioned something about Databricks, something about the CMA, generative AI, something about Hugging Face, and one of them was about Pando or something. So that's where I basically came up with the questions from.

By default it's going to return four documents; in this case, I'm just going to use two, but you can play around with the number you want for this. If you are querying a lot of information, we often find that around about five is a good number, getting the top five. In the future, we'll look at things like multiple indexes, where you'd bring in different ones from multiple indexes as well, but here we're going to just set it back to two. So all I need to do is, on the retriever, just set k=2. The search type I'm using is similarity search, and you can see, if I look at the search arguments, that I've got the k equals two in there. So at this stage, my vector DB and my retriever are all set up.
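In code, the retriever step looks roughly like this; the query string is one of the examples above, and the variable names are mine:

```python
# Turn the vector store into a retriever. The default is to return 4 documents;
# here we limit it to the top 2 with k=2.
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

docs = retriever.get_relevant_documents("How much money did Pando raise?")

# Inspect what the retriever is configured to do.
print(retriever.search_type)    # "similarity"
print(retriever.search_kwargs)  # {"k": 2}
```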
Now I just want to do the actual language model chain part. So here I'm basically just going to make a RetrievalQA chain, and we're going to pass in our OpenAI model. We're going to do a stuff chain, where we just stuff the contexts in, because we know that in this particular case the two contexts, being about a thousand characters each, are going to be fine for length and things like that. We then pass in our retriever. And we're going to set return source documents equals true here. Now, I could set verbose equals true if we wanted to see what's going on in the background as this runs; in this case I'm not doing that, but you've seen me do that in a lot of the other videos, and that's something you can put into any chain if you want to see more about what's going on during the chain or during the agent.
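A sketch of that chain setup with the classic RetrievalQA helper; anything beyond what's described above, such as the LLM settings, is an assumption:

```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# "stuff" simply stuffs the retrieved chunks into the prompt, which is fine
# here because two ~1000-character contexts fit comfortably.
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    # verbose=True,  # uncomment to watch what happens as the chain runs
)
```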
Now I'm just going to make a little function here to take the output of these and print it out nicely, so we can see the result that we're getting back from the query and also what the source documents are.
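The helper might look something like this; the function name and formatting are my own, and it relies on the result and source_documents keys the chain returns when return_source_documents is true:

```python
def process_llm_response(llm_response):
    # Print the answer, then list where it came from.
    print(llm_response["result"])
    print("\n\nSources:")
    for source in llm_response["source_documents"]:
        print(source.metadata["source"])

# Example usage with the first query from the video.
llm_response = qa_chain("How much money did Pando raise?")
process_llm_response(llm_response)
```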
So here we come along and we ask our first query: how much money did Pando raise? Straight away, you can see the two source documents it brought up; one is about the supply chain startup Pando landing a $30 million investment, hence why I'm asking this, because it's pretty easy to check. And sure enough, it says Pando raised $30 million in a Series B round, bringing its total raised to $45 million. So that one's clearly done it, and we've got the sources here too. Originally these were just HTML files, so we could actually process this to have a link back to the document. If we had 10,000 articles and we wanted people to go back and see the original source HTML page, we could put that in here quite easily.
If we look into this a little bit: here, if I say, okay, what is the news about Pando, and I don't run it through the function for tidying it up, we can see that we get our result back. So the news is that Pando raised $30 million in its Series B, and the money will be used to expand its global sales; it tells us a bit more information. But we can also see that now we get the actual source documents back here, and we can see that this is the top document in this case, and this is the second top document. Now, this one seems to have the $30 million part, and it also says who led the round, those sorts of things. So if we ask who led the round, sure enough, it's able to do it quite easily.

Picking some other ones just to see that it's not always going to return the same thing: what did Databricks acquire? It tells us Databricks acquired Okera, a data governance platform with a focus on AI.
What is generative AI? Here we're getting the answer back from two different articles in this case. The reason I came up with that one is that there was an article about the CMA's generative AI review, so I was curious to see if that would come back, and it didn't. So when I ask who the CMA is, sure enough, I'm getting the answer: the CMA stands for the Competition and Markets Authority, and it gives us back the article about the CMA.

If we look at this chain, we can see the chain's retriever type is similarity, which we already know; this is just to show you that everything we set before has gone into this thing. And if we look further, we can see that the Chroma DB is our vector store there. If we look at the template here, we can see it uses the following pieces of context. So two things get passed in: the context, which is the two documents we're getting back from the query, and then the actual question, which is the query. "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know; don't try to make it up." So that's basically it.
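If you want to poke at those internals yourself, something along these lines works in the classic LangChain API; the exact attribute paths can vary between versions, so treat this as a sketch:

```python
# The retriever and vector store the chain was built with.
print(qa_chain.retriever.search_type)   # similarity
print(qa_chain.retriever.vectorstore)   # the Chroma vector store

# The default "stuff" prompt: context and question get filled in at query time.
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)
```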
Just to check that this is working, we can come along and zip our DB up, delete it, get rid of the vector store, delete the actual folder, and restart the runtime. You can see that when I restart the runtime and come back in, and I unzip it first, I need to put in my OpenAI key again. This time, though, I've gone for the turbo API for the language model part. So we set up the DB by just pointing at the persist folder, which was named DB, and we need to set the retriever here. We could actually just put that on the end there to make it easier, but anyway, that's just showing you what we're doing. And then here I'm setting up the turbo LLM so that we can use it if we want to.
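Roughly, that reload and model swap look like this; aside from the model name and persist folder, the details are my assumptions about the same classic API:

```python
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embedding = OpenAIEmbeddings()

# Point Chroma back at the persisted folder instead of re-embedding everything.
vectordb = Chroma(persist_directory="db", embedding_function=embedding)
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

# Swap the LLM for the chat-based GPT-3.5-turbo model.
turbo_llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
```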
Then we set up our chain again, now with the turbo LLM, so we're using the GPT-3.5-turbo API here. Everything else is exactly the same. Running it and asking the same question, sure enough, it gets the answer back. Now, if you look at the prompts for the version where we're using the turbo API, you'll find that just printing out the same prompt as before won't work; we'll run into issues. So here we basically have to look at the system prompt and the human prompt. This is the system prompt here: the first message is the system message, and it says use the following pieces of context to answer the question, and then we pass in the context. Then we pass in the question in the human part. So that shows you using the turbo LLM as well.
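With the chat model, the prompt is a pair of message templates rather than a single string, so printing it looks a bit different; again, the attribute paths are version-dependent, so this is only a sketch:

```python
from langchain.chains import RetrievalQA

# Same chain as before, just with the turbo LLM swapped in.
qa_chain = RetrievalQA.from_chain_type(
    llm=turbo_llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

messages = qa_chain.combine_documents_chain.llm_chain.prompt.messages
print(messages[0].prompt.template)  # system message: instructions plus the context
print(messages[1].prompt.template)  # human message: the actual question
```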
All right, this sort of gets us up to speed a little bit more on using a proper vector database, not just storing it purely in memory, but now getting it on disk. In some future videos, we'll look at using Pinecone, if we wanted to put it as an API somewhere that we can ping like that, and we'll look at using our own embeddings for the lookup rather than using the OpenAI ones. Anyway, as always, if you have questions, please put them in the comments below. If you found this useful, please click like and subscribe. I will see you in the next video. Bye for now.