LangChain101: Question A 300 Page Book (w/ OpenAI + Pinecone)

Captions
Hello good people! In this tutorial we are going to be querying a book. What does that mean? It means we're going to ask questions of a 300-page book and get back answers that come straight from the source and are contextually relevant. We'll use OpenAI and we'll use LangChain, and the really cool feature this time is that we'll use a vector store hosted in the cloud. Up until now I've shown you how to store embeddings locally; in this case we're going to use Pinecone, an external vector store that LangChain supports natively, which is going to be really cool to see.

Let's check out the book we'll be looking at today: The Field Guide to Data Science. You'll notice it was made back in 2015, when data science was still a relatively new term, and there's a lot of really dense, useful information in it. If we scroll to the last page, it's around 123 pages, so let's ballpark it there. What we're going to do is load this book into LangChain using their Unstructured integration to read the text of the PDF, then ingest it and do a bunch of cool things with it.

As always, I want to start with a diagram that shows what this looks like without code, because I think that makes it a little easier to understand. Over here we have the PDF we're going to ingest, The Field Guide to Data Science. LangChain will help us split it up into a bunch of documents, and you can think of documents as just pieces or chunks of text; they can be small chunks, big chunks, whatever you'd like. We'll use OpenAI's embedding engine to turn those documents into embeddings. Embeddings are literally just a list of numbers; you can think of each one as a big fat vector. In fact, the OpenAI engine we'll be using represents the semantic meaning of each document with about 1,500 numbers, meaning not what the words say per se, but what the document means. The cool part is that we're going to store them on Pinecone. Instead of keeping them in a local database on your computer, we store them externally on Pinecone, so they're easy to retrieve later, they persist across sessions, and you can even make the database available for other people to query; you can start to see how you get shared memory for querying this book.

Then we're going to ask a simple question like "What is a data scientist?" That question also gets turned into an embedding, again using OpenAI's engine, and Pinecone will say: based on the embedding you just gave me, here are the two documents that are most important to answering that question (technically, the two documents most similar to the question you asked). In this case, that's embeddings one and three. LangChain then takes those two documents, uses OpenAI, and actually answers the question for us. So what is a data scientist? Spoiler: we're about to find out.
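Before jumping in, to make the "list of numbers" idea concrete, here is a minimal sketch (mine, not shown in the video) using the langchain 0.0.x-era API the tutorial is built on; it assumes text-embedding-ada-002, the default OpenAI embedding model of that period, which returns 1,536 numbers per input:

```python
from langchain.embeddings.openai import OpenAIEmbeddings

# Assumes OPENAI_API_KEY is set in your environment.
embeddings = OpenAIEmbeddings()

# A question is embedded the same way the book's chunks will be.
vector = embeddings.embed_query("What is a data scientist?")

print(len(vector))  # 1536 dimensions for text-embedding-ada-002
print(vector[:5])   # just a plain list of floats
```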
All right, that's the diagram portion; let's jump over to some code and see how we're actually going to do this. Let me zoom in just a little bit. Up at the top we're going to import some document loaders to start us off: the UnstructuredPDFLoader as well as the OnlinePDFLoader, just to show you both options. Then I'm going to use the recursive character text splitter to split up my documents. Let me go ahead and run this. Cool, those are loaded.

Next I call loader = UnstructuredPDFLoader(...). You'll see I'm loading from a local file here; I've commented out how you could do this from a URL as well, but because this is a large book I'm using the local file to speed things up a little. I know from experience that once I call load() this will take quite some time, so I'm going to pause the video and jump ahead to the cool part. See ya!

Okay, now our data is loaded. I want to show the length of the data, to see what LangChain actually loaded, and then how many characters are in our first piece of data. You can see that I have only one document, and it contains about 176,000 characters. That is way too much to put into a single prompt to OpenAI, so the first thing we need to do is break that one huge document into a bunch of small ones. I'll use the recursive character text splitter with a chunk size of 1,000 (play around with this to see what works best for you) and a chunk overlap of zero, since we don't need any overlap here, and pass in the data we loaded. That was pretty quick, and now we have 228 documents instead of one. If we look at the first one, it's a Document containing "The Field Guide to Data Science, second edition." There's some funky formatting, but I don't think it will be an issue for me; if you're doing this in a professional setting, it might be a different story.
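Here is the load-and-split step just described, as a minimal sketch using the langchain 0.0.x-era imports from the video; the local file path and the URL are placeholders for wherever your copy of the PDF lives:

```python
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Local copy of the book (faster for a large PDF).
loader = UnstructuredPDFLoader("field-guide-to-data-science.pdf")  # placeholder path
# The online alternative, commented out in the video:
# loader = OnlinePDFLoader("https://example.com/field-guide.pdf")  # hypothetical URL

data = loader.load()  # this is the slow step for a big book
print(f"You have {len(data)} document(s)")             # 1
print(f"with {len(data[0].page_content)} characters")  # ~176,000

# Break the single huge document into ~1,000-character chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)
print(f"Now you have {len(texts)} documents")          # 228 in the video
```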
you're going to pass your environment that your Pinecone instance is in and then you're also going to pass your index name now I called mine Lang chain one here but what I want to show you is how to do this on Pine Cone it takes a couple clicks which is why I wanted to set it up with you so I'm going to create my first Index this is not actually my first index I'm going to say Lang chain two and with the open AI embeddings I know that they have 1536 different dimensions that come with it and with regards to the metric that we pick here I'm going to pick cosine for now you can play around with these and see if there's a better one for your use case for the Pod type I'm going to go faster queries I'm on the free tier anyway I'm going to go ahead I'm going to create index here I'm going to take this name of Lang chain 2 as my index I'm going to go ahead and pass it into my index name over here let's go ahead and run that cool looks good now um what we're going to do is I'm going to actually pass in uh the embedding I should say I'm going to get the embeddings and then pass them over to Pinecone I'm doing a couple that a couple of stages of those in a one-liner here so I'm getting the texts from all my documents I'm passing the embeddings engine that we made up above with the open AI engine and then I'm going to pass the index name um sweet all that looks good let's make sure that we're ready yes we are and so let's go ahead and run those up so what this is doing right now is it's taking all those texts it's creating embeddings about them and it's take and it's passing those up to Pinecone which is the r external data source which is sweet here now if we were to go over here and we're going to look at uh metrics there's no data for the selected range because we it still hasn't uploaded all the information that we wanted to get but let's see if we can get this going here Vector count uh it's kind of difficult to see but you can see here we have 160 I we just did this okay so now you can see it's starting to populate with some data here but basically we have as many vectors as we had documents so we had 228 documents we have 228 vectors here as well let's see if this will give us a better time here anyway you can see how it went up because we just loaded those up there so now what we're going to do is we're going to ask questions to our book and so we're going to say what are the examples of good data science teams and then when we do doc search which is the document store that we were querying beforehand it's going to do a similarity search we're going to pass out our queries and you don't have to include metadata but I did it here just to just to show it here so let's go ahead and do that and then what it did is it went up to Pinecone and said hey what are the docs that are similar to The query that we just did here and so it passed back one two three four uh five documents and I believe it did five documents here yeah because the similarity search by default it will return the five documents for you and so these are the documents from the PDF that Pinecone says have the highest cosine similarity with um with our question that we had there now this is cool but we want to get this question answered to us in natural language so I don't want to do anything with these docs or I'm not going to do anything with those docs what I'm going to do is I'm actually going to import openai I'm going to import the question and answer chain from link chain I'm going to get my open AI ready and then I'm going 
Now this is cool, but we want the question answered for us in natural language, so I'm not going to do anything with those docs directly. What I'm actually going to do is import OpenAI and the question-answering chain from LangChain, get my OpenAI LLM ready, and use the "stuff" chain type, which means it will take all those documents, put them into the prompt, and then finally answer the question for us. Let's go ahead and run that. We query "What are good examples of data science teams?", just like we had above, and then here's the magical part: we call chain.run and pass in the docs, the relevant documents from before, along with the query, the question we asked. And here is the natural-language response we get from OpenAI: "Good data science teams are multidisciplinary teams of computer scientists..." and so on. Cool.

Well, what if you wanted to ask a different question, something pretty esoteric, or nuanced to the book itself? I think this would be a good one: "What is the Advise stage of data maturity?" This might be tough if I just asked OpenAI cold; it might not know what I'm talking about. But because I'm querying straight from the book, it answers correctly: the Advise stage of data maturity is when organizations can define their possible... and so on. I could ask about the Collect stage, run that, and get: the Collect stage focuses on collecting internal documents or external data sets. So all of a sudden we've just asked questions to the book itself. Very, very cool.

And you can do this across multiple disciplines: more books you want to load up (keep in mind it gets more expensive the more data you embed), your internal documents, or a chatbot that needs to answer specific questions by referencing external embeddings or other documents. Awesome. Let me know if you have any questions, and please don't forget to subscribe and thumbs-up this video if you want more. Thank you very much!
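For reference, here is the question-answering step just described as a minimal sketch, continuing from the previous snippet (it reuses docsearch) and using the same old langchain 0.0.x API; chain.run with input_documents and question is the call pattern shown in the video:

```python
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# temperature=0 keeps answers close to the retrieved text.
# Assumes OPENAI_API_KEY is set in your environment.
llm = OpenAI(temperature=0)

# "stuff" means: concatenate all retrieved chunks into one prompt.
chain = load_qa_chain(llm, chain_type="stuff")

query = "What is the Advise stage of data maturity?"
docs = docsearch.similarity_search(query, include_metadata=True)

# Answer from the book's chunks rather than the model's memory alone.
print(chain.run(input_documents=docs, question=query))
```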
Info
Channel: Data Independent
Views: 80,788
Id: h0DHDp1FbmQ
Length: 11min 32sec (692 seconds)
Published: Mon Feb 27 2023