LangChain 101: Ask Questions On Your Custom (or Private) Files + Chat GPT

Captions
What is going on, good people! Today we have an extremely cool tutorial, because we're going to learn how to take custom text files, pass them through OpenAI's GPT-3, and ask questions of those text files. The reason this is so cool is that we're going to use your private, local text files. That means you can upload anything that ChatGPT can't already see, like your personal documents or notes, and actually ask questions of them. The reason I love this so much is that you can see how this is the starting point of some really cool capabilities. Today it's just text files, but you could imagine connecting your Google Drive, your Google Slides, your text messages, your WhatsApp, whatever you may want, and having it interpret that information for you and start giving you results back. So let's jump into the example; I have a fun one for us today.

What we're going to be looking at is a series of Paul Graham essays. If you don't know Paul Graham, he is a famous entrepreneur here in the States, and he is also an author: he has something like over 200 essays. I web scraped these, and you can see that they're basically just pieces of text. It's not formatted the best way, but each one has a date up at the top and the text down below. What I want to do is ask questions against these essays using ChatGPT, and I'm going to do that via the LangChain library. I took a subset of those essays, just the smallest ones, only because this gets kind of data intensive and can take a while, which I don't want to put you through. So we're going to take these short essays and run them through.

If we take a look at our IPython notebook, the first thing we're going to do is import a whole bunch of things. LangChain has some very helpful packages that help with this. We're going to get some OpenAI embeddings; I'm going to go over what embeddings are in another tutorial, but think of them as a way to compress data so that we can search our text a lot more easily later on. We're going to get Chroma, which is our vector store, where you store the embeddings. We're going to get a character splitter, which is going to split these documents up into individual chunks. We're going to get the vector database question-answering chain from LangChain, which is the easy way to ask questions of these things. And then we're going to get a directory loader. What's really cool is that the document loaders can load a whole bunch of things, and new support is being added every day; a directory loader is how we're going to load a directory of files, and you can also load just text files if you want. We're also going to import magic, os, and nltk for some supplemental help. I will say that as I was doing this right before the video, I had a heck of a time making sure all my packages were up to date, so if you have problems, try updating each of these sub-packages, because I ran into issues with all of them. And as always, make sure that your OpenAI API key is set and ready to go.
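For reference, here is roughly what that import cell looked like in the LangChain of early 2023. The library has been heavily reorganized since, so treat the exact module paths as a snapshot of that era rather than the current API:

```python
# Module paths as of early-2023 LangChain; newer releases moved several
# of these into langchain-community / langchain-openai.
from langchain.embeddings.openai import OpenAIEmbeddings   # text -> vectors
from langchain.vectorstores import Chroma                  # the vector store
from langchain.text_splitter import CharacterTextSplitter  # chunking documents
from langchain.chains import VectorDBQA                    # QA over a vector store
from langchain.llms import OpenAI                          # LLM wrapper
from langchain.document_loaders import DirectoryLoader     # load a folder of files

import magic  # file-type detection used by the underlying document loader
import os
import nltk   # tokenizers used when partitioning the text files
```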
The first thing we're going to do is instantiate our loader. ("Name 'loader' is not defined": that's great, let me load my packages first.) We point the directory loader at the relative path from this IPython notebook to the folder of small Paul Graham essays. The glob argument means you're only going to take the .txt files; if you're using an IPython notebook, it might throw some other files into that folder, which can cause problems, so that will be your fix. Then we actually load the documents: before we just got the loader ready and staged, and here we load for real. If we take a look at what this gives us, you can see these are the individual essays that he wrote, which is cool.

It's a whole bunch of text, but this text is still too long, and we need to do something about that. So we're going to initialize a character splitter, which is just going to chunk our essays up into meaningful parts. The way I heard Harrison describe this: imagine you had a page with way too many characters on it. You might split that page up by chapters; if there are too many characters in the chapters, you might split it up by paragraphs, then by sentences, and you might even split by word if you really need to go down that deep. Here we're just going to use a chunk size of 1,000 characters, and we're not going to overlap our text, because that's not important for our use case. Let's run that; this initialized the character splitter, and now we actually do the splitting, which is fairly quick. If we take a look at the result, it's the same content, but the documents are a whole lot smaller now, because each one is capped at the thousand-character limit. So now we have individual chunks of documents that we can reference. Cool.

The next thing we need to do is create embeddings of these. Embedding means taking a string and turning it into a point in a vector space, which is just a fancy way of saying a list of values that represents each of the documents you're looking at. Right now we've just initialized our embeddings engine; next we're going to use Chroma and say, "hey, we need you to make a vector store from these texts using this embedding engine." When we run that, it actually goes and queries the OpenAI embeddings API. You can search for OpenAI's embeddings API and read up more about it; there is pricing associated with it, and you can see that what it returns is basically a vector, which is just a list of numbers. You can go check that out on your own time. Okay, cool, so it did this for us. Unfortunately we can't see the stored vectors right now; I wish there were an easier way to see them, and if anybody finds a way to see the actual vectors in there, let me know.
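Putting those steps together, here is a minimal sketch of the data-prep pipeline. The folder path is hypothetical (point it at wherever your essays live), and the signatures reflect the early-2023 LangChain described in the video:

```python
# Stage a loader for every .txt file under the essay folder
# (this path is illustrative; use your own relative path).
loader = DirectoryLoader('data/PaulGrahamEssaysSmall/', glob='**/*.txt')
documents = loader.load()

# Chunk each essay into ~1000-character pieces with no overlap.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Initialize the embedding engine (it reads OPENAI_API_KEY from the
# environment) and build a Chroma vector store from the chunks. This
# is the step that actually calls OpenAI's embeddings API.
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
```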
Now we're going to initialize our model. We pass it our language model, which is going to be OpenAI, and we're going to use a chain type of "stuff." What this means is it's going to find the most relevant chunks from all the ones we just broke up and stuff those into OpenAI's prompt. This isn't so bad, because you only make one API call and you only submit the prompt once. But imagine you had a really, really long set of documents: then you might want to pass those to OpenAI, have it refine an answer a little bit, then use that output to pass through again and refine even more as you look for your answer. I'll encourage you to go look at the other chain types in LangChain's documentation; if you want a video on that, let me know, and I'm happy to step through it. And here's the really important part, because this is the information source: we pass in the docs we just made with Chroma as the vector store. We'll save all of that into our QA chain.

We did all that work, got our data prepped, got everything ready, just to do this one thing. If you remember, we had a series of essays that we wanted to query with a question, and now I'm finally ready for one. I know the essays speak about a fellow named McCarthy, who came up with the Lisp language. So, what's really cool, I can go ahead and run "What did McCarthy discover?", and in the background LangChain is going to query OpenAI with the personal, private docs we've just uploaded, and we get an answer: "McCarthy discovered how to use a handful of simple operators and a notation for functions to build a whole programming language, which he called Lisp." That's pretty cool. Say this were even more private data that wasn't on the web anywhere: you could go ahead and use it with OpenAI to answer your own questions.

Now I want to show you one more cool thing, which is how you can attribute sources to your answer. Remember that we split our documents into chunks? Well, what LangChain will help do is say, "hey, these chunks were the most helpful in getting you the answer," and once you have those chunks you can start to understand why it used what it did and what information it pulled. I'm going to make the exact same call as above; the only difference is that I'll set return_source_documents to true. When you do that, there's a slightly different syntax for the query, but it's still the same query: using the VectorDBQA chain we had above, I run the query and store the output in a result. Looking at the result, it's not exactly the same as what we had above (the outputs are a little probabilistic), but the really cool part is when you ask for the source documents. It tells me, "hey, these are the documents we used in order to answer your question." Let me convince you of that: if I search for McCarthy, you can see that these documents do talk about McCarthy. We uploaded maybe ten essays, but it knew these were the best ones to pick, which is pretty cool.
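Both query styles look roughly like this with the era's VectorDBQA chain (later LangChain releases deprecated it in favor of RetrievalQA, so take the exact class and call shapes as a period sketch rather than the current API):

```python
# Build the QA chain: chain_type="stuff" packs the top-matching chunks
# straight into a single prompt, so only one completion call is made.
qa = VectorDBQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    vectorstore=docsearch,
)

query = "What did McCarthy discover?"
print(qa.run(query))  # returns just the answer string

# The same chain, but also returning the chunks that backed the answer.
qa_with_sources = VectorDBQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    vectorstore=docsearch,
    return_source_documents=True,
)
result = qa_with_sources({"query": query})  # dict-style call instead of .run()
print(result["result"])             # the answer
print(result["source_documents"])   # the chunks it drew from
```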
So that's where you can show your source documents. I hope you had an awesome time with this one; please let me know what other videos you'd like to see. You can see how this would be the very beginning of some really cool functionality. In other videos we're going to show how you can upload different data, whether it's Google Docs, Notion, or Excel spreadsheets, whatever the heck you'd like, and answer your own questions from there. Thanks, crew!
Info
Channel: Data Independent
Views: 65,961
Id: EnT-ZTrcPrg
Length: 10min 11sec (611 seconds)
Published: Thu Feb 16 2023