Talk to Your Files: Conversational AI for Any Folder of Documents with Langchain in Node.js

Captions
In this video I'm going to show you how to use LangChain in Node.js to have a natural language conversation with a folder of documents. By the end of this, you'll be able to simply drag whatever documents you want into this folder, run the script, and query it with questions. This is sort of a step two to an initial video I did on LangChain, so if you're new to LangChain, I encourage you to check out that video, which I'll link here and in the description. But if you've watched it, or feel comfortable with LangChain and are just curious how to use document loaders or how to calculate costs for the embeddings endpoint, continue along.

The first thing we'll do to get this up and running is head over to the OpenAI API website: go to platform.openai.com, make an account, go to the top right-hand corner, view API keys, and generate an API key. While we're getting set up, we'll also create a couple of files with touch index.js .env. I'm not going to run that since I already have it set up on my end, but go ahead and do that to get those files in place. While you're in the terminal, you might as well run npm init -y to get a package.json, and then npm i the libraries we'll need. Finally, in package.json, make sure to add one line, "type": "module" (you might have to add a comma first), because we're going to be using imports in this example.

Once you have the API key, head into your .env file and create a variable called OPENAI_API_KEY and paste the key in. After that you can close .env; we won't need it again in this example. You can also close package.json; just take a look first to make sure you have the four libraries we'll be requiring and, like I mentioned, the type module line.

Now head over to index.js. If you already have some files in mind that you want to query with natural language, you can create the documents folder now, or leave that to the end; I'll leave it to your discretion. In my example I just have a few files, some bits of information about LangChain generated from GPT-4, that we'll be able to interact with by the end of this. So we have multiple types of data, and we're going to build an application that simply goes through whatever is in this documents folder and makes it work.

The first thing I'll have you do is import the document loaders for the different file types. I'm going to be using JSON, text, CSV, and PDF. There are a handful of others, and more are being added over time; I took a quick look before this, and there are things like EPUB that you can add, among others.
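Here's a rough sketch of that setup. The video doesn't spell out the exact install list, so the package names below, and the mid-2023 LangChain JS import paths, are my best guess:

```js
// Hypothetical setup commands (package names are assumptions):
//
//   touch index.js .env
//   npm init -y
//   npm i langchain hnswlib-node @dqbd/tiktoken dotenv
//
// The PDF loader also needs its peer dependency: npm i pdf-parse
//
// .env (never commit this file):
//   OPENAI_API_KEY=sk-...
//
// package.json also needs: "type": "module"

// index.js — document loader imports, one per file type we support.
// These paths match the LangChain JS layout of the time; newer
// releases moved the loaders into @langchain/community.
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { JSONLoader } from "langchain/document_loaders/fs/json";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { CSVLoader } from "langchain/document_loaders/fs/csv";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
```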
One thing to note with some of the other loaders is that you might have to install an additional library to use them: I think for docx it requires something called mammoth, and there's a host of others for some of the other file types that you'll have to npm install as well.

The next thing, after we have all of our LangChain loaders set up, is to require a handful of things that I've kept visually separate from the document loaders so you can see what's going on. We're going to use the OpenAI model as our LLM in this example, and the RetrievalQAChain to actually query what we embed at the very end of this bit of code. We're going to use the HNSWLib library, which is how we'll store our vectors locally. If you've ever tried to set up an external database, you know it takes a couple of extra steps, so here I'm just going to show you how to get up and working locally; you can swap this out for an external database if you want, or use an in-memory store if you're not embedding very large documents. Just know it's an interchangeable part, so if you don't like that it's local, you can swap it out later.

Then we'll use OpenAIEmbeddings for our embeddings, and the RecursiveCharacterTextSplitter. What we're doing there is splitting the text into chunks, because the embeddings endpoint can only handle so many characters at once; as you can imagine, it's not just going to take megabytes and megabytes of files, so instead we send it small chunks. Next we have the tiktoken library, a Node version built off the Python library that OpenAI references for how they calculate tokens. I'm not going to spend a lot of time on this, and I don't want people to get hung up on it: we're simply using it to calculate the cost of what we send to the API, so before we even query, we'll have an approximate cost of how many tokens we're about to send across. Like I mentioned, I'm going to limit it to a dollar in this example; if you want to lower or raise that for your use case, feel free to change it once we get to that section. From there, we simply import dotenv and configure it, which is how we reference our OpenAI API key, as you might imagine, and fs to read and write to our file system.

Once we've done that, we initialize our document loader. There are sort of two pieces here: the imports we had at the very beginning, and then referencing those loaders again when we construct the directory loader. One thing I noticed in the documentation is that you might see a second argument on the JSON loader, something like "/text"; in this example we're just going to remove that second argument to make this work without issues. If you wanted to add another type, say .epub, you could add it here with the EPUB loader, just remember to add the import at the top, and if there are any additional dependencies, make sure to install those too.

Once we've done that, we load the documents from our specific directory. This is actually invoking it: it's going to go through this directory and load everything that's here.
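Putting that together, a sketch of those remaining imports and the directory loader might look like this (again assuming the mid-2023 LangChain JS import paths):

```js
// index.js, continued — the rest of the imports described above.
import { OpenAI } from "langchain/llms/openai";
import { RetrievalQAChain } from "langchain/chains";
import { HNSWLib } from "langchain/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import * as dotenv from "dotenv";
import fs from "fs";

dotenv.config(); // makes OPENAI_API_KEY from .env available to the SDK

// Map each file extension found in ./documents to a loader. The JSON
// loader accepts an optional second argument (a pointer like "/texts")
// to pluck specific keys; we drop it here, as in the video.
const loader = new DirectoryLoader("./documents", {
  ".json": (path) => new JSONLoader(path),
  ".txt": (path) => new TextLoader(path),
  ".csv": (path) => new CSVLoader(path),
  ".pdf": (path) => new PDFLoader(path),
});

console.log("Loading docs...");
const docs = await loader.load(); // walks the folder and loads every file
console.log("Docs loaded.");
```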
I just have three files in here, but hypothetically you could have hundreds once you have this set up.

Next is the cost calculation. Again, I don't want to get hung up on the nitty-gritty of it, but the tl;dr is that we reference the embeddings model we'll be using with OpenAI, and then put in the rate per thousand tokens, which is what's shown in their documentation as the cost of the embeddings endpoint.

From there, we declare a vector store path. You can change this to whatever you want: it's essentially your database, so you can think of it that way, and you can be more specific with the name. If you were querying the Lord of the Rings series, say, you could name it accordingly.

Then we create a function that we'll use a little later: it normalizes the documents returned from our loader. The loader returns documents in a JSON format, and we normalize that to remove the new lines and make sure we're just sending a string across to the embeddings endpoint. Just to circle back, because I sort of misspoke there: you could actually stringify the JSON and send it across, but it's probably better practice not to send all those extra characters if you don't need to, and normalizing is a way to help save on tokens.

Okay, the next thing we do is set up our run function. In this example I'm going to throw a lot of our code in here; you could break it out and make it a bit more modular if you want, but this is where we do the lion's share of the work. Once we have the run function declared, we await the calculateCost function we wrote above, which leverages that tiktoken library, and from there we get a sense of what the cost will be for what we're about to do. Once we have the cost declared, we set an acceptable limit; in this example I used a dollar, but you can use another value, maybe higher or lower. So we'll have a condition for either running it or saying it's too expensive.

Once we have that set up, we initialize the OpenAI language model; you can use other models if you want, but we're just going to set this up using OpenAI in this example. From there we declare a variable for our vector store, and depending on whether we've already created the vector store and run the embeddings, it runs different code. What do I mean? If the vector store already exists with that name, it asks the question of the local version without going off and embedding again. So if you just keep documents.index as your vector store and continually ask questions, changing the query each time, you can do so without having to re-embed and run up that extra cost. So from there, like I mentioned, we check for an existing vector store, with separate conditions: if it notices there's a local vector store at that path, it loads and references that one; otherwise it goes ahead and uses the embeddings logic.
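As a sketch, the cost estimate and the normalizer might look something like this. The $0.0004-per-1K-tokens ada-002 rate was OpenAI's published price at the time, and the QUESTION constant and exact function bodies are assumptions:

```js
import { encoding_for_model } from "@dqbd/tiktoken";

// ada-002 embeddings were priced at $0.0004 per 1K tokens at the time;
// check current pricing before trusting this constant.
const EMBEDDING_MODEL = "text-embedding-ada-002";
const RATE_PER_1K_TOKENS = 0.0004;

// Where HNSWLib persists the index on disk — rename to suit your data.
const VECTOR_STORE_PATH = "documents.index";

const QUESTION = "Tell me about these docs";

// Pre-flight estimate: count tokens across every loaded document and
// convert to dollars, so we can bail out before embedding anything.
async function calculateCost() {
  const enc = encoding_for_model(EMBEDDING_MODEL);
  const tokenCount = normalizeDocuments(docs).reduce(
    (sum, text) => sum + enc.encode(text).length,
    0
  );
  enc.free(); // the WASM-backed encoder must be freed explicitly
  return (tokenCount / 1000) * RATE_PER_1K_TOKENS;
}

// Loaders return Document objects; flatten each pageContent to a plain
// string so we don't pay tokens for stray JSON punctuation or newlines.
function normalizeDocuments(docs) {
  return docs.map((doc) =>
    typeof doc.pageContent === "string"
      ? doc.pageContent.trim()
      : JSON.stringify(doc.pageContent)
  );
}
```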
That embeddings logic lives in our separate condition, and it only runs the first time through, before a vector store exists. While I'm here, I'm just going to delete my existing documents.index folder so we don't get confused: this run is going to recreate that folder and the documents within it. So it creates a new vector store; we declare the chunk size of what we're going to send to the API, we call the normalizeDocuments function like we went through, to clean up that JSON, and then we actually invoke the splitting.

Once we have that set up, we generate the vector store from the split documents for the first time, and then save it locally. One thing I should have mentioned, since I went through this pretty quickly: this is actually where we query the ada-002 embeddings endpoint. We send our split docs to the endpoint, and that's how we embed and create those vectors.

Once we have that, we create our RetrievalQAChain, and this is how we actually query the documents within our vector store: we pass in our OpenAI model and use the vector store as a retriever. With that, we can query it with our question, like we have all the way up at the top: "tell me about these docs". You can obviously be much more specific if you want. And finally, circling back to the condition: if the cost exceeds a dollar, or whatever limit you set, we just exit out of the program.
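Pulling all of that together, a minimal sketch of the run function might look like this. The console messages and the $1 cutoff mirror the video; exact option values, like chunkSize, are assumptions:

```js
// A sketch of the run function tying the pieces together.
const run = async () => {
  console.log("Calculating cost...");
  const cost = await calculateCost();
  console.log(`Estimated embedding cost: $${cost.toFixed(4)}`);

  // Too expensive? Exit before touching the embeddings endpoint.
  if (cost > 1) {
    console.log("Cost exceeds $1 — exiting.");
    return;
  }

  const model = new OpenAI({}); // picks up OPENAI_API_KEY automatically

  let vectorStore;
  if (fs.existsSync(VECTOR_STORE_PATH)) {
    // A saved index exists: load it and skip re-embedding entirely, so
    // repeat questions don't run up additional embedding cost.
    console.log("Loading existing vector store...");
    vectorStore = await HNSWLib.load(VECTOR_STORE_PATH, new OpenAIEmbeddings());
  } else {
    // First run: chunk the text, call the ada-002 embeddings endpoint
    // (inside fromDocuments), and persist the index locally.
    console.log("Creating vector store...");
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000, // assumed value; tune for your documents
    });
    const splitDocs = await textSplitter.createDocuments(
      normalizeDocuments(docs)
    );
    vectorStore = await HNSWLib.fromDocuments(splitDocs, new OpenAIEmbeddings());
    await vectorStore.save(VECTOR_STORE_PATH);
  }

  // Wire the LLM to the vector store and ask our question.
  console.log("Creating retrieval chain...");
  const chain = RetrievalQAChain.fromLLM(model, vectorStore.asRetriever());
  console.log("Querying chain...");
  const res = await chain.call({ query: QUESTION });
  console.log(res.text);
};

run();
```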
Once we have that, we just run it: node index.js. We'll start to see the console logs we went through. It loads the docs (we'll ignore some of the warnings about methods we're using in Node.js), the docs are loaded, it calculates the cost, and the cost, as we can see, is nominal, a fraction of a penny. It creates the vector store, and now we see documents.index being generated on the left-hand side. Then it creates the retrieval chain, and finally it queries the chain and we see our information here: LangChain's document retrieval capabilities, et cetera, et cetera.

Like I mentioned, you can really get creative with this, so throw all sorts of things in here. I just threw in an ENERGY STAR document, something from Apple that I happened to have locally on my machine, so I could ask questions about that too; I think it showed up when I accidentally undid that deletion by hitting Command-Z.

Hopefully you found this useful. If you did, please like, comment, and subscribe. I plan on creating quite a bit more LangChain content over the course of the next month at least, so if you enjoyed this type of content, let me know in the comments, and if you have any ideas or things you'd like to see, let me know there too. Most of my ideas for videos come from users; I'm just trying to triage and get through as much as I can of what individuals are looking for in terms of content. So, without further ado, until the next one.
Info
Channel: Developers Digest
Views: 5,328
Keywords: Langchain, Node.js, Conversational AI, Document Retrieval, Document Directory, AI Chat, Natural Language Processing, NLP, Machine Learning, Artificial Intelligence, Document Analysis, Text Analysis, Information Retrieval, Document Management, File Management, Intelligent Search, Knowledge Extraction, Data Science, Automation, Interactive AI, Language Model, Document Processing, Content Analysis, AI-Powered Search, Directory Conversations, Document Interaction, AI Dialogues
Id: EFM-xutgAvY
Length: 18min 4sec (1084 seconds)
Published: Tue May 02 2023