Build a "Chat With HTML Docs” app using Langchain(TS/JS), AI SDK, Pinecone DB, Open AI & Next.js 13

Captions
There has been an explosion of AI-based semantic search apps all over the internet, and we built one such app in an earlier video using TypeScript. You can extend that kind of AI-based semantic search to other content, like HTML documents, and that's what we're going to build today: an AI-based semantic search app that talks to our HTML documents. For this tutorial we'll use the shadcn/ui docs section, which is an HTML document, and build the app so that we can simply chat with the document and ask it anything we want. A quick side note: we're not going to build this app from scratch like I did with my chat-with-PDF app; instead we'll clone the code of the PDF chat AI SDK app and build on top of it. If you want a deeper understanding, I'd highly recommend watching that video first and then coming back to this one. With that introduction, let's get our hands dirty.

Let's break the video into subsections. First we'll talk about the problem with existing AI-based semantic search apps; then we'll design the architecture of the app; after that we'll build the data collection layer; then we'll build the first iteration of the app and see how it works; then we'll make a small code change to improve the search results; and finally we'll build the last iteration of the app to chat with our HTML documents.

The problem with AI-based semantic search apps lies in data quality, which directly impacts search results. When you build a semantic search app, you typically prepare a knowledge store with data collected from your PDF documents or websites. This data is usually collected by loading or scraping your file and splitting the text content on some fixed number of characters. Although that kind of text splitting can work for text-heavy content, if we do the same on an HTML document we may not get the desired result. So what's the answer? Instead of splitting the text on a fixed number of characters, we split it on a title or subtitle together with the text content that comes underneath it. This lets us store whole chunks of text with their context intact; these are called context-aware text chunks. Context-aware text chunks offer better data quality, which directly improves the search results of a vector database. We'll see more details about this in the next section.

If you've seen my chat-with-PDF video, you'll recognize this architecture: the first step is to split the chunks, prepare our knowledge store, and store the chunks inside it. Once the knowledge store is ready, we use a LangChain LLM-chain sequence to talk to our database, use the information from the database as context, and have the LLM answer the question. The app we're building follows the same architecture, with one exception: we're going to use a Python library called unstructured.io for data collection and chunking. So why another language? Because the unstructured.io library isn't available in JavaScript yet; that's the reason I'm using Python. This Python library lets us scrape our HTML documents and split them into context-aware chunks. Once that's done, we pass the information to our vector database and do the same LLM chaining: first get the information from the knowledge store, then use it as context so the LLM can answer the question.

As I said before, we'll use Python for the data collection. I'm not a Python expert myself, so I'll use Python only for this step, where we prepare JSON data to feed into our Next.js app. I've used a Google Colab notebook so that you don't have to set up a Python environment locally; you'll also find a link to this notebook in the description below.
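Before opening the notebook, here's a minimal sketch of the chunk shape this data-collection step aims to produce: a context-aware chunk whose title travels with the body text as metadata. The field names follow LangChain's Document convention; the example values are made up.

```typescript
// A context-aware chunk keeps its heading with its body, so the chunk
// carries its own context into the vector database.
interface ContextAwareChunk {
  pageContent: string; // the title/subtitle plus the text underneath it
  metadata: { title: string; source: string };
}

// A fixed-size splitter would cut an HTML section mid-paragraph; a
// by-title splitter emits one chunk per heading instead.
const chunk: ContextAwareChunk = {
  pageContent:
    "Theming\nUse CSS variables or utility classes to theme your components...",
  metadata: {
    title: "Theming",
    source: "https://ui.shadcn.com/docs/theming",
  },
};
```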
For those who don't know, Google Colab is a free online Python notebook where you can run Python code easily. We can split the code into two parts: first we load the website and collect all the URLs from that page, then we use those URLs to load all the HTML pages and collect the data from each of them. Here you can see I've installed some requirements: two libraries, the first called bs4 (BeautifulSoup), which will parse the HTML for us, and another library to make the requests. What this code does is load the docs HTML page, collect all the anchor-tag links, and save them in a set so duplicates are removed; it also filters out URLs that point to external pages like Facebook, Google, etc. Once done, we dump them into a text file. If you run this code you should get a text file containing all the URLs under the docs, and it should look like this: the root document is /docs, and all the other files are segments on top of that path.

In the next step I take the text file we prepared, load all the HTML content, and split it into chunks by title. I then convert the chunks into the format required by LangChain's Document type and store them in an allGroups array. As I store the chunks, I also use the URL segments to prepare a title to put inside the metadata, which we'll use later. This is how we convert URLs into titles: if the URL is docs/theming, the title is "Theming"; if it's docs/components/accordion, it's "Accordion Components"; and if it's just docs, we keep the title "Summary". Once we've processed all the URLs, there's an additional step: we collect the titles, remove all duplicates, and store them in a variable called uniqueTitles. We then take these unique titles, create a new chunk under the title "Summary", and add it to the allGroups array.
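The URL-to-title conversion described above could be sketched like this. The helper names are my own; the actual code lives in the Colab notebook.

```typescript
// Map a docs URL to a human-readable title for the chunk metadata.
function urlToTitle(url: string): string {
  // Path segments after the leading "docs", e.g. ["components", "accordion"]
  const segments = new URL(url).pathname
    .split("/")
    .filter(Boolean)
    .slice(1);

  if (segments.length === 0) return "Summary"; // the root /docs page

  // Capitalize each segment and join them in reverse order, so
  // "components/accordion" becomes "Accordion Components".
  const words = segments
    .map((s) => s.charAt(0).toUpperCase() + s.slice(1))
    .reverse();
  return words.join(" ");
}

// Deduplicate the collected titles; these feed the extra "Summary" chunk.
function uniqueTitles(titles: string[]): string[] {
  return [...new Set(titles)];
}
```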
Why do we do this? So that we have a map of all the available pages in this HTML document. It's a simple way to improve data quality, so the language model has all the information it needs to answer a question. Finally, I use the titles array and the allGroups JSON array to prepare a data.ts file. This data.ts file will be used inside our Next.js app to populate our vector database. I've already run this code to prepare my data.ts file, so let's quickly copy the file, format it, and see how it looks. As you can see, it has the titles information, which we'll use later, and the JSON data we prepared from our HTML content. We'll take this JSON data, pass it to our embedding LLM, and store the result in our knowledge store. Now that our data is ready, let's see how we use it to build the app.

The app is going to be fairly similar to our PDF chat AI SDK app, so we can simply reuse that codebase, and the tech stack is also going to be similar, apart from the Python library and the data collection step. I'm going to quickly clone the codebase you see right here and copy the data.ts file into the scripts folder. After we clone the code and copy data.ts into the scripts folder, it should look pretty much like this: here is the data.ts file with our data, and the codebase looks like this. The first thing we're going to do is adjust the experimental AI imports and switch to the stable ai package; it no longer needs to be experimental. I do a quick npm install, and also npm install ai. In the meanwhile, I have a Pinecone index called test index, and I'm going to use my data.ts file to populate it. For that I go into the scripts folder. I no longer need the PDF loader, so I remove it completely, and I also remove it from lib since it's not required anymore. Then I import the data as docs from data.ts, so this docs array is our new data. We'll simply run this script and see if it works. Right before we do, we have to populate our .env file: I copy .env.example and rename it to .env. In that .env file I remove the Pinecone namespace and the PDF path, since we no longer need those two items (namespaces are no longer supported on Pinecone's free tier, which is why I'm removing that one), and for the rest I simply fill in the information I have at hand. These are my personal API keys, so fill in your own; I'm going to rotate them afterwards, so don't worry, you can't really copy them. This is how the .env file should look. With this, I think we can go ahead and run our npm run prepare-data command. All right, nothing is complaining, and that's a good thing. Now if I run the command, it seems to be working: I'm loading 216 chunks into my Pinecone index, and I can see the 216 chunks there. Great, this is working as expected, so let's clean up the codebase a bit.

The next thing I'm going to do is some cleanup. I go into components, and inside chat I remove the experimental AI stream and change it to the stable AI SDK, because all the latest updates are available in the AI SDK itself, so there's no need for an experimental one. I have the same thing inside the langchain.ts file, so I update all of that too. Then I come back to chat: the initial messages, which live inside utils, should be aligned with our document, so I update that. That looks good.
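For reference, here's a rough sketch of what the trimmed .env might look like. The variable names are assumptions based on typical Pinecone/OpenAI starters (the namespace and PDF-path entries are removed, as described above); check the repo's .env.example for the exact names.

```shell
# Hypothetical .env sketch; fill in your own keys.
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
PINECONE_ENVIRONMENT=...
PINECONE_INDEX_NAME=test-index
```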
Now I want to go back to chat-line.tsx. Instead of a hand-formatted text message I want to support Markdown, so I'm going to use react-markdown: rather than trying to format the text myself, I simply pass the content through, and I no longer need the formatted-message method, nor the convert-links method. To get proper Markdown colors on code blocks with Tailwind, we have to install the typography plugin, so I quickly install the Tailwind typography plugin, and once that's done I add it to my plugins list. Now I can simply add some extra classes to support both the prose and dark prose classes. Cool, so that's number one. Number two: I don't want an accordion anymore; I want badge links instead, so that I can click a badge and be taken to an external link. For that we first install the Badge component from shadcn/ui. Once that's done, I add a simple Badge component that passes the source into the href and uses the same source as the anchor-tag text. This looks pretty good. Now we have to adjust the data that is streamed back from the LangChain side. If you go to the then() method inside the langchain.ts file, we can adjust what we send back as part of the stream data. Right now we stream the page contents, which were shown in the accordion in the PDF chat AI SDK; for our app, we're simply going to send the source URLs instead. We take the source information, remove any duplicates, and stream it back to the client side. Back in the chat-line component, the source information is now an array of source URLs, and we no longer need the old accordion component, so I remove it. Everything looks good, so let's check that it works as expected: I run npm run dev, and this is our first iteration of the app.

We'll ask a simple question: "How do I install and use the alert component?" We'd expect this to fetch the data from the knowledge store, pass it as context to the LLM, and give us back the information. And it looks pretty good: you can see everything is working as expected, and we also get the external links; if we click one, it takes us to the external page. We have all the information we need. Let's ask another question. Yes, it seems to be working as expected. Now let's make it a bit tougher and start asking more ambiguous questions: "Can you summarize the document?" Let me refresh and ask the question again. You can see it's not picking the summary; instead it's somehow pulling information from Typography. For some questions we don't get the right answers, because the search isn't targeting the right places. This is a real problem: you may not hit it in every case, but in some cases you will. For example, if I ask "Is this a component library?", it says yes, it's a component library.
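Stepping back to the stream-data change from this iteration for a moment, sending back deduplicated source URLs instead of page contents can be sketched like this. The type and function names are assumptions; the actual code lives in the then() callback in langchain.ts.

```typescript
// Shape of a matched document coming back from the vector store.
interface MatchedDoc {
  metadata: { source: string; title: string };
}

// Collect each match's source URL and drop duplicates before streaming
// the array back to the client, where each URL is rendered as a badge
// link with the URL as its href.
function collectSources(docs: MatchedDoc[]): string[] {
  const sources = docs.map((d) => d.metadata.source);
  return [...new Set(sources)];
}
```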
That's because it's trying to get the answer from the About page, when it should really get it from the Introduction. So this is where we implement another technique to get the right answers and improve search on the vector database. There's an interesting feature called filtering with metadata: when you query a Pinecone vector database, you can also pass metadata as part of the query, and that's exactly what we want to do. For that, we're going to add an extra LLM chain that extracts filter information from the question. This is my approach; you may have your own ways to improve the search, but what I've done is implement a simple step that tries to derive metadata from the question, using the language model to do the job for us.

Back in the codebase, I add an additional template to our prompt templates, using the list of all titles. Let me explain what this template does: "You are an expert text classifier. Your job is to generate an array of strings that are within the context that best match the input question." I give it all the titles we have and ask the language model to check the question and pick which titles suit it best. I believe this is called few-shot prompting: you give the language model some examples, and it produces an output based on them. For example, if the input question is "Can you summarize the document?", the output should be ["summary", "about"]; and if the input is "How to install the alert component?", the output should be ["alert components", "installation"]. I'm guiding the language model toward the output I want, so this essentially gives back the titles as an array; I also instruct it that if the user asks something irrelevant, it should return an empty array. My input question follows the same format, and the output is based on these examples. This is not a foolproof method, but it's still an interesting technique for getting back the answers I want.

What do we do with this information? We go back to the vector store and use something like this: I want to look for matches on the title, checking for, say, these two values, "summary" or "about". Simple. I take the array I get back from the LLM and pass it as part of the metadata filter, and that gives me the answer I need. Let's set this up. I go to langchain.ts and use the prompt I just made: I import it from the prompt templates, and I also import LLMChain. I prepare a PromptTemplate from the template we just wrote, pass it to the chain's prompt, and name the chain metadataFilterChain. Then I call this chain with the question and get the answer back. Let's see if this actually works; obviously it will make the app a bit slower, because there's now an extra request to ChatGPT. I ask "Can you give me a summary?", which takes the question, passes it to the new chain we just made, and gives us back the output as an array. It's still giving the old answer, but if I go back to the code and check the metadata filter, you can see that it gave us back ["summary", "about"].
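A sketch of this metadata-filter step, with assumed names: the tutorial wires the prompt through LangChain's PromptTemplate and LLMChain, but here the prompt is shown as a plain template string, together with a helper that turns the model's answer into a Pinecone metadata filter.

```typescript
// Few-shot classifier prompt: given the question and the list of all
// titles, the model returns a JSON array of the best-matching titles
// (or [] if the question is irrelevant). The exact wording is an
// approximation of the tutorial's template.
const CLASSIFIER_TEMPLATE = `You are an expert text classifier.
Your job is to generate an array of strings, chosen from the titles in
the context, that best match the input question. If the question is
irrelevant, return an empty array [].

Context titles: {allTitles}

Input: can you summarize the document?
Output: ["summary", "about"]

Input: how to install the alert component?
Output: ["alert components", "installation"]

Input: {question}
Output:`;

// Parse the model's JSON-array answer and build a Pinecone "$in"
// metadata filter, e.g. { title: { $in: ["summary", "about"] } }.
// An empty array means "no filter", so the search falls back to a
// plain similarity search.
function buildTitleFilter(
  llmOutput: string
): { title: { $in: string[] } } | undefined {
  const titles: string[] = JSON.parse(llmOutput);
  if (titles.length === 0) return undefined;
  return { title: { $in: titles } };
}
```

The resulting object is what gets passed down as the filter option on the vector-store query.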
We pass metadata.text along, and inside the vector store I create a titleFilter and use it to prepare our filter: you can see we parse the title filter with JSON.parse, pass it into this method, and once that's done, I pass the filter down to my vector store. Before we go further, one really quick thing: we've also updated the QA template a bit. After updating the QA template, we give our chat a buffer memory: I add a simple BufferMemory, imported from langchain/memory. With this, our app is complete, and in the next section we'll run the app and see how it works.

Now that the app is complete, let's test it to see if it works as expected. I've got a list of ten questions that are a bit hard to answer, to see whether it does the metadata filtering properly based on the questions and gives us the right answers. Since there's an extra chain, it's going to be slightly more time-consuming. The first question is "Give me a summary." It's able to pick the right information from the database. If I show you what's happening, it's picking these two filters, summary and about, which means it goes through the data, checks for the summary information, and picks up everything under the titles Summary and About. Let me ask a question that failed in the past: "Is this a component library?" Here you can see it answers "No, this is not a component library... it's a collection of components." How is it able to answer that? Because of the filters we just applied: for this one the filter is summary, so it takes the answer from the summary and gives it back. Next I ask "What's this document about?" It still hits the summary, gets the information back from the database, and if I click the link it takes me to the right place. Now a trickier question: "What are the FAQs?" It gives me the right answer, and also where I can find the FAQs. Now some easier questions: "How do I install the alert component?" It answers the alert-component question properly. "How do I build a login form?" This one should typically look for the form component; let's see if it answers properly. OK, it answers roughly as expected: after I ask about forms, it knows to use the form components as the metadata filter and pulls all the information from the form component, which is why this works. Let me ask "Is there a frog component?" The frog component doesn't exist, and it correctly tells me there's no mention of a frog component. I ask another question about building an e-commerce card component, and it gives me an answer; it's working as expected. Finally, here's one edge case where it may fail: "What is the list of available components?" It gives me a list of all the titles I added, but the problem is it doesn't really know which ones are components and which are just normal pages. That's one problem I can think of, but for the most part it should give you the answers you expect.

So that's practically it: this is the AI-based HTML document search built on top of LangChain and Next.js. Thanks for watching this video; if you liked it, please leave a like and subscribe for more, and I'll talk to you soon.
Info
Channel: Raj talks tech
Views: 1,053
Keywords: chat with html, langchain, nextjs, ai app, chat using ai, typescript, pinecone
Id: S3S64iEjRzs
Length: 25min 55sec (1555 seconds)
Published: Thu Oct 12 2023