Discover LlamaIndex: Bottoms-Up Development With LLMs (Part 2, Documents and Metadata)

Captions

Hey everyone, Logan here from LlamaIndex, back with another video in our series on bottoms-up development with LlamaIndex. We're on a quest to build a chatbot using the LlamaIndex documentation, but in a bottoms-up, low-level way. Last episode we covered how to use LLMs in LlamaIndex, and in this video we're going to cover how to load data and create completely customized document objects in LlamaIndex.

So in LlamaIndex there are documents and nodes. They're essentially the same, but the main difference is that a document is intended to represent the entire document, whether that's an entire page in a PDF, the entire PDF, an entire Word file, etc. When you insert these documents into an index, they are broken down into nodes, which are basically smaller chunks of that original document used for retrieval, question answering, and whatnot. These documents and nodes can have a few different attributes, such as metadata, which can be something like a category or a file name, as well as relationships, which can be links to other nodes or documents. So for example, when you insert a document into an index, the nodes contain a reference to the parent document ID, and that's an example of a relationship.

The actual usage for creating a document is super simple: we can import it, create it, give it some text. We can also import a SimpleDirectoryReader, read a directory of data, and get a list of documents. So if we want to load all the data from our documentation to create a chatbot, we would just use the SimpleDirectoryReader and could load everything in our docs.

We can also customize the documents quite heavily, actually. Some advanced usage here: like I mentioned before, we can add metadata, giving it in this case a category. We can also tell LlamaIndex to only use certain metadata for certain parts of LlamaIndex, so we can say we only want the embeddings to look at this metadata, or we only want the LLM to look at that metadata. We can see here that we're saying we don't want the LLM to read the category metadata, so when this document or its nodes get sent to the LLM, the LLM doesn't know the category.

On top of that, we can customize the representation of the metadata when this document gets transformed into a string. We can set the separator between each metadata field in the dictionary, and we can set how each key-value pair is formatted; here I've customized it with a little arrow. Then, when the metadata gets inserted next to the text, we can add a template for what that looks like, and here I've added a little label for the metadata, a little line divider, and a little label for the content. Super simple, but this allows for some pretty, I would say, complex representations and customization in LlamaIndex.

So that's how to create documents, and now we're actually going to cover how we can create documents for the LlamaIndex documentation. I have a notebook here to cover that. What I've done ahead of time is create a custom document loader for the markdown docs in LlamaIndex. We're only going to worry about the markdown documents because we just want to keep the scope a little narrow, and we should still be able to get a useful chatbot with just the markdown documents in our documentation. I've just made a function here to basically go through and parse the markdown, which is a super structured format: we know where the headers are, we know where the code blocks are, and there's just a bunch of handling for that in here.

Now, while LlamaIndex does have a built-in markdown reader, building this on your own, one, gets you more familiar with how to create documents, and two, if we ever need to customize further how these documents are loaded, we have direct access to that. For instance, one customization that I did make was that all the code blocks keep track of the paragraph above them, because often that paragraph is introducing the code block. So it's kind of an extra piece of reference text that we can use later on, maybe, to help the question-answering process work better.

So, just a quick demo of how this actually works. I'm going to append to my path here so I can load from my little folder with my script. I made a quick helper function here to load the markdown documents, and here I'm just saying exclude everything except for markdown documents; we only care about those. Then here I set the custom loader that I was just showing, so basically what this does is set it up so that for .md files it will use my loader. Lastly, I set recursive equal to True. Our documentation is a little nested, so it's going to recursively go through every folder and file and find every markdown file. It's as simple as that.

I have a folder of all the documentation, every folder from our documentation, and I'm loading it into separate lists of documents because each folder captures a very specific part of LlamaIndex. If you're building a chatbot over the documentation, it can be helpful to sort this ahead of time to make question answering easier later. So, load those documents; it's pretty quick.

Now we can investigate a little bit what this actually looks like. I'm going to grab the agent docs there. You can see it's printed out all our metadata at the top here, so you can see we're keeping track of the file name, the content type, which is going to be either text or code, and the header path, which is how nested the markdown headers are. Obviously right now we're at the top level; it's just module guides. If I go to maybe five here, now you can see the header path is data agents, concept, tool abstractions. This is just a way to help the LLM understand what documentation it's even looking at right now. And again, you can access the metadata directly just by using that little .metadata attribute, and we can see here that it's formatted as a dictionary.

That looks good, but we can actually customize this even further. We can set up our own text template, so like we saw before on the slides, we could add a little header for the metadata, a little separator, and then the content. We can customize the metadata template, and then we can set the separator between the fields, so now basically the metadata will be a comma-separated list rather than line by line. We can go through and apply this to all the docs in the agent docs list, and then we can print the content from that doc.

You'll notice here I've added a metadata mode, and this is what LlamaIndex uses under the hood to get the text for different parts of LlamaIndex. There's a metadata mode for embeddings and for LLMs, and then also all. You can see here that it fetches all the metadata in a single comma-separated line, and then we have our content following the template that we specified up here.

Now we can get a bit more advanced with this customization. We could say that for the LLM metadata, so the metadata that the LLM reads, we don't want it to see the file name. Now when we call get content with the metadata mode LLM after applying this exclusion, we can see that the file name is no longer present in the text, which is what we wanted. We can apply the same thing to the embeddings, and we see that the embeddings also no longer see the file name.

And so that's basically it. In this video we've covered how to create documents and the different ways you can customize them, and I'll provide links down below so you can read my markdown reader. It's not perfect, but it works, and it's an example of how you can build your own loaders. They're not scary; it just takes a little bit of time to write, and then you have full control over how your data is loaded and what your documents look like. Hope this video was helpful and see you in the next
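
The metadata handling described in the captions (a separator between fields, a key-value template, a text template, and per-consumer exclusion lists selected by a metadata mode) can be illustrated with a small plain-Python sketch. This is not the LlamaIndex API or its implementation: `Mode`, `render_document`, all parameter names, the template strings, and the sample file name are hypothetical, chosen only to mirror the behavior shown in the video.

```python
from enum import Enum


class Mode(Enum):
    """Who the text is rendered for (hypothetical stand-in for a metadata mode)."""
    ALL = "all"
    LLM = "llm"
    EMBED = "embed"


def render_document(text, metadata, mode=Mode.ALL,
                    excluded_llm_keys=(), excluded_embed_keys=(),
                    separator=", ",
                    metadata_template="{key}=>{value}",
                    text_template="Metadata: {metadata_str}\n-----\nContent: {content}"):
    """Render text plus metadata as one string, hiding excluded keys per mode."""
    if mode is Mode.LLM:
        hidden = set(excluded_llm_keys)
    elif mode is Mode.EMBED:
        hidden = set(excluded_embed_keys)
    else:
        hidden = set()  # Mode.ALL shows every metadata field

    # Format each visible key-value pair, then join the pairs with the separator.
    metadata_str = separator.join(
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
        if key not in hidden
    )
    return text_template.format(metadata_str=metadata_str, content=text)


# Assumed sample values, for illustration only.
metadata = {"file_name": "agents.md", "content_type": "text"}
all_view = render_document("Some text.", metadata, mode=Mode.ALL)
llm_view = render_document("Some text.", metadata, mode=Mode.LLM,
                           excluded_llm_keys=["file_name"])
print(all_view)
print(llm_view)
```

Rendering with `Mode.ALL` keeps the file name, while rendering with `Mode.LLM` drops it, which is the same effect the video shows when the file name is excluded from what the LLM reads.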
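
The custom markdown loader is only described in the captions, not shown, so the following is a rough sketch of the idea rather than the author's reader: split markdown into text and code chunks, record a header path for each chunk, and let each code block keep the paragraph above it. The function name `parse_markdown`, the `preceding_paragraph` key, the `/` used to join the header path, and the sample input are all assumptions.

```python
import re

FENCE = "`" * 3  # built this way so the example can itself sit inside a fenced block
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def parse_markdown(markdown_text, file_name):
    """Split markdown into text and code chunks, each with a metadata dict."""
    chunks = []
    header_stack = []    # (level, title) pairs for the current header nesting
    buffer = []          # lines of the chunk currently being collected
    last_paragraph = ""  # most recent text paragraph, attached to the next code block
    in_code = False

    def flush(content_type):
        nonlocal last_paragraph
        content = "\n".join(buffer).strip()
        buffer.clear()
        if not content:
            return
        metadata = {
            "file_name": file_name,
            "content_type": content_type,
            "header_path": "/".join(title for _, title in header_stack),
        }
        if content_type == "code":
            metadata["preceding_paragraph"] = last_paragraph
        else:
            last_paragraph = content.split("\n\n")[-1]
        chunks.append({"text": content, "metadata": metadata})

    for line in markdown_text.splitlines():
        if line.startswith(FENCE):
            # A fence either closes the current code block or opens a new one.
            flush("code" if in_code else "text")
            in_code = not in_code
            continue
        header = None if in_code else HEADER_RE.match(line)
        if header:
            flush("text")
            level = len(header.group(1))
            # Drop headers at the same or a deeper level before pushing this one.
            while header_stack and header_stack[-1][0] >= level:
                header_stack.pop()
            header_stack.append((level, header.group(2).strip()))
            last_paragraph = ""  # do not carry a paragraph across sections
            continue
        buffer.append(line)

    flush("code" if in_code else "text")
    return chunks


# Assumed sample input, for illustration only.
sample = "\n".join([
    "# Data Agents",
    "Agents can call tools.",
    "## Tool Abstractions",
    "Define a tool like this:",
    FENCE + "python",
    "tool = make_tool()",
    FENCE,
])
chunks = parse_markdown(sample, "agents.md")
for chunk in chunks:
    print(chunk["metadata"], "|", chunk["text"])
```

Each resulting dictionary could then be turned into a document object with its text and metadata. Keeping the introducing paragraph in the code chunk's metadata is one way to implement the "code blocks keep track of the paragraph above" customization mentioned in the video.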

Info

Channel: LlamaIndex

Views: 10,083

Id: nGNoacku0YY

Length: 8min 41sec (521 seconds)

Published: Sat Jul 15 2023