LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

Captions
In this video we're going to take a look at what we need to do, and what we need to consider, when we're chunking text for large language models. The best way I can think of to demonstrate this is to walk through an example. We're going to go with what I believe is a good rule of thumb, the one I tend to use when chunking text to feed into a large language model. It doesn't necessarily apply to every use case; every use case is slightly different, but I think it's a pretty good approach, at least when we're using retrieval augmentation with large language models, which is where the chunking question comes up most often.

So let's jump straight into it. In this example we're going to take the LangChain docs, literally every page on that website, download those pages, and then split each one into more reasonably sized chunks. How are we going to do this? We're going to take a look at this notebook. If you'd like to follow along with the code, you can also run the notebook; I'll leave a link to it, which will appear somewhere near the top of the video.

To get started we're going to use a few Python libraries. LangChain is a big one here: not only is it the documentation we're downloading, it's also how we download that documentation, and how we split it into chunks. Another dependency is the tiktoken tokenizer, which we'll talk about later, and we'll use a few more libraries just to visualize things and make them a little easier to follow.

The first thing we do is download all of the docs from LangChain. Everything is contained under the top-level page of the LangChain docs; we save everything into a local directory, specifying that we only want the .html files. We run that, and it takes a moment to download everything; there's a lot in there, and my internet connection is pretty slow. Let's have a look at where these files land. Coming over to the left, we can see the rtdocs directory, and inside it the nested path langchain.readthedocs.io/en/latest, which just mirrors the path of the docs. In there, everything has been downloaded: we have the index page, which I think is the top-level page, and it's just raw HTML. We're not going to process this by hand; we're going to use LangChain to clean it up. If we scroll down a little, we can see the first page: "Welcome to LangChain. LLMs are emerging as a transformative technology", and so on, plus the other pages. We're going to process all of this.

Back to our code: the download is done, so now we use the LangChain document loaders, specifically the ReadTheDocsLoader. Read the Docs is a documentation template used quite often for code libraries, and LangChain includes a document loader built specifically for reading that type of HTML page and processing it into a nicer format. It's really easy to use: we just point it at the directory we just created.
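The download and loading steps look roughly like this. This is a minimal sketch: the docs URL and directory names follow what's shown in the video and may have changed since.

```python
# In a notebook cell: recursively download every .html page of the
# LangChain docs into ./rtdocs (URL as shown in the video).
!wget -r -A.html -P rtdocs https://langchain.readthedocs.io/en/latest/
```

```python
from langchain.document_loaders import ReadTheDocsLoader  # needs beautifulsoup4

# Point the loader at the downloaded directory; it walks it recursively.
loader = ReadTheDocsLoader("rtdocs")
docs = loader.load()
print(len(docs))  # number of HTML pages loaded (~390 at recording time)
```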
Loading those docs, I print out their number so we can see we have 390 HTML pages downloaded. For some reason, when I ran this about an hour ago there were 389; now there are 390 pages, so the docs have already been updated. Let's have a look at one of those pages. We have this Document object, and inside it the page_content, which is the text of the page. If we print it in a nicer format, it looks pretty good. There are some slightly messy parts, but it's not really a problem; we could try to clean them up if we wanted to, but honestly I don't think it's worth it, because a large language model can handle this very easily. So I personally wouldn't bother; I'd take it as it is.

At the end of this object, if we scroll right to the end, we see the metadata. Inside the metadata we have the source, which in this case is the file path. Fortunately, the way I set this up means we can just replace rtdocs/ with https:// and that gives us the URL for this particular file. That's what I'm doing here: replace rtdocs/ with https://, and then we can click that link and land on the live page.

Now, this is where we start talking about chunking. When we're thinking about chunking, there are a few things to consider. The first is: how many tokens can our large language model, or whatever process we're running, handle, and what is optimal for our particular use case? The use case I'm envisioning here is retrieval augmentation for question answering with a large language model. What does that mean exactly? It's probably best if I draw it out.

We have our large language model over here, and we're going to ask it a question, say "What is the LLMChain in LangChain?" If we pass that straight to the large language model, then with GPT-3.5 Turbo, or even GPT-4, it can't answer, because it doesn't know what the LangChain library is. In this scenario, what we do instead is go to a vector database (we don't need to get into too much detail here), which is where we store all of the documents we're processing; all those LangChain docs end up within that space. The relevant ones get retrieved, and we pass, say, five of those chunks of text that are relevant to our particular query into the model alongside the original query.

So rather than the prompt just being your query, you have your query plus those five bits of relevant information below it, and all of that goes into the large language model. You'd probably also have some instructions at the top, saying something like "answer this question using the context provided" (maybe giving the question a bit later on), and in front of each context you'd literally write "Context:".
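As code, the prompt being described might be assembled like this. This is purely illustrative: the video draws this structure on a whiteboard rather than showing code, and the exact wording of the instructions is an assumption.

```python
# Hypothetical prompt assembly for retrieval-augmented QA (illustrative only).
query = "What is the LLMChain in LangChain?"
contexts = ["<retrieved chunk 1>", "<retrieved chunk 2>", "<retrieved chunk 3>",
            "<retrieved chunk 4>", "<retrieved chunk 5>"]

prompt = (
    "Answer the question below using the contexts provided.\n\n"
    + "\n\n".join(f"Context: {c}" for c in contexts)  # one "Context:" per chunk
    + f"\n\nQuestion: {query}\nAnswer:"
)
```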
The large language model will then answer the question based on those contexts. That's the scenario we're envisioning, and in this scenario, if we want to feed five of these contexts into each retrieval-augmented query, we need to think about the maximum token limit of our large language model and how much of that space can be reserved for the contexts.

Let's say we're using GPT-3.5 Turbo. Its token limit is 4096, and that includes everything: the input to the large language model, so all of your input tokens, and also all of your generated output tokens. So we can't just spend the whole 4096 tokens on the input; we need to leave some space for the output. And within the input we have other components too: not just the contexts but also the query, probably some instructions, and, if this is a chatbot, maybe a bit of chat history. So the amount of context we can feed in is actually pretty limited.

In this scenario, let's just assume we can pass in contexts totalling around half of those 4096 tokens, so 2000 is our limit. If 2000 is our limit, we need to divide that by five, because those 2000 tokens need to be shared by our five contexts, which leaves us with about 400 tokens per context. That's our maximum chunk size.

One question we might have here is: could we reduce the number of tokens further? For sure we can. I'd say the minimum size for a context is this: when you read it, does it make sense on its own? If there are enough words for the context to make sense to you as a human being, then it's probably enough to feed as a chunk of text into a large language model, an embedding model, and so on. As long as a chunk has enough text to carry some meaning by itself, it's big enough; that should be the criterion for the minimum size. For the maximum size of a chunk, we have the 400 tokens we just calculated.

With all of that in mind, we need to look at how we'd actually calculate the size of these chunks, because we're not basing it on character length; we're basing it on token length. To do that, we need to tokenize text using the same tokenizer our large language model uses, and then we can count the number of tokens within each chunk. We're going to use the tiktoken tokenizer. This is specific to OpenAI models; if you're using Cohere, Hugging Face, and so on, this will be a slightly different approach. First we want to get our encoding; there are multiple tiktoken tokenizers that OpenAI uses, and this is just one of them. Let's initialize it. You can find the details of which tokenizer each model uses in OpenAI's tiktoken GitHub repo, in tiktoken/model.py.
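Putting the token-budget arithmetic and the tokenizer setup into code, a minimal sketch:

```python
import tiktoken

# Token budget for GPT-3.5 Turbo's 4096-token limit, assuming roughly half
# the window is reserved for retrieved contexts, shared across five of them.
context_budget = 4096 // 2        # ~2000 tokens for all contexts
chunk_size = context_budget // 5  # ~400 tokens per context

# Initialize the encoding used by the more recent OpenAI models.
tokenizer = tiktoken.get_encoding("cl100k_base")
```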
Clicking through, this is the openai/tiktoken repository on GitHub, and you can see the MODEL_TO_ENCODING dictionary, which maps each model to the particular tokenizer it uses. We're going to use the GPT-3.5 Turbo model, which uses cl100k_base, and I'd say most of the more recent models, the ones you'd be using at the time of recording this video, all use this encoder. The most up-to-date embeddings model uses cl100k_base; ChatGPT's GPT-3.5 Turbo uses cl100k_base; GPT-4 uses it too. The only still-relevant model that doesn't use that encoder is text-davinci-003, which uses p50k_base. In reality you don't even need to go to the repo to find the encoding you need: you can just run tiktoken.encoding_for_model, and it tells you. That's how we know it's cl100k_base.

Next, I'm creating this tiktoken_len function. It takes some text and uses the tokenizer to calculate the length of that text in terms of tiktoken tokens. That's important because we need it for our LangChain splitter function in a moment. Once we've created that, before jumping into the chunking itself, let's look at what the lengths of the documents are right now. I calculate the token counts with the tiktoken_len function, and then we can see the minimum, maximum, and average number of tokens. The smallest document contains just 45 tokens; that's probably a page we don't really need, one that probably doesn't contain anything useful. The maximum is almost 58,000 tokens, which is really big; I'm not sure what that is. The average is a bit more normal, around 1,300. We can also visualize the distribution of pages by how many tokens they have, and the vast majority sit towards the 1,000-token range.

Now let's continue and look at how we're going to chunk everything. Again we're using LangChain here, using a text splitter, specifically the RecursiveCharacterTextSplitter. This is, I think, probably one of the best chunkers or text splitters that LangChain offers at the moment. It's very general-purpose; they also offer text splitters that are more specific to, say, Markdown, but I like this one because you can use it for a ton of things. Let me explain it very quickly: it takes your length function, the tiktoken_len we defined, and it splits your text so that each chunk does not go over the chunk size, our 400, splitting on the separators. The reason we have multiple separators is that it first tries to split on double newlines (the "\n\n" separator); if it can't find a good split using double newline characters, it tries a single newline, then a space, and as a very last resort it will split on anything.
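The encoding lookup, the length function, and the quick corpus statistics look something like this (a sketch; `tokenizer` and `docs` come from the snippets above):

```python
# Look the encoding up directly rather than reading model.py:
print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)  # cl100k_base

# Length of a text in tiktoken tokens, used as the splitter's length function.
def tiktoken_len(text: str) -> int:
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

# Quick look at document sizes across the corpus.
token_counts = [tiktoken_len(doc.page_content) for doc in docs]
print(f"min: {min(token_counts)}")                      # 45
print(f"max: {max(token_counts)}")                      # ~58,000
print(f"avg: {sum(token_counts) / len(token_counts)}")  # ~1,300
```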
One final parameter is the chunk overlap, which says: for every chunk, we overlap with the next chunk by 20 tokens. Let me draw that out so it makes more sense. Imagine we have a ton of text, and we take a chunk of 400 tokens, going from here all the way to, say, here. If we don't have any chunk overlap, the next chunk would simply be the next 400 tokens from that point. But this comes with a problem: we don't know whether the information at the end of the first chunk and the information at the start of the next are related. They could be, so we might be missing out on some important information by splitting right in the middle. It's important to try to avoid that if possible, and the most naive approach for doing so is to include a chunk overlap. What we do is step back 20 tokens behind the split point, so that span is now shared by the first chunk and the next chunk, which also shifts the next chunk back. So chunk one goes from here to here, chunk two starts 20 tokens before the end of chunk one, and likewise chunk three and chunk four each start 20 tokens before the end of the chunk before them. The chunk overlap is just to make sure we're not missing any important connections between our chunks. It does mean we have a little more data to store, because we're including those 20-token spans in multiple places, but I think that's usually worth it for the better performance you get by not missing important connections between chunks.

So we initialize the splitter, and then to actually split the text we use text_splitter.split_text. We take docs[5] and its page_content, which is just the plain text, and based on the parameters we set (chunk size of 400 and chunk overlap of 20, using the tiktoken_len length function) we get two chunks. Looking at the lengths of those two chunks: the first is 346 tokens, the next 247.
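Initializing and running the splitter as described might look like this (a sketch, reusing the tiktoken_len function from above):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,               # hard upper limit, measured in tokens
    chunk_overlap=20,             # tokens shared between consecutive chunks
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""],  # tried in order, best-first
)

chunks = text_splitter.split_text(docs[5].page_content)
print(len(chunks))                        # 2
print([tiktoken_len(c) for c in chunks])  # [346, 247]
```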
Both are within that maximum upper limit of 400. You can see it won't necessarily split at exactly 400 tokens, because we have these specific separators we'd prefer to use, and it optimizes for the preferred separator, so we're not going right up to the limit with every single chunk. That's fine, and actually kind of ideal; we don't necessarily need to pack in a ton of text.

That's a single document, and what we're going to do now is repeat that over the entire dataset. The final format I want to create looks like this: we have the ID, our text, and the source where the text actually came from. One thing you'll notice is the ID: we create an ID that is unique to each page, but we're going to have multiple chunks per page, so we also append a chunk identifier to the end of the ID to make sure that every ID for every chunk is actually unique. Let me show you how we create that. We take the URL (replacing the rtdocs/ prefix with the actual https:// protocol; I print it out so you can see what it looks like) and pass it to hashlib.md5. That's a hashing function that turns our URL into a unique identifier. This is useful because, if we update this dataset at some point in the future, we can use the same hashing function to create our unique IDs, and that means that when we update a particular page, it just overwrites the previous versions of that item, because we're using the same ID. Of course we can't use the exact same ID for every chunk of a page, so we also add the chunk identifier, which is just a count of the chunks.

You can see it being created here, with two examples from the page we just looked at: the chunk identifiers differ, and indeed the chunks are different. This one ends with "Language Model Cascades, ICE Primer Books, Socratic Models"; if we look at the start of the next item, we should see something similar, because of the overlap I mentioned, and indeed there it is: "Language Model Cascades, ICE Primer Books, Socratic Models". Same text; that's the overlap.

Now we repeat this same logic across the entire dataset: for each page, take the URL, create the unique ID, create the chunks using the text splitter, and append them all to our documents list, which is just where we store everything. The number of documents an hour ago was a little lower; now it's 2,212 documents. We can then save them to a JSON Lines file, which is what you can see here: if we look at the first five documents, it's the same data, just stored as one JSON object per line. Once you've saved it and created your .jsonl file, you'd load it back from the file like this:
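A sketch of that full loop plus the JSONL round trip. The id/text/source keys follow the format shown in the video; the exact "md5hash-chunkindex" ID string is an assumption based on the description above.

```python
import hashlib
import json

documents = []

for doc in docs:
    # Turn the local file path back into the page URL, then hash it so that
    # re-processing an updated page overwrites the same IDs.
    url = doc.metadata["source"].replace("rtdocs/", "https://")
    uid = hashlib.md5(url.encode("utf-8")).hexdigest()
    for i, chunk in enumerate(text_splitter.split_text(doc.page_content)):
        documents.append({
            "id": f"{uid}-{i}",  # page hash + chunk index -> unique per chunk
            "text": chunk,
            "source": url,
        })

print(len(documents))  # 2,212 at recording time

# Save one JSON object per line...
with open("train.jsonl", "w") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")

# ...and load it back the same way.
documents = []
with open("train.jsonl", "r") as f:
    for line in f:
        documents.append(json.loads(line))
```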
So with open("train.jsonl"), or wherever you stored it, you just load it iteratively like that, and you can take a look and see it's all there. That's how you'd load it.

Now, a couple of things here. The reason we're using JSONL, and the reason I'm calling the file train.jsonl, is that this makes it very compatible with Hugging Face Datasets, which is essentially a way of sharing your dataset with others, or just making it more accessible to yourself if you set it as a private dataset. I want to show you how we can actually go about doing that as well. The first thing to do is go to huggingface.co, which brings you to the Hugging Face home page. It may look different for you, because you may not already have a Hugging Face account; if you need an account, or you need to sign in, there's a little button that says Sign Up or Log In. Follow that, create your account or log in, and you'll see something like this. At that point, go over to your profile and click New Dataset. We give our dataset a name; I'm going to call it langchain-docs, but you can obviously call it whatever you want. You can set it to private if you want to keep the dataset private; for me, I'm going to leave it public. Then you create your dataset.

That lands you on the dataset's home page. Go to Files, then Add file, Upload files, and drag in the train.jsonl file; for me, that's here, and I just drag it in. We go down, commit changes to main, and we've now uploaded it. Clicking on Files, we can see the train.jsonl file in there. To actually use it in our code, we need to install datasets, Hugging Face's library for datasets, and then we write: from datasets import load_dataset, and our data is load_dataset with the name of our dataset. We can find that name at the top of the dataset page; for me it's jamescalam/langchain-docs, so we just copy it in, with split="train", which is where the train.jsonl name comes in (there's a sketch of this just after the transcript). Once that has loaded, we can just extract things: data[0] shows we have our text in there. It's super easy to work with, and that's why I recommend storing your data on Hugging Face Datasets if you want to share it. Even if you want the private approach, you can do that as well; you just need an API key, and that's pretty much it.

So that's it for this video. I just wanted to cover some of the approaches we take when we're considering how to chunk our text and actually process it for large language models, and also how we might store that data later on. Both of these items get missed a lot in the typical videos, which really focus on the large language model processing, or the retrieval augmentation, or whatever else. In reality this is probably one of the most important parts of the entire process, but we miss it pretty often. Anyway, that's it for this video. Thank you very much for watching; I hope this has all been useful and interesting, and I will see you again in the next one. Bye!
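For reference, the dataset-loading step described above, as a sketch (the dataset name follows the one created in the video; yours will differ if you upload your own):

```python
from datasets import load_dataset  # pip install datasets

data = load_dataset("jamescalam/langchain-docs", split="train")
print(data[0])  # {'id': '...-0', 'text': '...', 'source': 'https://...'}
```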
Info
Channel: James Briggs
Views: 15,359
Keywords: python, machine learning, artificial intelligence, natural language processing, bert, nlp, Huggingface, semantic search, similarity search, vector similarity search, vector search, langchain, openai, llm, chatgpt, gpt 4, gpt-4, hugging face, langchain chatgpt, langchain python, openai api, gpt 4 api, chatgpt 4, gpt 4 python, gpt 3.5, james briggs, langchain split text, langchain tutorial, retrieval augmentation, openai chat, chatbot, chatbot python, llm gpt, language model
Id: eqOfr4AGLk8
Length: 29min 48sec (1788 seconds)
Published: Thu Mar 23 2023