LangChain: How to Properly Split your Chunks

Video Statistics and Information

Captions
Today I'm starting a new series where I will cover and break down concepts related to LLMs, LangChain, and generative AI. It's going to be beginner friendly, and every video in the series will focus on a single tool or technique. The goal is not only to teach you how to use these tools but also to explain how they work under the hood, so that you can better utilize these tools and concepts in your own applications. Today we are starting with the recursive character text splitter in LangChain.

If you are trying to extract information from your documents, you have probably seen this: you need to divide your documents or text into smaller chunks in order to process them. So how is the size of those chunks determined? First, the text is divided based on a list of characters, so you are splitting by characters, not by tokens. Second, the chunk size itself is defined by the number of characters, not the number of tokens. This is a very important distinction, because a lot of people confuse the chunk size in these text splitters with a number of tokens, which is incorrect; it is actually counting characters.

To understand this, let's define a chunk size of 200 characters. The recursive character text splitter uses special separator characters to divide the text: the text is first divided into paragraphs, the paragraphs into sentences, the sentences into words, and the words into characters. So initially we start with paragraphs and check whether each paragraph exceeds the given chunk size. If a paragraph is smaller than the chunk size, we keep it; if it is larger, we divide it further into sentences and see whether we can combine multiple sentences within that paragraph to form a chunk. This will become very clear when we look at code examples.

To walk through everything, let's look at this Google Colab notebook. I have installed LangChain, and I am simply importing RecursiveCharacterTextSplitter from its text splitters module. Keep in mind that this splitter plays a very important role when you are extracting or retrieving information from your documents. For this simple example we are considering a text about Paris. The title itself counts as a single paragraph, and there are three more paragraphs, so in total we have four. I created a variable called text and assigned it the text shown above; the paragraphs are separated by \n, which indicates a new line.

To start with a very simple example, I define the chunk size to be 500 characters, with no overlap between chunks, and the length of each chunk is measured in characters. We pass our text to the splitter and get a total of three chunks. Recall that there were four paragraphs, so how did we end up with three chunks? Look at the chunk sizes: the first chunk has 438 characters, the second 436, and the last 445. What happened is that the first paragraph, the title of the page, is around 26 characters, and combining it with the second paragraph still stays below the 500-character limit, giving 438 characters in total. So the splitter starts from paragraphs and then combines subsequent paragraphs as long as the combined number of characters does not exceed the given chunk size. In other words, multiple paragraphs can be merged together as long as the total does not exceed the chunk size you defined. The second chunk is the third paragraph of the text, which is around 436 characters long. It cannot be combined with the last paragraph, because that one alone is around 445 characters, and combining the two would exceed the chunk size. I hope this makes clear how the initial pass works.

Now let's look at some more fun examples. If we reduce the chunk size to 250 characters, we get a total of 10 chunks, and none of them actually reaches the 250-character limit; all of them are much smaller. To see what is happening, we need to look at the individual chunks. The first chunk is the first paragraph, the title. The second chunk is a single sentence from the second paragraph; each of these chunks ends with a period, which means it is a complete sentence. Here is what happened: the splitter took the first paragraph, which has 26 characters, less than the 250-character chunk size, so it kept it. It then tried to combine it with the second paragraph, but that paragraph alone has more than 400 characters, so the title stays as a chunk of its own and the paragraph has to be subdivided further. For that, the splitter looks at sentences. Each sentence by itself is under 250 characters, but combining the first sentence with the second would exceed the chunk size, so the sentences are kept individually. There can be cases where two combined sentences stay under 250 characters, and that is fine too, but in this specific example the second level of division lands on individual sentences.

What happens if we reduce the chunk size even further, say to 50 characters? Now we get 32 different chunks, and this is where things get very interesting. There is a lot of variation in the chunk sizes: the first one is 26 characters (the title), but there is even one with only 11 characters. Again, look at the chunks: the first is the title, and the second is a sub-sentence. At this third level, the splitter starts dividing sentences into words, and you can end up in a situation where just one word is left over in a chunk; that is the chunk which is 11 characters long.

So why is the chunk size important? It comes down to information retrieval systems based on embeddings. For example, if you are returning only four chunks and you have a very small chunk size, you might simply be returning sub-sentences, and you may not be able to extract the information you need from them. That is why it is very important to understand how this works under the hood and what implications different chunk sizes have when you play around with them.

A couple of things to keep in mind. Select the chunk size based on the data you are working with, and pay close attention to the type of data you are providing. You might think that a large chunk size is the solution, but that is not always the case. Let me explain with a hypothetical scenario: in this text, the paragraphs talk about relatively different things. If you define a large chunk size and all of them are put together in a single chunk, it might confuse the LLM that reads the chunk and derives information from it. That is why you need to pay very close attention to what the chunks contain when they are returned by, say, a semantic search performed with an embedding model.

I hope this was helpful. In this introductory video we looked at the default separator list, but you can modify it for your own applications and needs. If there is interest from the community, I will create a follow-up video on how to do that, as well as videos on other topics. I have also seen some confusion around embedding sizes; for example, I like to use Instructor embeddings, which have a size of 512, and a lot of people are unsure whether that is a number of characters or a number of tokens. If there is interest, I can make a video on that, or on any other topics the community wants me to cover. If you found this video useful, consider liking it and subscribing to the channel. Thanks for watching, and see you in the next one.
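The splitting behaviour walked through above can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's actual implementation, and the sample text is my own stand-in rather than the exact Paris text from the video; LangChain's RecursiveCharacterTextSplitter uses the default separator list ["\n\n", "\n", " ", ""] (paragraphs, lines, words, characters) and more elaborate merging logic:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch of recursive character splitting.

    Split on the first separator; any piece still longer than
    chunk_size is re-split with the next separator. Adjacent small
    pieces are merged as long as the merge stays within chunk_size.
    """
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size and rest:
            if current:  # flush what we have merged so far
                chunks.append(current)
                current = ""
            # piece is too big: re-split it with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest))
        else:
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= chunk_size:
                current = candidate  # keep merging neighbours
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


text = ("Paris\n\n"
        "Paris is the capital of France. It is known for art.\n\n"
        "The city hosts the Louvre.")
for chunk in recursive_split(text, chunk_size=40):
    print(repr(chunk))
# Expected chunks (all 40 characters or fewer):
#   'Paris'
#   'Paris is the capital of France. It is'
#   'known for art.'
#   'The city hosts the Louvre.'
```

Note how the short title survives as its own chunk while the oversized middle paragraph gets re-split at word boundaries, just as described in the video. In LangChain itself, the configuration from the video would be roughly `RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0, length_function=len)` followed by a call to `split_text(text)`.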
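Since the characters-versus-tokens confusion comes up repeatedly in the video, here is a quick way to see exactly what chunk_size measures. The sample sentence is an arbitrary one of mine, and the four-characters-per-token figure is only a common rule of thumb for English text, not an exact tokenizer count:

```python
text = "Paris is the capital and most populous city of France."

# chunk_size in these text splitters compares against the character
# count, which is what Python's len() returns for a string:
print(len(text))       # 54 characters

# Token counts are different and depend on the tokenizer; roughly
# one token per ~4 characters of English, so a 200-character chunk
# is only on the order of 50 tokens.
print(len(text) // 4)  # ~13 tokens, rough estimate only
```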
Info
Channel: Prompt Engineering
Views: 12,328
Keywords: prompt engineering, Prompt Engineer, natural language processing, GPT-4, chatgpt for pdf files, ChatGPT for PDF, langchain, recursive text splitter, text splitter, recursive character text splitter, how to split text for langchain, langchain in python, langchain tutorial, langchain text splitters
Id: n0uPzvGTFI0
Length: 10min 41sec (641 seconds)
Published: Sat Aug 19 2023