The Trie Data Structure (Prefix Tree)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
today we're talking about tries that is the tri data structure we're going to talk about what it is how it works and of course we're going to build one hey everybody welcome back sorry i'm a little groggy this morning it's really early today i want to talk about another data structure and a bunch of you have requested that we talk about tries on this channel so i thought why not so that's our topic for today source code is available through patreon as usual also some of you may not know this but i do typically post the source code for my videos about two or three weeks in advance so in addition to getting access to my source code for my videos those that support this channel through patreon also get a little bit of a preview into what is coming up just because it shows up in the source code a few weeks in advance and of course thank you so much to all of you who help support this channel we've recently talked about both trees and lookup tables links below in the description if you missed those videos but if a tree and a look-up table had a baby the try would be it also apparently the tri was initially pronounced tree by edward fredkin back in 1960 long before i was born because it was supposed to be the middle syllable of retrieval which of course was a terrible idea because then you'd never know if someone was talking about a tree or a tree like the general tree or the more specific tree so that's a total mess but we gotta it sorted out and now we just call them tries so we're fine we're all we're all fine so basically a try is a fun little data structure that we use for retrieving things for retrieving strings of symbols so i want to make it clear when we talk about strings we could be talking about strings of characters but we can also be talking about any sequence of symbols so any sequence of symbols can do the trick here and we'll work with a try this could be strings of binary bits but it could also be ascii characters which is what we're going to use today so tries are also sometimes called prefix trees or digital trees so if you hear those terms we're talking about the same thing so the main thing is a try is going to allow us to efficiently check whether or not a particular string is in our set of strings so let's look conceptually at how this works like i said before a try is a tree so it has nodes and each node is going to represent one symbol in a sequence and the node's children are going to represent all of the symbols that could come after that symbol so let's say i want to store these strings in a try so if i stored these strings in a try it would look something like this you notice that each node represents one symbol and we can move down the tree to see if each word is contained in it also notice that i have bolded the terminal nodes that is the nodes that represent the end of a string so we have to mark those nodes or we wouldn't know whether or not say cat c a t t with two t's isn't also one of our words now sometimes you will also see tries shown like this with one node representing a substring this representation naturally produces smaller trees with fewer nodes but it also is a little more complicated and so for simplicity today we're going to stick with one symbol per node i'll leave the substring nodes as an exercise for you or maybe a topic for a future video but so why is this a good way to store a set of strings well the main reason is that some key operations inserting new strings for example deleting strings and most importantly searching to see if a string is in the set are quite fast and they also scale well so look at an example say that i want to test to see if cat is in the try well i start at the root and i can see that c is one of its children so that's great and i can see that a is one of c's children and t is one of a's children and this t node is marked as terminal so i know that cat is stored in my try and that operation is going to stay the same regardless of how many other strings i store in this try i mean i could store 10 000 strings in this try and it would take more memory to store of course but finding cat in this try would not change it would take the same number of operations so that's really nice and that's also true when we insert stuff and delete stuff to and from our try now another nice feature of a try is that it's also very straightforward to print out strings in the try in sorted order and that's basically just using a simple depth first search so we'll talk more about that later but this makes the try a solid data structure for people who want to store sets of sorted searchable symbolic strings wow try to say that fast sets of symbolic searchable sorted strings or sorted strings of search searchable sorted sets of strings anyway lots of s's that's definitely trouble but enough about the conceptual stuff let's get into the code and see if we can build one of these so i'm going to implement this try in c though it would be fairly straightforward to report this to c plus plus java python ruby pretty much any other language now let's just start out up here by defining our node just like our other tree examples we're going to have a struct and it's going to let's type def it and it's going to define our node so we're going to call this trinode okay now inside of our struct one thing that's going to be a bit different about this node is that i'm actually not going to store the character value of the node in the node which may seem a little weird we did that before before we had an actual value that we stored in the nodes today i'm not going to do that and i i could do it but as you're going to see i don't really need to because it's going to be implied by the links from the parent so first of all i'm going to add a boolean value here and this is going to call this terminal this is going to mark the node to see whether or not this node is the end of a word so when i say terminal all that means is that it's it's the terminal node of a particular string and now let's look at adding the children so we need child links for all the different things that could come after and our children are going to be an array of trinode pointers so i'm going to do here trinode and we'll call it children and then let's just say we're going to do the number of different characters in our set of symbols that we want to consider and each of these pointers is going to tell us where to find the node for that particular character and these pointers will also be set to null if there's no child for that character now this array is basically going to be a lookup table it's going to act just like a lookup table again if you haven't seen anything about lookup tables check out my recent video on lookup tables but note that using a lookup table means that we don't have to search through the list of children to see if a particular letter is there we just put the value of the character into the array and we instantly know whether or not it's a child of this node and yes that does waste a bit of space but that's pretty common in computer science we often trade space for speed and searching for children in this case would be slow and so we definitely want speed in this case so we do have to decide how big to make this array um for today i'm just going to let's just set i'm going to pound to find num chars to be 256 and that's because that's how many different values a single byte can have so the upside of that is that basically each of the different characters in my string i will have a next link for i don't have to worry about bounce checking on my lookup table of course that is going to waste a bit of space if you wanted to save space and only include the letters or letters and digits that go along with the language that you're trying to support really whatever symbol set you want then you could do that and it would give you nodes with fewer children that would use less memory but you would have to do bounce checking and so for today i just wanted to be able to store any ascii string that you throw at me so i'm going to use 256. so this is our node struct now let's come down here and start building some operations so the first one i want to make is going to create nodes i know i'm going to need that so let's have it return a pointer we'll call it create node so this is just going to run and it's going to create it's going to allocate a new node and return a pointer to it so call we'll make a new pointer new node and malloc size of new nodes that'll give us the a chunk of memory that can store the new node that we have that we want to create and then let's make a quick for loop that goes through the children basically just gonna go through all of our characters and we're gonna set new nodes child so each of the children basically the ith child we're going to just set to null so this is starting out all of our children links as null and then we're also going to come in here and set it to be not a terminal which makes sense and then we're going to return our new node so anytime we need a new node this is going to allocate space for it and it's going to initialize it to default values and then we'll change those later on okay so that's all we need to create a new node now let's look at inserting things into our tree and then we're also going to print out the tree okay so let's talk about insert so our insert function is going to return a bool and call it try insert and the boolean is going to tell us whether or not we actually inserted the new word so if it's false that means the word is already there the string is already inside our set and if it returns true then it means that we actually inserted it and the function is going to take a double pointer so let's make china double pointer which is a pointer to the root now you may wonder why it's a double pointer i'll get to that in a second but the second thing i'm going to add is a pointer character pointer which is going to be the text value that we're trying to insert so this is the string we're trying to insert okay so a couple things that may be a little strange you may wonder why we're passing in a double pointer here the reason is is that under certain circumstances we may want to actually change the root and i'll show you where that actually comes in is like right here if you say that if the root happens to be null meaning the tree is empty then i need to actually set it to a new node i need to we're just going to call our create node like this and this is actually going to create a new node and set our root to change the actual root pointer so that it points to that new node if i pass in a single pointer that's not going to work because c uses pass by value and so i'm basically just changing a copy of the root and that's not going to have the same effect after this function returns that change would be null and void no pun intended but if we pass in a double pointer then we can change the value and so that's what i'm doing here now you'll also notice something strange i called this signed text instead of just calling it text now one of the challenges is on some machines char you know the character data type is signed meaning that it can be negative and positive and i really don't want any negative indexes coming into my lookup table so i'm passing it in as signed because that's just the normal way that it's going to be passed in but i want to come in here and have an unsigned character call it text and it's just going to we're just going to cast our signed text to be our text value and then this is what we're going to actually use internally and this just basically forces all these characters you know same number of bits but we're forcing them to be unsigned rather than signed and that just prevents us from having issues on some machines where we end up getting negative indexes into our lookup table so also up here i'm going to create a temporary variable it's going to be a temporary pointer to a node and i'm going to point it to the root to begin with okay i'm going to use this for traversing my tree and it's going to start at the root that's why i'm setting it to the root and then we're going to use that to go down through the tree also one more variable let's create a length variable and this is going to be the length of our signed text so this basically just tells us how far we need to go so at this point we just need to make a for loop okay we know how far we need to go so a for loop is going to fit we basically know how long our string is and so we're going to go i'm going to take i starting at 0 we're going to go the length of the string and then each time through here what we're going to do we got to check a few things okay so we're inserting so anytime we come in here and we notice that the children look at the child based on the symbol that we see in our string okay so if this symbol if the child for this symbol is null then simply what we're going to do in here is create a new node okay so children text sub i equals create node using that function we made before and this basically will just create a new node for this child now if there if it's not null then we didn't need to create this new child at all and we can just set temp equals temp children text sub i so we're basically just moving temp to its child so we're basically moving through the string and as we move through we move our temporary pointer to point to the appropriate child so we look at the character that tells us where which child pointer we need to look at then we look up that child pointer and we follow the pointer down to the next node so this just is traversing the tree according to the characters that were in the string and if we ever run into null pointers then we create new nodes for those and fill out that part of the tree so at this point once we're done with the for loop we know we have all the nodes we need okay so now we have to basically try to figure out whether this word is already in the try and we can do that simply by checking the terminal value of the node so if temp terminal so in this case we just return false meaning that yeah it's already here otherwise then what we're going to do is set the terminal value to be true because so it wasn't meaning meaning that we had a node for this word already in our try but it wasn't a terminal so that's that string wasn't actually in the try even though it had all the nodes it needed and so we're going to set terminal in that node to be true now that tells it now it actually is in here and we're going to return true okay so that's pretty straightforward and unless i did something wrong that is going to work that's going to insert a node into our try so before we test it we will test it just here in a second before we test it though let's run down here and actually print out our try i always like to add a print out function pretty early because it gives me a pretty convenient way to test out my data structure just to make sure that the basics are working so let's make a function called print try it's not going to return anything because i just want to print it out we'll pass in a single pointer to the root of our trinode and for this function i'm going to use recursion i've talked about recursion with trees before it just it's so often so nice when working with trees and the try is no exception so if my root is equal to null so this is kind of just a special case we're going to say if it's null then we're going to print out try is empty you can of course adapt this to whatever you actually want to print out oops and we'll return so nothing to do here and if it's not null then i'm going to go through the children of this particular try and print out any words we find here so for this we're going to make a second function i'm going to call this print try wreck because i'm lacking creativity today so this is going to be the recursive version and i'm going to pass in my root and right now there are no prefix symbols so i'm just going to pass in null i'll explain this in just a minute and then the last argument is going to be the number of symbols we've seen so far so that's going to start out as 0. so if this seems a little weird don't worry about it i'm about to explain what's going on so let's come up here and we're going to create this new function this recursive print try wreck function and as you could imagine we have a tri node pass that in that's going to be the current node we're looking at i'm not going to call it root because it's not always the root i mean it's the root of a subtree but i find calling it node is a little bit less confusing to people reading my code we're going to pass in a prefix and this is basically just the history this is telling us the characters we've seen so far to get to this point in the tree and then of course we also want to pass in a length now i could figure out the length each time but this is just going to make things a little more efficient okay so now as we get started we know we're going to need a new prefix when we call down to the next level so let's make a unsigned character here new prefix and it's gonna be length plus two now why is it two is because i need i know how long the previous prefix is and i need space to add one more symbol one more character and i also need space for the null character and let's just copy the old prefix over to the new prefix with mem copy and we're going to copy the number of symbols that we had in the length so this still leaves those last two characters empty and then for right now what i'm gonna do is just set new prefix length plus one to be equal to zero okay so we're gonna null terminate it that's important just to make sure that we don't run off into nowhere actually the fact that we're passing lengths around that's going to be key when it comes time to actually print this stuff out which we're going to do right here so our base case is again a recursive function so you typically deal with your base cases first your base case here is going to be if my node that i'm currently looking at is terminal then we're just going to print out the word okay so we're just going to print f in this case i'm just going to say word colon and we'll just print out the word that we're finding here and the word is in this case going to be the prefix right this is the characters it took to get to this point so we're just going to print out the prefix and that's great now at this point we might be tempted to return and we often do at the end of base cases but there might be more children below this node so we actually need to recurse even after we've handled this case so we'll do this with a simple for loop um we'll go from zero to in this case the number of possible children so the number of characters and then we're going to check to see if the ith child if it's not equal to null meaning there's actually something down there to look at because we don't want to recurse unnecessarily i don't necessarily need this check i could have moved it as one of the base cases but that would actually waste a lot of time in a lot of extra function calls and i don't want to do that so we're gonna check it here and then we're gonna say new prefix length equals i okay so this basically is just setting that last symbol the next symbol in our prefix to be the value that we're checking right now and then we're going to recurse and we're going to call print tri-rec and we're going to call it with the child node this time so children i and we'll pass in the new prefix and we'll pass in length plus one because we added one symbol to the prefix and so at this point we're really done this is going to continue to go through our tree and print out any words it finds and it will print them out in alphabetical order well let's call it ascii medical order because it's going to just go by the ascii codes but this is nice now let's come down here and make sure that it actually works because it's possible that i screwed something up so let's make a trinode pointer called root set it equal to null because we're starting out empty and let's call our try insert function a few times and we'll pass in the address of our root and let's pass in yeah we'll just get a few words we'll do some of those ones that we saw before so let's do kit cattle can cat and let's add one that doesn't fit with these others let's do happy and then let's do print not print f print try and print try the root okay great so this should work um with any luck i do have a make file over here really nothing fancy it's just like the make files you've seen in all of my other videos if you've never seen a make file before please check out my other make videos they'll give you a quick rundown of how this works um but this is going to allow me to come down to my terminal and run make which let's see yes it did work okay great and then we're going to run it and you can see sure enough it did print out all of our different strings that we had in the try and it printed them out in alphabetical order or ascubetical order now notice and this is something to keep in mind when you're testing your own code notice that i intentionally put these in out of order the reason i did that is i wanted to make sure that the actual ordering was working properly i wanted to be able to see that so that's just something to keep in mind when you are actually testing your code you do want to make some effort to try to break it you want to try to give it scenarios where it might not work i find new programmers spend a lot of time going way too easy on their own code they test it without the without really giving it a challenge and the consequence is that you often don't know whether your code actually works or not okay so now we have the basics of our try working and this is where i'm gonna need to end for today just because i'm out of time at this point we can create and print our try i know we haven't gotten to searching and deleting nodes yet and we'll get to that in next week's video and that video will come out next tuesday unless i get some extra time and end up posting it earlier but i hope this is helpful i hope it gives you a clearer conceptual idea of what a try is how it's used and how you could start implementing one we'll finish this example in next week's video but subscribe to the channel so you don't miss that video in other future videos like the video if it was helpful and until next time i'll see you later
Info
Channel: Jacob Sorber
Views: 10,909
Rating: 4.9699249 out of 5
Keywords: prefix tree, prefix tree datastructure, trie datastructure, trie tree, trie, trie tutorial, trie in c, trie implementation, prefix tree implementation, programming tutorial, trie prefix tree, prefix trees, understanding tries, understanding prefix tree, diy prefix tree, data structures, Data structures in c
Id: 3CbFFVHQrk4
Channel Id: undefined
Length: 21min 7sec (1267 seconds)
Published: Tue May 18 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.