Knuth–Morris–Pratt(KMP) Pattern Matching(Substring search)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
Hello friends, my name is Tushar and today I'm going to talk about KMP Substring Search so what is a substring search? Suppose I have a text and i have a pattern. I have to tell that does this pattern exist in this text or not? So, here my text is "abcbcglx" and my pattern is "bcgl" so this pattern does exist in this text at this point here so in this case my substring search should return 3, the index of the point from which this pattern exists. so what is our usual Algorithm to find the substring search what we usually do is we start from this from the 0th index of the text and 0th index of pattern and we compare 'b' with 'a'. Since they are not a match, we go to the next index of text which is 'b' so we compare 'b' with 'b', since it's a match we compare 'c' with 'c', again it is a match so we compare 'b' with 'g', at this point it's not a match so what we do is we start from the next letter to our start point. So then we start from 'c' and then we compare 'c' with 'b' again it's not a match so again we start from the next point. then we compare 'b' with 'b' so this is a match and then 'c' is a match, 'g' is a match and 'l' is a match so in this case we would return 3 which is the index of the point at which text exist and if our pattern was something like 'bcgll', in that case this pattern would not be found in the text and we would return an exception or -1 or something like that. so how much time would this algorithm take in the worst case, this algorithm would take O(mn) time where m is the length of the text and n is the length of the pattern so this is where KMP Search comes into the picture. the KMP Search can do the substring search in O(m+n) time and let's see how KMP Search works let's understand KMP algorithm using this text and this pattern. So we start from here. 'a' and 'a' is a match, 'b' and 'b' is a match 'c' and 'c' is a match. 'd' and 'x' is not a match so this is the point where we find no match so our aim is to not go backwards in this text so what we are checking is right before 'd' in this substring, is there a suffix which is also a prefix of this substring. Since all the characters are unique there is no way there can be a suffix which is also a prefix so what that means is our next comparison will start from 'x' and 'a' again we'll understand it little bit more in this example so we compare 'a' with 'x' it's not a match so we move to the next character this 'a' and this 'a' is a match this 'b' and this 'b' is a match this 'c' is a match, 'd' is a match, 'a' is a match, 'b' is a match 'x' is not a match so again we check in this substring before 'c' from 'a' to 'b' Is there a suffix which is also a prefix so if you look here, this "ab" is both the suffix and the prefix of this pattern what that means is, since we have matched till 'x' till this point . The character right before 'x' must be "ab" which is "ab" so what that also means is that since this is also a prefix we don't have to match "ab" again and the next match can start from 'x' and 'c' so what happened here is we didn't have to look backwards in the string to see where to start the, where to start the next match from. we can start from right here and here since this suffix is this prefix and since this suffix is this prefix and since this "ab" exist here, we've already matched them and so we start comparing 'c' with 'x'. So 'c' and 'x' is not same so we chevck, in here, is their a suffix which is also a prefix? it's not, so finally we start matching 'x' with 'a', so 'x' and 'a' is not same so we go to the next character, 'a' and 'a' is same, 'b' is same, 'c' is same, 'd' is same, 'a' is same, 'b' is same, 'c' is same and then, here we have a 'd' and here we have 'y', so again we check in this substring before 'y'. Is there a suffix which is also a prefix? so this suffix is also a prefix what that means is, since we've matched till 'y' this "abc" must be right there that's "abc". so what also means is that, since the suffix is also a prefix there is no reason to compare "abc" again and the next check can start from right here so this 'd' has to be compared with this one So this 'd' and this 'd' is same then 'a' is same, 'b' is same, 'c' is same and 'y' is same so this pattern exists in this text from this point onwards. Hopefully this helps you understand how the KMP Search works using suffix and prefix information. So the only thing is how do we do this efficiently? so in the next section, let's look at that so let's see how we'll efficiently compute if suffix is same as prefix and what point to start the check if there is a mismatch in the character between text and pattern so here we have a pattern "abcdabca", so first is always zero so this is my i and this is my j. if there is a mismatch if i is not same as j so we put zero here and we increment our i, again i is not same as j, so we put zero here and we increment i.i is not same as j, so we put zero here and increment i now i is same as j so what we do is, the value we put here is whatever is the value of j which is zero plus one so this becomes one and then we increment both i and both j so what this one means? So let's suppose our text was "abcdax" and so on and our pattern is "abcdab" so there will be a match till this point and this will be a mismatch right here between 'x' and 'b'. so point is where do we start the next comparison in the pattern so since the value here at the index of 'a' is stored one it means that there is already, there is a prefix which is same length as the suffix one, so we have already compared this 'a' and this 'a' so the next point to check will be this 'x' and this 'b' so this is where this one will come into the picture so again 'b' is same as 'b' so the value stored here will be the j which is one plus one that is two and then we increment i and we increment j. 'c' is same as 'c' so we store value two plus one three and then i goes here and j goes here so now 'a' is not a match with 'd' so this is little tricky so what you do is you go to the value on the previous character so that's zero so j will become zero and then you compare, is this guy same as this guy so those guys are same so the value here will be whatever is the value at j which is zero plus one so this is our temporary array and this array will help us doing the substring search efficiently. The Time complexity to build this array is O(n) and the space complexity is also O(n) next let's look at another example of how to build this array so here i have a pattern "aabaabaaa" so first is always zero, so i is here, j is here, since this 'i' is same as 'j' the value here will be the value of j plus one so this is one and then we increment both i and j so i comes here j comes here this value is not same as this value so the next point of j will be the value in the previous character which is zero so j becomes zero this is not same as this so this value becomes zero and we increment i. this is same as this, so value here will be j plus one, so this is one and then we increment both i and j. Again this is same as this, so this becomes two and we increment both j and i. again this is same as this, so the value here will be two plus one three and we increment both i and j again they are same so value here will be three plus one equals 4 and we increment i and then j. Again they are same, so value here will be five and we increment i and then j at this point 'b' is not same as 'a' so the next point j will be the value in the previous character which is two so j jumps directly to two, again this 'a' is not same as this 'a' so j will jump to the value in here so that's one so j comes to this point. At this point this 'a' is same as this 'a' so the value stored here will be whatever at j is one plus one, so this is two so this is how we build a temporary array for this pattern so again the time complexity is O(n) and the space complexity is also O(n) In the next section let's use this array to actually do the substring search so finally let's apply the substring search on this text for this pattern so i've prebuilt my array using the logic which we discussed in the previous section of this video so now we compare 'a' with 'a', so they are same then 'b' is same, so this 'c' is not same as this 'x' so then what's the next point, so since we are not going to look back in the original text we check what is the next point we compare this pattern and this text so the value, we go back to 'b' and check the value, the value here is zero so the next point of comparison will be this 'a' and this 'x' because the index because we're start looking from index zero which is 'a', so this 'a' is not same as this 'x' so we move forward this 'a' is same as this 'a', then this 'b' is same as 'b','c' is same as 'c', again 'a' is a match, 'b' is a match and then at this point, it's not a match so again we go back here and then two so what this two is saying is that there is a prefix of length two which is also a suffix which is clear, this "ab" and this "ab" are prefix and suffix. what that also means is that there must be an "ab" in the original text, this "ab", so we can safely ignore, we can safely ignore this two "ab" and start comparing this 'c' with this 'c' onwards which is how this two is useful so we start comparing from the position two in the substring. So 'c' is same as 'c' so we move forward, 'a' is same as 'a', 'b' is same as 'b' and 'y' is same as 'y' so this pattern exists in this text the run time complexity for this algorithm will be O(m) and where m is the length of this text the time to build this array, the temporary array is O(n) and so the total time for this algorithm will be O(m+n) and the space complexity will be O(n) the code is simple and it's in the description section of the video so this is all I have to talk about KMP Search I would ask my viewers to like this video, share this video and then comment on this video, also check out my Facebook page and my Github link github.com/mission-peace/interview/wiki Thanks again for watching this video.
Info
Channel: Tushar Roy - Coding Made Simple
Views: 842,889
Rating: 4.8458486 out of 5
Keywords: Knuth–Morris–Pratt Algorithm, KMP, KMP pattern matching, KMP algorithm, KMP substring search, KMP string matching, yt:cc=on
Id: GTJr8OvyEVQ
Channel Id: undefined
Length: 12min 49sec (769 seconds)
Published: Fri Jun 12 2015
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.