Knuth–Morris–Pratt (KMP) Pattern Matching Substring Search - First Occurrence Of Substring

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
all right so today we are going to talk about a topic called KMP substring search this is an algorithm invented by Donald Knuth von Prats and independently by a third guy named James Morris so basically what's the whole point of the algorithm is is we're given a pattern and we're given a string and our goal is to find whether the pattern exists in the string we see that a a a B exists here in the string so the whole point is we want to find this indicee and return that value or it could be different variations of that but the whole point is find the pattern in the street what we do is the normal way that it's done in the naive approach is to look at the first characters these match so because these match we're going to we're going to begin to search on a so these aids match so they match so move on to the next characters to those match yes advance forward one do those match yes advance forward one and notice we're still doing a search just all aim we're just saying does if the substring started here is it a match so advance one and what we see is a and B do not match so our substrate pattern that one it does not exist it does not exist from this indicee so therefore we advance forward one from here and so we check does do a and a match yes they do so then we go into a search so does an a match yes continue on with the search do an a match and notice where we're saying does this if this substrate started from indicee one I should have numbered them but if it starts from indicee one do Kent those will this pattern exists and maybe it's a match and then finally we notice that these don't match and so what we do is we revert all the progress that we made and then we go back to the beginning and we move we move our search pointer we know it cannot start here so we move our search pointer and so now do a and the first character of this pattern string match yes they do so go into a search do those match yes do those match yes and finally do these match yes and when we go over the pattern stream then we know we've matched the whole pattern and that means from indices for indices - we know that this pattern exists in the string so this is the naive approach but notice notice that for each of these for each ending if we call if we take the length of s we could potentially do length of P work so what happens is we have length of s times length of P as our time complexity so the KMP algorithm is meant to be an improvement on it n times m complexity this is a demonstration of the worst case where for each of the N characters in s we do a potential of M work all right so the key thing that the KMP algorithm takes advantage of during our subs research is to take advantage of these successful comparisons that we make between s and P and we know that every character up to the character that mismatches are a match and therefore we're going to make we're going to look for a suffix that is also a prefect while we'll see in our example so first we're going to look at a MZ so - a and D match no so we advance forward in our in our string so do d NZ match yes so advance forward so do s and s match yes do G & G match yes do W and W match yes do a and a match yes d ND match to s and s match yes do X and G match they do not match but notice something so what we've done so far is what I've just underlined in green is what we have matched so far we know that this this straight the substrate if this part of the pattern string exists in our spinner string starting from this search index so what the campi algorithm does is we want to prevent ourselves from going backwards in this in this string and reverting our progress in matching the pattern so we know that these match so what we need to look for is we look for a suffix that is also a prefix in this character before the mismatch so what we see is that des des D has this suffix is also a prefix what do we know we know that des exists in this string from here we know that the S exists before our mismatch so what the algorithm does is it says hey how far back do I need to go backwards in my pattern so that I can just not revert my progress in the string that I'm matching you the key place that we jump back to is we jump back to this G so we've mismatched x and g g + x did not match and we know that everything before it did match so if we can find a suffix to this substring that is also a prefix then we are able to continue our search from where the mismatches because we know these two match in sequence so if we can find a beginning of the string that also matches in the same sequence then we can just continue our search from after what we know won't match so we know ok we know des will match and then we have a mismatch so that we know if this is a suffix to this the the substring before a mismatch then we know that our prefix is going to also exist in this string and then we know we need to continue our search one two three indices forward from the dns and as you can see as you can see the s does exist in the string because we literally just match that so this prevents us from going backwards in our in our in our search string it allows us to continue our search from the dan the s which is a prefix here and it allows us to just continue from the G so we continue from the G so now we're going to compare G and s now we're going to compare G attacks and we have a mismatch so again what we do is we try to prevent going backwards in progress so we look here we look at the substring of the pattern des is there a suffix that is also a prefix and what we see is no there is not and so what we do is we have to refer all of our progress we have to go all the way back to the beginning of our pattern string because we were not able to find the suffix Isles of prefix that could save us from going backwards in our work so we compared the dfx they don't match so we just continue to make our chests so we find D and D match and then now we're going to go into our matching s and s match gng also match W and W match and any match D and D match s and s match G and Z match and then finally znz match and we found a pattern in our string and our answer is right there so the key is every time we find a mismatch we're going to ask so ourselves we're going to say behind the mismatch so say the mismatch is at Z behind the mismatch behind the mismatch can I find a pre can I find a suffix that is also a prefix so since we know that we matched dsg and then we found a mismatch is Z we know that we can just go back to the beginning of the string we know we managed DSG we know it was matched and then we can just hop we can say let me start matching at W again so it's saving us these checks because we know all of these match up to the mismatch so it would make logical sense that to prevent backwards motion we want to find the characters behind the mismatch that are also at the beginning of the pattern string because we can skip to these checks we've already matched these up to the mismatch we can skip these checks at the beginning of the pattern and we can just continue on to where did it does not match which is after the prefix that is also a suffix this is why prefixes and suffixes matter it allows us to not repeat work in the pattern so we don't have to revert progress in the string so what we do is we built a prefix suffix table so each of these boxes represents a substring so if we see box two it represents the substring from 0 to 2 inclusive so this up string which is the s and G so each of these boxes tells us the length of the longest suffix that is also a prefix from here to here here to here whatever that cell is it indicates the substring we're taking the merge is looking at this the longest suffix that is a prefix up to 5 we know it's going to be 1 because the subjects D is also a prefix which is d but we can't go longer a T is not a is not a match as a prefix what we're going to do is fill out this table so we're going to have to iteration pointers I and J so we're going to compare I and J are they equivalent if they're not equivalent then advance J not equivalent advance J these are not equivalent so advanced J advance J again so every time that we find that these are not a match initially after no match we just put a 0 so what this is saying is for each of the sub strings so for the substring of just deep the longest prefix that is also a prefix is length 0 length 0 for DMS late 0 for d SG legs 0 for D sgw length 0 for D SG w a so that's what these zeros mean it means that our longest prefix and suffix does not exist each of these is telling us if we find a mismatch yet I plus 1 if I find a mismatch here give me information about the prefix suffix data for I minus 1 the substring behind me which is literally what we were doing before that's what this table is about so anyway continuing with the iteration we see D and D match so we do I which is 0 plus 1 into this cell and then what we do is we advance I and J by 1 and now we see that I and J match so we do I plus 1 into where J is so we do 1 plus 1 which becomes 2 and again this tells us so what that 2 in that cell tells us the dsg waes the longest suffix that is also a prefix is length 2 we can see that D as TS so if we have a mismatch here plus 1 if we have a mismatch here if we have a mismatch here we know we're going to have to look at this substrate and when we look at this substring we we've already mounted es up to the mismatch so we can just go des I think we can just match here G which to 0 1 2 this tells us the index to continue our matching gap so that's what this is all about that's a match so now we need to increment I and J so now G & G match so we're going to do PI plus 1 2 plus 1 is 3 so 2 plus 1 is 3 and what is 3 mean let's do this again if we have the substring dsg w80 SG if I find a mismatch after this substrate if I find a mismatch over here whatever that is I need to know about this substrate this is this cell gives me information about this string it tells me these something's the longest suffix dsg that is also a prefix DSG is something like 3 0 1 2 3 so I continue my iteration I continue my matching from dubbing why do I do that I already matched DSG if I have a mismatch here I know I can save that progress I can say the SG is at the beginning I can save that progress this table is all about how do we save that progress so now we match I we need to advance I and J so now we have a mismatch and this is kind of weird but what we do here is we look at I and we look at the value at I minus 1 and that is what I becomes I becomes the value at I minus 1 so it doesn't become the 2 it doesn't become the index at I minus 1 it becomes the value so I become 0 here so D and Z are not a match here so what we're going to do is put 0 so if you look at the whole string the longest suffix that's also a prefix does not exist so it's a 0 if we tried to increment J we're going to run over the array it's going to go over here so this is our prefix suffix table and now what we need to do is we can walk through an example and look at this how is this table going to help us know where to jump back when we have a mismatch so let's see how this helps us alright so now let's go back through the matching I've numbered the indices in our pattern this is going to show us how our table influence how we go back and our pattern by indicee so let's do our comparisons again we compare a 2d no match we compare D to D that's a match s is a match G is a match W is a match is a match D is a match as is a match and then G so we have a mismatch at embassy seven so what we need is we need information we've matched what we've matched is we've matched the substring from 0 to 6 and remember we have information on that we now understand that the longest prefix that is also a suffix the longest suffix s also a prefix that goes from index 0 to 6 the substring of the pattern we know the longest pretty suffix s also a prefix is to the length of it is 2 and notice it is 2 D as the best so right before miss match or mismatch happened at embassy 7 and we know we need information on this so we know now we can start a search from 0 1 2 we can start a search from 2 because the length of this is 2 we know this matches we know DSG mismatch we know that matches come back to the start match match mismatch so we know we need to start from here start from the G so now we've prevented backwards progress in our string s our search string and this is what makes this allow us to run in linear time with respect to the length of s now we compare G and X it's a mismatch so now what we need to know is for my substrate where my mismatch happens here I need to know about the substring from 0 to 1 I need to know what happened right before my mismatch what about this substrate what is the longest prefix slat that is also a suffix what is the longest suffix that is also a rethinks that goes from in x 0 to 1 0 so now we know we need to come back all the way to the start 2d we were not able to save ourselves work now we need to come all the way to the start and notice we're still linear with respect to s we're not redoing work we're not going all the way back this is saving us progress we compared D to X it is a mismatch so we come here T to the SSG w ad sgz that's a match so this is our substrate this is what km p substring search is all about this is not the only linear time substring search algorithm there's also Boyer more rapping Karp these are other substrate algorithms and these others linear time algorithms take different approaches so let's do an algorithm analysis a time analysis on this algorithm so now let's look at each of these steps of this algorithm this algorithm consists of two basic steps which is building the table and then performing the traversal with our table and M we wanted to find them first so Edna is the length of the string we're searching in M is the length of the pattern we're searching for building the table is going to take us the length of P time which is going to be we're just calling that M the length of the pattern so it's going to take us all M time it's going to take us all of M space because we're keeping a track of the longest suffix that is also a prefix up to that point for the pattern string so that is the building table step for the matching traversal step we're going to be doing all of at the time it's we're going to balance by o of n because that is the length of the string we're searching in and we define that up here and then the space is going to be constant because we're not creating any extra space we're just performing comparisons during the traversal so overall this algorithm is linear with respect to each of our inputs and the plus n we get all of em plus after time and we get all of em put for the space and remember M is the length of the pattern n is the length of the string we're searching so this is the KNP algorithm this is actually not that bad to understand when you really understand why we look for prefix and suffix is why does that matter to us so it becomes simpler when you understand that intuition and then from there it's just a matter of seeing the implementation in code if you liked this video hit the like button subscribe to the channel if you're not already subscribed I want to do a computer science interview question video everyday to help people engineers students get ready for the interview just become computers saw better computer scientists and yeah so yeah thanks for watching don't know why you are watching [Music]
Info
Channel: Back To Back SWE
Views: 101,788
Rating: undefined out of 5
Keywords: knuth morris pratt substring search, substring search, substring algorithms, substring search algorithms, Back To Back SWE, Software Engineering, computer science, programming, programming interviews, coding interviews, coding questions, algorithms, algorithmic thinking, Google internship, Facebook internship, software engineering internship, swe internship, big n companies, strings, strStr()
Id: BXCEFAzhxGY
Channel Id: undefined
Length: 17min 24sec (1044 seconds)
Published: Fri Jan 04 2019
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.