Linear-time pattern matching. Z-values and Z-algorithm

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey we're gonna shift to a new topic that's related and it looks in one way to decipher ability and it involves strings and this is a string matching problem and string matching and what I mean by this is that you're given a text which is some string the string of characters and let me give an example well you can really take any example a a b c a b x ay AZ okay which is Superman's middle name or something anyway if you follow those comics and attacks and then you have a pattern for example BCAA and the question is to find all occurrences of the pattern we usually call the text e and the pattern P all the constants of P inside T so in this example BCA a occurs right there and nowhere else was one occurrence but in general it could be several or there might be none for example if you had ba a that's not in this pattern at all so the general problem is you give it a tax to give it a pattern and you want to find all the occurrences or determine that there aren't any then a pattern well the length of the pattern will give that a name later a symbol but and same thing for the length of the tax okay so what's the primitive operation well most of the algorithms that we're gonna think about it most of the Uggams that people have thought about for this problem involve comparing individual characters in the pattern to individual characters in the text so primitive is character comparison and the other kinds of operations that are done in the algorithm tend to be fairly simple relative to the numbers of those relative to the number of character comparisons you have to spend some time and various algorithms deciding which characters you're going to compare and so on but usually that's that's fairly that work is is small in relation to the actual character comparisons so that's a primitive and we'll be counting the number of character comparisons so this pictorially here's the text the text tends to be very long in most applications compared to the length of the pattern and I probably don't have to motivate this problem I mean how many people have solved this problem in practice and I saw that I mean you run an algorithm but you actually did it today okay none you did yeah and what were you doing it would search those word puzzles but yeah word searches are fairly common if you're using an editor and you're writing something and you're looking for some word in your text that's that's an example of this or you're looking up something and you know Wikipedia or whatever or I guess I maybe I should have said how many people have had this done for them today by something you used you know yeah I mean you as Google used whatever it's it's ubiquitous I I just point that out because I think the first time I taught this I don't know how many years ago way before the internet but I was introducing this problem and somebody impressive why would you know why would you ever want to solve that problem that's what's that about it just seemed really straight and then it seemed strange but anyway it's simply a very natural question all right so what's the most natural kind of algorithm you can think of for this you know here we have characters and let's just say well I'll just that many characters in the text and that many characters in the pattern and what is the most obvious way to look for occurrences in the pattern inside the text yeah yeah basically if I must you you put it in a way that doesn't have a particularly ring to it which is fine also but let's just think of it this way so as I put down the pattern first at the left end then I would want to compare the first character in the pattern to the first connector in the text if they agree then I move on to the next one if they agree then I move on to the next one if I end up at the end of the pattern and all the characters have agreed or matched up to that point then I found an occurrence of the pattern in the text if they don't if I get out let's say the third character and it doesn't match it mismatches then I know that that's not a place where the pattern occurs in the text and I can stop I don't have to look further in the pattern for that occurrence and then I move the pattern one place over and I start again at the left end like I start again at the right down there I get start I can look randomly but ultimately I have to look at all the characters as long as they're matching or stop the first time I find a mismatch so the point is that in the most direct kind of way of doing this you would place the pattern down basically at every possible starting position in the text and then for that position you would look at the characters in the pattern to see if they match the corresponding characters in the text and that's you could write that up very simply and it's it's certainly correct okay so this is the most naive algorithm is place the pattern against the text in all possible places and then for each one compare the characters in p2 the corresponding characters in T until either a mismatch or all characters match and again if you if you put the pattern down in a place where all characters match then that served me an occurrence of the pattern inside the text in otherwise it's not an occurrence and if you've covered all the possible starting places at the end it's just that one then you've certainly found all the occurrences of the pattern inside the text now if counting if the number of carry individual character comparisons is our is our measure of complexity what's the worst-case running time of this algorithm well a bound on the worst case running L yeah that's even more precise than when we need to be it's just all I was haven't had in mind was just the length of the pattern times the length of the text basically the number of places you can put down the pattern is bounded by the length of the text and when you've done that for each place you put it down the work you do a number of comparisons is bounded by the length of the pattern okay so this is certainly an upper bound what was suggested here is we could be a little more precise the number of prices you can put down the pattern is the length of the text minus the length of the pattern because at the end you can you can only put it in that one position if you push it beyond there it's not possibly a match but the length of the text is usually assumed to be very large in comparison the length of the pattern so that small modification doesn't change things and then if we say that for every time you put down the length every time you put down a pattern the certainly the worst case number of comparisons is bounded by the length of the pattern but typically it's not going to be that much but still we don't know so this is what we take as the worst case okay so this is a P times T and in fact you can come up with examples where the worst case amount of work in this algorithm is in fact that much okay what's a good example of that well if your pattern and your texts are all just the same character repeated over and over again let's say the text is all A's and the pattern is always so every place you put down the pattern you're going to compete with this algorithm you're going to compare all the characters in the pattern against the corresponding characters in the text and then you're going to move it over but you can find a match and you move it over by one you start again now you think about a little bit in that very special case you can come up with a modification this algorithm which is much more efficient but the algorithm generally doesn't know that you're in that case so and in fact in general we can come up with algorithms that are more efficient than the silly will but this algorithm certainly has theta P times T running time that's because now we've seen an actual example or examples that are we can make them any length we want and have this worst-case running time okay so I'm going to rewrite that it's just a little bit for for my next doll right over here this is P times T okay and let's try out some some numbers here if the text let's say is three billion that's the length of DNA although DNA databases are massively larger than just that the pattern could be in the thousands so you have thousands of billions because of the multiplication here that's you know real money and if you make this algorithm up it again this is a very simple thing to code up you try it out on some non-trivial length strings and really you can you can feel the time you can see the time so that's not desirable 1973 rounded out there maybe a little bit before then I think the the first tooth more Spratt came up with an algorithm that was much more efficient than P times T did something very non-trivial they replaced this symbol with this symbol okay just turn that thing 90 degrees there's a small transformation so a very minor operation the symbols changes the running time from thousands of billions to just millions okay but you can really feel the difference between between those two so knuth-morris-pratt showed that you can find all occurrences of the pattern inside of the text in time proportional to the length of the pattern plus the length of the text and as we've said this before this is this is linear time linear time means that's just proportional to what it takes to read in the data in the first place you have to spend something on the order of the length of P just to read in the pattern and something on the order of the length of the T just to read in the text and yet you can find all the patterns on the occurrences of patterns in text in that time that's really quite amazing now that the knuth-morris-pratt algorithm and since then several others many others achieved that running time and knuth-morris-pratt is is what's often in textbooks not this textbook that's strangely the textbook we're using doesn't talk about pattern matching at all so I'll put some not have put some notes already on the class website that we can use in place of the textbook because most algorithm textbooks talk about pattern matching and some with some approach usually couse most pratt but this book doesn't at all anyway we aren't going to talk about kenneth more spread because i think it's a little too complicated we're going to about a simpler approach that I call the Z algorithm and this typo stood notes on the class website about the Z algorithm so in fact the Z algorithm is a little bit more general than pattern matching the Z algorithm is going to solve a problem which will allow us to do the pattern matching problem in this time bound but it actually does other things as well so I have to define what Z is what Z values are and then we'll get back to pattern matching and then we'll go on to look at it at the Z algorithm itself in detail so this is a definition given a string s so s is just going to be a generic string at the moment I won't tell you whether it's the pattern of the text but given the string s Zi of s is defined to be the length so it's a number of the longest substring in s starting at position I at position I that matches a prefix of s okay prefix is a substring at the beginning of a string suffix with a substring at the end of the string to be talking about suffixes last time now we're talking about prefixes okay so let's just do some examples to make that definition concrete a B now a ABC a a example to be able to remember that a a b c a b x okay okay so Z 2 I'm going to drop them the explicit s in here this is just extra writing what is Z 2 just look at the definition given a string s Z now in this case 2 is the length of the longest substring and s starting a position to that matches a prefix of s so here is position 2 it matches the patches initially it matches prefix a it's just will have two two fingers for this on two different hands so two matches one but then three does not match two so z2 is equal to one right and one z3 to be zero zero I hear - I hear zero what is it zero zero the length of the longest substring of s starting at position three that matches a prefix of s so you start at position three there's one two three I guess I didn't tell you that we're starting indexing at one okay not at zero eight nine ten eleven I wouldn't I wouldn't have to even say that except for the fact that you all program in C or some other languages were they start at zero for some absurd reason but any rate we're starting indexing at one because we're at the moment saying and so in position three is here and the substring that starts at three the one is substring starting at three that matches a prefix well it doesn't match so that's length zero okay z4 is zero what about z5 we anybody on the back three three okay because a matches a a matches a B matches B let's see an X don't match so that's three okay why didn't I talk about z1 by the way its what no prefix or you could say the lungs created the longest substring that starts here that matches a prefix is itself okay so this is just going to be the length of s and so it's it's predictable it's trivial it's not interesting okay so we have these definition of what a Z value is okay now let's try to relate this ultimately I'm going to show you an algorithm for computing the Z value for every position and that algorithm is going to is going to run in time proportional to the length of the string yes okay so it's going to be linear time in linear time of the length of the string s we're going to be able to produce the Z values for every position the question right now is question if you have an algorithm to find all the zi values how do you use it the pattern-matching question how do you find how you use the algorithm it has the Z Artemis to find all the occurrences of a given pattern inside of a given text yeah that's that's closed it's closed not quite right anybody else have a suggestion okay the suggestion was that you create s equals P followed by T so it's the pattern concatenated with the text okay and then find all zi values and s okay and any Zi well you said equal to the length of the pattern okay so this one this is what was suggested Zi equal to P identifies an occurrence of P and T okay is this correct complete and correct or yeah greater than along just greater than or equal Yeah right so this actually is true that any Zi the equals P doesn't notify an occurrence of P but it doesn't necessarily occur identify all of them so you want it to be greater than or equal to P and you can just sort of seen this pictorially or just from the cartoon if we have a if we have here a position where let's say it's position K position I Zi is equal to length of P plus 10 something like that it means that from here in the length of P that matches that because P is at the beginning of s and then the next 10 characters also match but the at the main point and so Zi at that point is is bigger than P but the main point is that the first P characters number character is equal like the P definitely also matches P and so this finds all occurrences now sometimes what people do is when they think of problems of this type just to facilitate the algorithm and facilitate the thinking about it and slightly speeding up things is that they'll put a character let's say a dollar sign between the two concatenated strings where this character is guaranteed not to be in either of these strings so you you put some character that's outside of the normal alphabet that's used in the pattern of the text so you know if we're talking about strings of DNA then that's a four letter alphabet and there are lots of characters you can you can use but anyway you have available to you with some character that's not in the normal alphabet that's used and then if you did that how does that change this statement well I just let's break it down is this statement still true there's a statement still correct yeah it is still correct but it can be sharpened a little bit which is it should now just be what you originally suggested which is that Zi equals P the point is that nothing in teeth this character is not in C so you can't ever have a position that matches the prefix for more than the length of P because the comparisons out here would run into a comparing that character and that character is not anywhere in T so if we put that in there then we can make this equality but either way they both statements work all right so if I can find all the Z values in time proportional to the length of s then I can use this idea to find all the occurrences of the pattern P inside of T in time proportional to the length of P plus the length of T okay so if we can find a Z if if the Z algorithm that's coming up actually finds all the Z values in time proportional to the length of the string that's operating on and we use this idea P followed perhaps by a dollar sign right followed by T then we can find all occurrences of the pattern inside the text in this time bound okay now it'll turn out to the Z algorithm is useful for other applications as well but this certainly shows that a linear time algorithm defined all the Z values can solve the pattern matching problem in linear time you give a solution that solves the pattern matching a linear time by the way what would happen if instead of forming s equals P followed bytes suppose you reversed that and you know s is equal to T followed by P would that work for for solving the pattern matching problem No so cross it out it's not symmetrical we're looking for the pattern inside the text and it works to make the pattern the prefix and not the other way around okay so now I want to show you the actual algorithm that will find the z values all the individual Z values in time proportional to the length of s and I need some additional notation so the Z algorithm okay let's say that this is a position where Zi is greater than zero so what does this mean it means that starting from position I there is some there is a string a substring that matches a prefix why how long is it oh it's that long okay I don't know how long that is but that definitely matches a prefix of s so these two match so if I give that name alpha that string inside their alpha this is also an occurrence of alpha right and by definition this is the longest substring that starts a position I that matches a prefix so if this is connector x the next one after that occurrence of alpha is x what do we know about the character here it's not X that's that's all I can say I'll call well I say where Y is not equal to X because this was the longest substring starting a position I that matches a prefix okay so I'm going to have drawings that have little boxes like that to represent the actual substring that that matches and give that a name of a Z box okay all right so Z boxes if we look pictorially what's happened in the algorithm the algorithm gonna found some of these sub strings that match prefixes and those will define Z boxes so we'll have pictures that have those little boxes in there and that's the name now I should have told you that the algorithm is going to compute the zi values from i equals 2 up to the length of us so the algorithm computes Zi from I equals to 2 I equals the length of s so in order so Zi 2 followed by Z 3 followed by Z 4 and so on alright so as the algorithm proceeds it's finding Z values in order and it's finding Z boxes and so I want to define a variable programming a programming variable that the algorithm uses as the algorithm proceeds the name of the sphere but our that's a variable u the rightmost position in any sea box found so far so the algorithm is proceeding let's say this is we're up to this point if this is the picture of all the Z boxes that have been found so far this is our it's the rightmost one it's the rightmost position any rightmost position that's in any of the Z boxes that we found so far then the algorithm might proceed and it may look at if this position Z and compute Z K and see that that defines the Z box that goes out to there in which case the programming variable R changes from this position to that position okay so all right just you know what's what's the farthest position to the right that is included in any of the Z boxes that we found and so you can incorporate that into the into your algorithm and then L is the left end of the ZBOX whose right end is our so in this picture if we found these these two z boxes and this is the current R then this would be the current L okay so the left end now I actually I think at the level that I'm going to be talking about the algorithm we're not going to need L but if you're ever going to really implement this algorithm and program that you'd want to keep alive both an L and an R so as the algorithm goes goes along it finds additional Z boxes and whatever z box it finds that defines are the end of that the left end of that defines L okay as the algorithm goes along so we have those two variables that are being changed as the algorithm goes okay so let's get into the core of the algorithm now that we have our notation set up the first thing is we want to compute Z to okay and that's going to be done in the most explicit and naive way direct way so here we have position one here we have position two and remember what is it we're trying to do we're trying to find the length of the longest substring that begins a position - that matches a prefix of of the string yes so we just do the two fingered approach which is to look yeah one finger goes at position 2 1 2 position 1 and we ask whether those are equal if they're not equal then Z 2 is equal to 0 but if they are equal then we move over then we ask whether the character in my right finger matches the character of my left and so on when we keep doing that as long as they keep matching until we find finally a mismatch ok there's no text there's only a string else ok we showed over here that if you have an algorithm that can compute Z values you can then use it to find the all occurrences of a pattern inside of a text but right now we're just talking abstract lis about finding Z values in a in a string yes did that answer your question ok so a a a a B X let's say this is the string 3 4 5 6 so the two-fingered algorithm at this point character at position 2 matches the character position 1 and then the character of 3 matches 2 and the character of 4 matches 3 and then the character of 5 does not match 6 so Z 2 is equal to 4 a simple homework exercise is that the value of z2 is actually equal to the number of identical characters at the beginning of s you can convince yourself of that that you didn't really have to do well effectively you are doing it but you didn't really have to do this two-fingered thing when you want to compute Z 2 you can just say ok the first character is a and how many additional characters are equal to a and that's what Z 2 is you can convince yourself that that's correct ok but certainly we can always go back to the explicit two-fingered approach of actually looking at the compare looking at the characters and for any position we want to compute a Z value we can always do that if I want to compute ZK out here I can always put one finger at K a one finger at one and then do those comparisons but we don't want to do that because yeah question oh three I'm sorry yeah one two three see two equals three okay now how many character comparisons did it take to compute Z to in worst case what can I say for a bound on the number of character comparisons it takes to compute Z - okay Steven won't be a length of s because if they're all the same character again the length of s all right now the element of I'm looking for the argument claiming exists is going to have a runtime battered by the proportional to the length of s and just to compute Z - we've already used up s so I'm really cooked right what's the answer of that so in this case just to compute Z - I've had to use the number of character comparisons equal to s and yet I'm claiming that the whole algorithm will only use a number of character comparisons proportional - what can you conclude from those statements yeah well if Z 2 Z 2 is equal to s and they're all the same character and so you know that this evaluates each one after Z 2 is just going to be one less well that's true very closely and saying if we really did spend an ass on the first at the beginning we wouldn't have to spend much for the rest of it because we were pretty much finished and that's true we won't even have to be that clever the only answer I was looking for is yes we spent Assam to compute Z 2 but the advertised time bound for the whole algorithm is proportional to ask so I've got several lessons left over to use okay but it cannot be that I do this at every position I can't use the two fingered approach at every position because if I'm counting s as my worst case bound for computing each Z I value then if you do that times and we're talking about s squared not s okay so I better do something that's more clever so even though the beginnings of the algorithm are very direct very explicit comparisons we better be more careful and clever as we go on all right so let's just say we have going on and we've already computed ZK so we've computed Z to see 3 all the way up to Z K and now we're trying to compute ZK plus 1 so there's going to be a couple of cases that we're going to look at that are going to relate K and R so case 1 let me make sure these are the same I'm using notation of what's in the notes yeah case 1 K is bigger than R so that means that there's no there's no Z box that goes as far as okay I did yeah we're trying to compute Z K so we have computed ZK minus 1 then we're trying to compute Z K right so if K is bigger than R that means that the rightmost Z box is a box it ends farthest to the right does not go as far as position K so K is strictly to the right of our and what we do in that case is explicitly find or compute ZK by the two-fingered algorithm thing finger algorithm okay so in that case we're back to what we did for z2 that doesn't look so good because you can't do this every position because then you're as I was mentioning your time down is going to be s squared naught or your else but anyway it's certainly clear that when K is bigger than R if you do the explicit comparison approach the algorithm is correct because you're just looking directly there's nothing there's nothing tricky going on there so there's no no need to to worry that's case one peace - is that K is less than equal to R I drew a picture like that here's a Z box that defines our and we're back looking at position K and we're trying to compute Z K one question that often comes up when I start talking about this case people are a little confused how R got to be how did I get to be bigger than then okay okay well the point is that you are got to find when you are computing Z what subscript here in terms of our notation we've been talking about what oh yeah back when we were computing Z L we found that the longest substring begins a position L that matches a prefix went all the way out to R so we know Z L at that point but we don't know all these other Z values in between and then we're going to compute them one at a time and here we are now looking at Z K and it hasn't yet caught up to R so certainly K can be less than or equal to R okay so we're in that case let's draw the picture a little bit this whole string is going to be called alpha and by definition there's got to be another alpha at the beginning of the string this is a substring that matches the prefix of the we have an alpha both of these places we've talked about this before the next character over here is X next character over here is y where Y does not equal to X Y it's not equal to X okay all right now just to to facilitate the the argument I'm going to call the substring inside here beta so alpha is the whole thing okay alpha is this entire substring and beta is the substring that begins at K and run star okay so beta is the substring beginning at K letting tar and there is another occurrence of beta in this picture yeah at the end of that output beginning right because alpha is equal to alpha so this has got to be another beta okay and you can figure out if you're actually trying to program this you can figure out what this index is if you know what R is you know what L is you know what K is you can certainly figure out what this index is okay I'm just going to call it K Prime without getting into all the messy details of how you would actually compute what K prime is but it's it's the beginning of this occurrence of beta where K is the beginning of this occurrence of beta and we can certainly figure out what K prime is from all of all the values that are known all right so in case two I'm claiming that claim now let me just go to a sub case case two a Z K prime is less than the length of beta okay so starting at K Prime the length of the longest substring that matches starting at position K prime that matches a prefix that length is less than the length of beta so let's draw that picture beta up here okay so this is K prime plus ZK prime plus or minus one who knows how to do arithmetic anyway this is length here the length of this thing is ZK prime okay the length of this is length equals Z K prime case to a claim in this case ZK equals Z K Prime well I'll just very quickly and I'll come back to this next time here's beta what's the we're looking for the longest substring that starts at position K that matches a prefix okay ZK prime has to match it prefix I mean that whatever this card Gama there has to be an occurrence of Gama there right by the definition of ZK prime this is gamma this is gamma let's say this one is the next character is P what's the next character over here q not equal to P over here this is beta this is beta so this must be gamma inside there okay what's this next character he it's P okay so starting from position K we have a substring gamma that matches a prefix the next character over is P which we know does not match the next character over because that's Q 2 is not equal to P so there we have explicitly that ZK is equal to Z K Prime okay ZK equals Z K Prime so in this case to a how many character comparisons do we use to compute Z K this is the audio must compute ZK it's finds that K is less than equal to R it figures out what K prime is it looks up what K prime is but no it knows what the length of beta is it compares the K prime to the length the beta finds we're in case to a compute CK conclude ZK is equal to Z K Prime and how many character comparisons did that use 0 so it did some arithmetic but a constant amount of arithmetic independent of what positions we were it use the lookup but only one that's constant so this is a really good case where we do 0 character comparisons and if you can do that for every position then the whole algorithm will run in time 0 but ok it won't happen every position alright next time we'll finish the algorithm the fantasy analysis etc so
Info
Channel: UC Davis
Views: 17,169
Rating: 4.9542856 out of 5
Keywords: algorithm, pattern, matching, Z-algorithm, Gusfield, UC, Davis
Id: MFK0WYeVEag
Channel Id: undefined
Length: 51min 46sec (3106 seconds)
Published: Fri Sep 30 2011
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.