Algorithms Lecture 19: Dynamic Programming, Longest Common Subsequence and Longest Common Substring

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
the two problems that we will be looking at today are the longest common subsequence and the longest common substring so last time we introduced dynamic programming the question is what how would you describe dynamic program what characterizes the dynamic programming algorithm a technique how would you characterize it yes so the idea is to solve smaller subproblems first starting with very easy base problems of base sizes and another key idea is storing the results of the solved subproblems in a table and what's the purpose of storing those in a table yeah reusing them instead of recomputing them that can save a lot of computation time as we have seen with the fibonacci example you know if you recompute that can result in huge explosion in the number of operations but if you just store the results of the subproblems and reuse them that can save you a lot of computation and the overall running time is going to be much better okay now let's introduce the longest common subsequence and the longest common substring so basically if you have a string two strings string X which is a b c d e and string Y the the a e now a common subsequence of these two strings is any sequence that is common to both that appears in both and the sequence doesn't have to be contiguous so a common substring is obvious right you find a common substring like here and what's the longest common substring between these two de de de well D de so is V de a-- longest common substring or a longest common subsequence subsequence yeah so b de is a longest common subsequence subsequence so in a subsequence we can skip we can skip characters the sequence does not have to be contiguous while the string has to be contiguous so in this case in fact this could be a bad example do we have a a common substring that is longer than one no I didn't think so but you know for example you know if we have you know CD for example or here we have there so we can we can add for example if we add C here and C here then this will be a longest common substring so the thing has to be contiguous but the sequence does not have to be contiguous now which problem will we try to solve first using dynamic programming first we will try to solve the longest common subsequence so given two strings try to find the longest common subsequence where you can skip okay so now if you know before we start talking about the dynamic programming what would a brute force solution to this can you describe a brute force solution each what each what I've got the Shorter's so you'll have look at eggs you find all possible sub sequences so how many possible sub sequences are there if there are M no it's not the enemy M factorial what's the total number of sub sequences no it's not M fact given that X has m characters or in symbols how many sub sequences will we have yeah to power my yeah because each character is either taken or not taken in the subsequence so the subsequence is just a subset a subset of these characters right so you have to power M so this is not a good start at all so this is already exponential but this is not enough right we'll have to find all possible sub sequences and then for each sub sequence we'll have to find it in the other string to see if it's there or not and then we take the longest so obviously this is going to be you know at best exponential in fact it's more than exponential because you'll have to do you'll have to search for each subsequence in this string now let's think of this in terms of dynamic programming in terms of smaller subproblems and larger subproblems so I think the best way to think of a dynamic programming solution to a problem is to assume that you have solved the smaller subproblems and see how solving smaller subproblems can help you solve bigger problems okay those cases are not a problem because you know you can always find easy base cases like in this case if the things are empty then the longest common subsequence is empty that could be a base case or if each string has one character then the longest common subsequence is what it would be that characters it's the same and it would be empty if that character is different anyway so you can easily find base cases but what is not easy to find is thinking and you know dynamic programming where you assume that you have found the solutions to smaller subproblems and you are interested in finding solutions to bigger problems so for example what if I have these strings a b c d and a c HB d now let's assume that you know this is string x and this is string y and at this point you are trying to find the longest common subsequence between and then there are other characters but this is character I and this is character J so now you are trying to find the longest common subsequence between X sub I and Y sub J now the way you think of this in dynamic programming is to assume that you have found the solutions to the subproblems so the sub problem here is for example this alright you assume that you know the longest common subsequence between so this is I minus 1 and this is J minus 1 dynamic programming wise you will assume that I know the solution to this and the solution to this all right so now what's the solution to the big problem well in this case obviously through the solution to the sub-problem is what so you have ABC what's the solution a B or AC a B or a C so the solution here is not unique there are multiple solutions multiple optimal solutions so for example let's assume that it's a a C now we assume that this is known dynamic program you assume that this is known given that we know this what's the longest common subsequence now between X I and widely so now we can add D so we have you know the longest common subsequence is AC now we can add D so this is clear you know if you are talking if you are at index I in string X and at index J in string Y then you can look at the subproblem that you have sold previously and you just add this character to it under what condition they're the same so if they're the same if this and this are the same then we just add the character to it so we can express this as a longest common subsequence of I and J equals is more space longest common subsequence of I and J equals the longest common subsequence of I minus 1 J minus 1 plus 1 so earlier is the length of the longest common subsequence is the length not the actual characters this is the length of it is plus 1 if what if X of I equals y of J if X of I equals y of object but now what if they are not equal so let's look at this example X I equals a b c d and YJ equals ba da now well we assume that we know the solution to this sub problem in the sub problem of the longest common subsequence between this and so this is I I minus 1 this is J this is j minus 1 now if these are not equal what's the length of the longest common subsequence between X I and YJ is it going to be just the you know X I sorry is it gonna be just the solution to this sub problem the solution which is longest common subsequence of I minus 1 J minus 1 okay so now let's look at this let's look at the longest common subsequence between I minus 1 and J minus 1 what's the length of this subsequence so in this case you have a a B so the length is gonna be 1 right between this and this what is the length of the longest common subsequence between this 4 character string and this 4 character string is the length one here is the link length one what's the length of the longest common subsequence between know between the 4 and 4 4 2 2 between so in other words we cannot say that the longest common subsequence of I and J equals longest common subsequence of I minus 1 J minus 1 if X of I is not equal to Y and J + 2 y ab J so this is wrong right this cannot be right because if this is true then it will give us here 1 while the right answer is 2 so what should the right answer then be how can we compute the right answer in this case think of solutions to subproblems so the subproblems do not necessarily have to be you know are X sub I minus 1 and J minus 1 think of other subproblems yeah exactly so you can think of the longest common subsequence between I and J minus 1 and the longest common subsequence of I minus 1 and J these are two subproblems that are relevant so I and J minus 1 this is going to be what this is going to be these for this is I and this is J minus 1 what's the length of the longest common subsequence here - so this is this and this so this is I and J minus one now what is what if we do it the other way around what if we do J and I minus one it's gonna be the same so this is J and this is I minus one what's the longest common subsequence one so now this is not true if X of Y is not equal to i FJ the longest common subsequence is not going to be the longest common subsequence of I minus 1 and J minus 1 so this is wrong but what should it be what should we write here what's the how can we get the right answer generalize this X of this - yeah the maximum of these two the best of these two so looking at I minus 1 and J minus 1 would not be enough if we look at this and this which will give us 1 then we look at this with this and that will give us 2 2 is better than 1 so we will take this so to generalize this longest common subsequence of I and J is the maximum of longest common subsequence of I minus 1 and J and longest common subsequence of I and J minus 1 if what X of I is not equal to I of J so we have two cases if X of I is equal to I and J that's easy we extend the previous and add one if they are not equal we look at these two possibilities and take their the best and of course we need a base case so the base case is easy so if one of them is empty then the longest common subsequence is empty so this is an easy base case so this is going to be 0 if I equals 0 or J equals 0 so we start numbering from 1 here so this is going to be 1 2 3 4 so if one of them is empty then this is a base case the longest common subsequence is going to be empty ok so this is this is the algorithm so let's write the code for this the code is going to be the heart of the code is going to be this this formula which we call the recurrence for this for this dynamic programming solution so basically the key idea in any dynamic programming solution is how to define the solution to a big problem in terms of the solutions to smaller subproblems so that's the key idea in any dynamic programming solution so now well yeah let's well let's let's go through an example first let's go through an example so so the example is here and then we have X is a b c a b c d D this is X so this is X of I and this is y of J so J goes j j is the column number ABC b d and this is b the see a BDC a B so this is gonna be our table obviously here do you know that the cell T of I and J is the longest common subsequence or the length of the longest common subsequence between X I and YJ X I is the string through I and YJ is the string from the beginning through J okay so in the beginning we can fill this throw with zeros because when X is empty this is going to be empty then now we have a now we can also fill the you know the first column with zero because when y is empty that longest common subsequence is going to be empty now when we go through each cell and we compare the two characters if they are equal we apply this if they are not equally apply this so these are not equal so we'll apply this so where is longest common subsequence of I minus 1 and J I minus 1 and J that's gonna be above the row above so I minus 1 is the row above same column this is same row the column before so we look at this and this these two guys and take the larger now if they're equal we'll have to systematically pick one of them so let's say it doesn't matter you know if they're equal you can take either one let's systematically take the one from the previous row they are equal okay so now this is going to be 0 again not equal we look at the row above and at the column before and take the larger they are the same so we take the row above and that's going to give us what 0 same thing here because they are different so we'll get the same thing now they are the same when they are the same we apply this so where will we find n of I minus 1 and L of G sorry longest common subsequence of I minus 1 and J minus 1 so it's going to be diagonal right yeah so it's gonna be diagonal this so we look at this and we add 1 to it okay now these are different so again we compared the row above with the column before now which one is greater the column before so we take the one now these are equal so it's diagonal because they are equal and that's going to be one different which one is greater the column before so this is going to be one different again this is greater so this is going to be one different so both are equal right so we take one of them so let's systematically take the row above equal we go diagonal and that's going to be two okay now DNC different so we take the one different we take the one similar diagonal we put two different we take the two different we take the two similar diagonal different we take the one different we take the two different but these are the same so we take the two similar we take the abinell this is three okay last row different we take this similar we go diagonal different two and different two and different we take the three so this is how we fill in the table so now we know the answer what's the length of the longest common subsequence three we know that the length of the longest common subsequence is but now which characters did we take now we have to do something to some kind of trace back like we did with the with the zero-one knapsack problem trying to figure out which one's got taken and which ones did not get they did not get taken so in this case what indicates whether a character got taken or not yeah whether it's diagonal right so we can store these you know diagonal or not diagonal we can store this but without storing it we will know right so let's assume that we that in this case we store you know we store in each cell whether we went diagonal or we took from the previous row or the previous column so here since we took this from three right so we just go up we did not pick the D we go to the previous row so since this is not diagonal we did not pick the D now here there is diagonal so we picked the B then that leads us to this square now is this diagonal no it's not diagram in fact you know we can you know without storing diagonal or not diagonal how can we tell by the way you know whether it's diagonal or not how can we tell without storing compare the characters right so for example in this last column we know that it's not diagonal because this is a D and this is a D so if the corresponding characters are not equal we know it's not diagonal anyway so now we are here and the characters are different this is an A and this is a C so this cannot be diagonal so we'll go to this so this means that the C did not get taken now this is diagonals what does it mean the C is in okay so now we go to this again B and D are not the same so it's not diagonal so we're going to the one so we go to the one now D and B are the same so we are here then we go to diagonal to this and now we are at zero so we know that well you know that this is the base case so we just stop if we reach a base case then you know we can stop okay so now we know that what the longest common subsequence is B BC B okay so this is based on what we have taken okay okay and in you know in many cases the longest common subsequence will not be unique there may be multiple longest common subsequences so is it clear how we trace this so basically the diagonal whether we start diagonal in each cell or we just compare the two characters they are the same then we take them if they are not the same then we go to the row above or the column before depending on which one is greater and if they are equal we can go to either one okay now let's translate this into code longest common subsequence of X and Y so let M equals X dot length and M equals y dot length then we create create a two-dimensional array call it t with M plus 1 M plus 1 rows and n plus 1 columns is this problem symmetrical so what if we put X in the columns and Y in the rows will it still work yeah there are just two strings right well they are indistinguishable to just two different things you can treat either one as X or Y it doesn't matter so we create a two dimensional array then fill the first row and the first column with zeros okay then what kind of loop would we have here this is a two-dimensional array so we go so we have a nested loop right one that goes through the rows and one that goes through the column so just a two-dimensional array so we'll go for I equals 1 to M and for J equals 1 to N in fact we just put that formula that we that I have just erased so if what X of I equals y of J we apply the rule of T of I and J equals what is not X T I minus 1 and J minus 1 plus 1 okay else if they are not equal then T of I and J equals what maximum of T of I minus 1 and J T of I J minus 1 and that's it for building the tape okay of course you know if we were to expand this max what would this max expand into if we were to expand it what would this translate this Z equals max how would you if you were to code this how would you code it yeah so it's going to be an if-else so if this is greater than this take the answer as this otherwise the answer is that okay so in fact the details of coding this into an if-else will determine if when they are equal it will determine what will happen when they are equal right so if you say that for example if T of I minus 1 and J is greater than T of I J minus 1 then T of I and J equals T of I minus 1 and Jay else T of I and J equals T of I J minus one now if we if we implement it like this what will happen if they are equal to the else we'll execute the else and the else goes to the previous row or the previous column previous so this is the same role previous column so this is previous column so in fact if we implement it like this it will be the opposite of what we were doing we were going to the previous row so here we'll go to the previous column no no this is still there so we are replacing this thing we are expanding this so this is this F else is there as an expansion of this so this F is still there this is still there we're not touching this so we're only replacing this line with this with these three or four lines okay in the library or how we choose to implement backs well you know max is you know too easy to require a library or just you can implement it yourself you know you can implement the max yourself because it's trivial so if you want to control it you write your out your own next so if we want to take the previous column it's like this but it's going to take the previous row you can for example do this if you do this then the Equality will go to the previous row right so you control but what are correct right when they're equal you can go either way now what's the running time of this this is theta up and now is this polynomial so this is this like the knapsack or it's how is this like the knapsack are m and n are the numbers or their actual sizes of objects that you are passing as input so there are actual sizes right so there's the size so the actual input is going to be in a string so it's going to be this thing so this is the input this string and this string so m and n are just the sizes of these so the input has N and the output has sorry this dimension of the input is M and this dimension is n so the mathematical relation between the input size and the number of iterations is linear this the number of iterations of this loop is a linear function of the input size which is the number of these characters and here vary mathematical relation between the number of iterations and the number of characters in the input is a linear so this is this is polynomial okay so now what what remains is writing code - no no it's nm so it the input is defined by two parameters so there are two strings at the input so it's n M so it's it's a function of two variables it's a function of two variables but it's polynomial it's a polynomial function of these two variables we don't have no no no it's not so this is not linear it's not linear it's not a linear function of the overall input it's just it's linear the number of iterations of this is the linear function of this but this overall running time is not linear thank you for asking so this is not linear but it's polynomial the overall running time is not linear but it's polynomial now we need to write code for tracing back to tracing back to identify the the characters that got taken so now we have we have we'll have to start from the last squared right like we did here so we just say okay so find characters or not find characters not it's not a good name here construct this tree construct the longest common subsequence and this should run after this so after we construct the table we construct the string by starting from this square so we set I equals M and J equals n what does it mean it means that we start from this then when do we stop in fact we keep going as long as as long as we have you know both of them are greater than or equal both indices are greater than or equal to one so while I is greater than or equal to 1 and J is greater than or equal to 1 now if if what so what did we check on if the two characters are equal so if X of I equals y of J what do we do we take it yeah exactly take X of I or put it in the string add it to the state or let's say concat or add X of I to the string so the longest common subsequence then what then we have to go diagonal how do we go diagonal exactly so I - - J - - now else if if what now we have to do this if T of I minus one T of I minus 1 and J is greater than or greater than or equal whatever T of I and J minus 1 then we go to that previous row which means what yeah I - - else J - - and that's it so what does it mean it means that if we compare the two characters if they are equal we take that character and we go down if they are not equal we see which of these is greater if this is greater we go to the previous row but we do not take any characters so we just move now the question is what's the running time of this algorithm so look at that look at this table when we traced back what's the worst-case asymptotic complexity of this M so M is this it's more than a mean obviously so we don't go straight so we don't go out this way or we don't go this way so it's greater than M and greater than n but is it MN and person yeah exactly so this is M plus n okay so it's in fact here it's you know it's it's easier to say it's Big O of M plus M so we know that it's not going to exceed we know that's not going to exceed M plus n why because in each move that you make you either increment M or you increment or you either decrement I or decrement J so in every step every step that you make you either decrement I or J or both so this is going to be all of M plus N it's not going to exceed M plus n okay you are guaranteed to decrement at least one of I and J okay so this is the the algorithm here and this is how we construct the longest common subsequence okay any questions so let's talk about the longest common substring in fact you know the dynamic programming solution to the longest common substring that we are about to describe it's not the best algorithm for finding the longest common substring the best algorithms for longest common substring can be developed using the what we call the suffix tree so suffix trees can be used to develop or to to to develop algorithms that are much more efficient than the alga the dynamic programming algorithm that we are about to describe but we will study this algorithm anyway as an application of dynamic programming but just I want you two to know that this is not the most efficient so in terms of longest common substring again think of the problem this way I have say ABC and here this is X this is y this is ABC and I have ABC here and then I hit x and y so this is I and this is G now I'm at index a I in X and at index Y in J now clearly you know if I have this substring ABC clearly if x equals y then this is going to give me a longest common substring of size four so if x equals y here then this will get concatenated to both strings and it will give me a longer substring of size four so we know that if you know if x equals y then we will get something like this in fact you know I'm going to keep the same code because it's going to be the same so what this will be different so now longest common substance for substance but now what if I and J are not equal now the difference between the longest common substring and the longest common subsequence is that in sub strings I cannot skip so assume that x and y are different characters now what if I have after this for example a b c d and a b c d now the question is since x and y are different will the longest common substring between these two strings will it include this ABC in it no because there is a discontinuity here remember the string has to be contiguous and if x and y are not equal then basically when i hit two characters that are not equal i will have to start over so this is the best that i could construct up to this point and then anything that comes after x and y i would have to start over i would have to start building a new common substring i cannot I cannot concatenate to this because I already have this this disconnect you know these two characters that are not the same so in fact this simplifies the problem if x and y are equal I will do this and if x and y are not equal then the length of the longest common substring that ends at this point is going to be what well it's you know the length of the longest common substring that ends at I minus one is this so the longest common substring that ends at I minus 1 and J minus 1 is ABC but at I and J there is no common substring in fact so I have to start over and I have to start looking again for similarities so in fact what I can do here if they are not equal then T of I and J equals 0 what does the zero mean it means that when I hit two characters that are not the same I would love to start constructing a new longest common substring I cannot build on the previous longest common substring because this thing must be contiguous okay now this will make more sense if we if we do an example okay so let's use this example DC ABC let's use this example a b c d e and this b c b c d AE and now I'm gonna we can just clean it up now it will make sense when we do this concrete example okay so now and I feel that you know obviously when one of them is empty then I don't have a longest common substring I don't have anything now not equal just 0 not equal zero zero equal I go there right so equals the same we do the same that we were doing with longest common subsequence so it's going to be diagonal not equal simple 0 equal go diagonal 1 not equal 0 not equal 0 not equal 0 not equal 0 so there will be lots of zeros not equal 0 equal what should I do that not equal 0 0 0 not equal 0 0 equal diagonal oh sorry I put 3 here should I put what do I need to put the length so this is going to be what 3 so these diagonals are going to point you to the longest common substring not equal 0 it not equal 0 not equal not equal it's not equal and II not equal equal diagonal okay now where is the longest common substring so where is the long is it in the last square no so the longest common substring the answer will not be in the last square not in this square in fact it will be the square that has the maximum so I just keep track of the maximum so in this case you know I I just keep I keep track of the maximum so for example I define maximum I equals zero maximum J equals 0 and maximum long maximum longest common subsequence equals zero and then when I get this hit I do I keep track of the maximum so what should I do here so let's let's move the else so this is else T of I and J equals 0 now what should I put here Mack's previous backs and exactly yeah max of previous max longest common substring and yeah T of ing so I just keep track of the maximum well in fact I have to update I and J if this changes so it's better to do it with an if statement if yeah so let's do it with an if statement so if so that I can change if T of I and J is greater than the previous max common subsequence then the max is now equals to T of I and J and our I equals sorry our max I equals I and our max J equals J so we just mark the I on J so this way when we go through this our maximum will be this right now how will we find the string will that be easier or harder than the longest common subsequence to find the actual string why is it easy how can we find it exactly yes so the length of the longest common substring is three so in fact I just go three characters back in either string so I either you know I take these three characters and they should be the same so from this point you know from this D I go three characters back d-plus you know to other characters so since I know that the length is three I just go back by three characters in either string so this should be trivial to implement now what's the running time of this algorithm what's the running time of this yeah so this is MN same thing it doesn't change and the running time of constructing this thing obviously it's not going to add because to construct the string if you mark the maximum then you go back by the length of the string which will not exceed N or M so it will be less than n and less less than or equal to n and less than or equal to M it has to be a string in both substring of both so we will go by a number that will not exceed N or M which means that it will not add to the asymptotic complexity the symbolic complexity will stay n m okay so any questions on the longest common substring problem okay again you know this is a good application of dynamic programming but this is not the most efficient algorithm for finding the longest common substring between two strings
Info
Channel: Ghassan Shobaki Computer Science Lectures
Views: 6,149
Rating: undefined out of 5
Keywords:
Id: L1fUKbgyAk4
Channel Id: undefined
Length: 62min 32sec (3752 seconds)
Published: Sat Jan 05 2019
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.