Count min sketch | Efficient algorithm for counting stream of data | system design components

Video Statistics and Information

Video

Captions Word Cloud

Captions

in computing Kahneman sketch is probabilistic data structure which can be used to count the frequency of the events on the streaming data it gives us multiple hash function to map these frequency on to the metrics you can argue that you can use hash tables in place of comp in sketch to count the frequency much more efficiently but I'll tell you this particular algorithm uses very less space or you can call it as sub linear space to count the frequencies but the only disadvantage is sometimes on really rare cases it all counts the frequency because of the hash collision my name is Lorraine and intersession let's learn how comments get works and also learn where we can use common sketch in our systems so you must be thinking what is sub linear look at this graph to understand more so the definition of linear as you know that as the data increases these space or the memory consumption also increases in a same way that is almost like there is a direct relationship between so that's why the graph is linear the sub linear on the other hand looks something like this it doesn't consume as much memory like linear but it would like less than like half or something like that so let's go to the actual problem statement suppose think that we have a stream of words in this case for the simplicity purposes I'm taking a letters so we need to count the frequency of these letters okay and that too with sub linear space and constant time and how do we do that along with that we also need to provide range queries total count and also sometimes person type so how do we do that maybe we can store all of this data in somewhere and then we can run these queries in order of n time and also the space it just taken is order of n right but the problem is these streams are kind of infinite because they will keep on coming say example in Facebook or Instagram or amazon.com these dreams that maybe the other information are any other dispatch information something like that they keep on coming in so these things are kind of infinite stream so there's no end to these streams and also we are talking about tons of data say for example this data for a day could be 1 terabytes or in a month it could be 30 terabytes and it's it would be keep on growing so whenever we think of limited space and limited time so the storage costs okay what if we have a storage taken in on amazon.com s3 or some other data store we have to pay for that right the problem is because of the limited space and and also when we run the query it should go through all of this you know terabytes of data or better bits of data - you know calibrate the frequency so we can't afford that much time also so we have limited time and also most of the time in streaming platform these streaming technology will never get a chance to look at the data once against what happens is say we have a streaming platform the data goes in and then the data goes out and you will never know where the data is gone or sometimes we don't even update this data somewhere so so this streaming platform should calculate the frequencies or whatever query we want to run work what a kind of the data we want to get as and when the stream is passing through this streaming platform so we had to have sub linear space consumption because we can't afford to have a linear space and also we need to keep on calculating or perform all this operation in real time so this solution for all of this problem is we have used some kind of algorithm our system which use a sub linear space so when we use sub linear space that means that we are kind of compressing the data and saving it in the civilian space so when we do compress you might know or you will know that the data is lost that means that we will have a little bit of inaccuracy in the data which we provide so Before we jump right into the count wins sketch working lets anywhere learn what are the possibilities with which we can do solve this problem so one is using hash table hash table takes somewhat linear space and the time complexity is out of one but it's not really or one because whenever there's a collision it might take more than an out of one so our goal was to use sub linear hash table doesn't work with sub linear space and also the time is not actually constant so what is sampling we can use something on the data stream which we are getting what we can do is we can sample in specific duration or we can do some kind of stochastic sampling or random sampling with which we can get the frequency and then we can show the frequency based on that sampling but the problem is we can't insure the true randomness so the output kind of depends on what kind of random algorithm we could use it so that also is not a very useful kazar algorithm for our problem statement now let's learn how comp in sketch works you heard sketch right a sketch is a 2-dimensional matrix or array okay that looks something like this this is a 2-dimensional matrix representation which has four rows and about six columns and so the number of rows is equal to the number of hash functions we are going to use for this convey sketch the number of hash hash function you want to use depends on yourself the more accurate results you want the more number of hash functions yogurt choose so for this example for simplicity purpose I'm going to take only four different hash function then headphone h2 h3 and h4 okay so if you see here there are four different hash function that is equal to the number of rows which we have in this sketch and the columns we have is about seven different columns and the number of columns depends on the maximum number of which the hash function gives the output so that means that we have a sketch our predefined size of the sketch and that is the reason why it is called a sub linear because this data structure or the sketch size are spaced this data structures size will never change based on the different kinds of input we are getting in this stream so this is a stream with different letters for simplicity in progress this could be numbers or this could be words or anything okay for simplicity and particularly the world the letters so now we have chosen different kind of hash functions we have the sketch okay now what we need to do to count the frequency of these letters in the stream we have to so consider we are our stream processor is listening to the stream so this will encounter the very first word a now what we need to do we have to pass this a to all these different for hash function and get those values so I have a table written here for further easy explenation purpose okay so what happens say for example we got a what we have to do is we had to pass a to this hash function eight where we get so we might get anything for for the examples purpose am my own table which gives output for explain all the different scenarios okay say we get one so when we pass a to h2 what do we get we get six when a pass to h3 we get free and when you pass to h4 we get one so now what we do we have to update the sketch with those hash values say four we have tables so first we should have this sketch filled with zeros okay next one I fill the zeros that means that we have no data same so this is the fresh sketch now we know that when you pass a to h1 we got one so what are we to do go to the first position h1 okay for a we got one that means for each one row we have to go to one and then just increment it by 1 that means we get one so next on history we have our sixth position and then incubi increment by one and then for HT we go to third position increment it by 1 and then h4 we go to first position increment it by 1 so then what we need to do so we have captured the data for a ok now we got the second element that is B so what we have to do better do the same thing Pass B to all these different hash function get output go to the sketch and increment by 1 so for B what are the different output we get so many paths b2h 1 we get 1 so we go to 100 I'll do one here so what we need to do is we just increment by 1 so the value will become 2 in this case so next I'm h2 we have to go to 2nd position now increment by 1 0 to 1 okay so for our HT we go fourth position increment by 1 for H 4 we go to sixth position I'm sorry follow me go to sixth position I increment by 1 similarly when we get K we had to go to 3rd position for hitch 1 increment by 1 I'll change the pointer and then for h2 we had a code for increment by 1 4 HD good one sorry this is this is fine okay h4 set for 842 6 so okay Justin crying but one so it will be coming to okay now we get again a so the pointer moves here so when we get a better do the same thing go to one increment so two becomes three H to go to six and increment come to four sorry for hitch tube for hit C go to that position increment for a hitch for good first position and increment okay nice now what we get again be sorry again a so same values of each increment by one so for each one it becomes four four hitch two doses it becomes three for HTE code third position it also becomes three four hit forward first question and it also becomes free now go to be or do we have B AB K for K what other hash function hash values we get for K for each one its position increment by one okay for h2 called fourth position and come in buy one for hip street go k position so you go to first position and increment by one for h4 go to sixth position increment by 1 that is 3 so then we go to s we got s so what we need to do right for h1 we got a six-person to increment that one so for h2 we got a second question increment by one for HT we go to fourth position increment by one and for h4 we go to first position any one by one so we have four here ok now we have successfully loaded or we have successfully calculated the frequency of all the stream or the elements in the stream and we have noted that the time to sketch now we have this data how do we calculate frequency of any given element suppose we want to calculate the frequency of a ok what do we do now we want to know what is the frequency of a what we need to do is we had to pass this a to all these different four hash functions again and get the hash open since we we know the number of hash function so we have to do four times that means that it's odd with one so it's honest constant all the calculations between need to do is give a to all this hash function get hashable so for a we get the same output right so we go to the same position for h1 it's one for h1 h1 we got four let's write it over here ok we got for 4h - we got six right ok let's go to 6 we got three four h3 okay we got three position get the value for hitch for we got one go to this ok so what would be got is 4 3 3 4 now as the algorithm name says we have to count we have to go to minimum in the sketch and that's the answer we did the counting because we loaded all of this data and we have incremented that's what is called as count now we need to do the minimum of those counts now we have got the data for a now we have to get the minimum what is the minimum number in this hour it's free okay that means the frequency of a is 3 let's start is here 1 2 & 3 I'll good we got 3 let's run the same case for say ok ok let's do it for K when we do and I'm late hash for K what we get the values are here for h1 we get 3 supporting each one we get 3 that is 2 for H 2 we get 4 go to fourth position that is 2 for h3 we get 1 - I think 4head 4 we get 6 go to sixth position so in this case again what is the minimum value that is 2 so that is 2 let's count it here so how many kids are here that is 2 so we got 2 so it's actually working that is what we did we did the count now we're getting the coms and get calculated minimum from the sketch that is common sketch we got the output to this always work when it will fail why do we call it as probabilistic data structure because there could be hash collision there might be some case where the data for say for example a it was there's one and six seven girls write it over here mark it over here so one and then 6 and then 3 and then 4 is at h4 and then it's 1 so for three three for a for 64 wife one or more element got these same hash values and then they all incremented and incremented these values so we could have got about five or four five or something like that right so in that case the value grab increase because of the hash collision so the more hash function we take there will be less collision so the less number of hash function we take there will be high probability of collision so it is always advised to take more hash function so our sketch has more rows in it so there will be less ability of condition we get better results countless sketch always counts the exact number are more it will never under count it will always do kind of over counting so we always get the notion that convent sketch will no miss anything but it might give more conch okay and that's what works so now that we learned how common sketch works now it's time to learn it's applications this algorithm has find its place in many different ways for example it has been used in compressed sensing networking and also it has been heavily used in NLP and machine learning algorithms and the major application is in streaming algorithms say for example spar storm or flink right so I'm not sure the sparker strong is using common sketch in their library I couldn't get that information but I'm sure that if you want to build your own streaming library for sure you can use content sketch to figure out the frequency of the events or the elements of Industry so having said that if you have any doubts you can always post it in the comment I will try to answer them as a task I would request you guys to implement the comp in sketch in your favorite programming language and post it to me to my email id or post on the comment I'm going to update the code links in my description of this video and that way the viewers can find it handy or if they can refer to the core base thank you and if you like this video please subscribe to my channel and also tell your friends to subscribe to this channel there are tons of super cool videos or about different algorithms or so some designs I'm going to post keep posting every week and also I'm always open for suggestions please do post my own solutions about the videos or how I can improve myself when explaining this stuff our if you have any requests for me to make any videos please do post thank you

Info

Channel: Tech Dummies Narendra L

Views: 58,432

Rating: 4.9701085 out of 5

Keywords: Amazon interview question, interview questions, interview preparations, algo and ds interview question, software interview preparation, developer interview questions, google interview question, Technical interview question, easy learn algorithms, system design components, Data Stream Summary, streaming counting algo, algorithm for counting stream of data, probablistic DS, probalblistic datastructure, Count min sketch, count min sketch applications

Id: ibxXO-b14j4

Channel Id: undefined

Length: 19min 30sec (1170 seconds)

Published: Sat Sep 29 2018