Map Reduce explained with example | System Design

Video Statistics and Information

Captions
MapReduce programs work in two phases, namely map and reduce. Map tasks deal with splitting and mapping the data, while reduce tasks shuffle and reduce the data. The map function transforms the data into key-value pairs, and these key-value pairs live in the intermediary step of a MapReduce job. The key-value pairs are then shuffled around and reorganized in a way that makes sense, and in the final step they are reduced into some final output. The output may be a file that can be used somewhere else in the system.

Before diving deep, let's familiarize ourselves with why it was needed in the first place. Back in the early 2000s, Google's algorithms and applications were collecting data 24/7 about people, processes, systems, and organizations, resulting in huge volumes of data. Since vertical scaling is limited, to process these large data sets they had to scale horizontally to hundreds, if not thousands, of machines. Processing data across so many machines is a very difficult task, because you need to do parallel processing across all these machines and you have to handle failures like network partitions or machine failures. The challenge Google faced was how to process this massive amount of data with speed and efficiency, without sacrificing meaningful insights. This is where the MapReduce programming model comes to the rescue. Back in 2004, two Google engineers wrote a white paper on the MapReduce model, which allowed engineers to process very large data sets spread across hundreds or thousands of machines in a distributed setting, very efficiently and in a fault-tolerant manner. It's a pretty short white paper, though it may be fairly complex to understand for some, but I highly encourage you to skim through it; you can find the link to the white paper in my description. In this video I will give a high-level understanding of the MapReduce framework with examples, in the context of system design interviews.

If you have a Python or JavaScript background, you are certainly familiar with map and reduce functions; in the context of MapReduce in a distributed systems setting, map and reduce are a little bit different. Google engineers realized that most data processing tasks could be split up into two steps, and they created a library that allowed engineers to process huge data sets, on the order of terabytes, spread across thousands of machines. It is a fairly simple concept to understand: there is a map function that transforms the data into key-value pairs, the key-value pairs are shuffled around and reorganized, and they are then reduced to some final output.
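To make the shape of those three phases concrete, here is a minimal, single-process Python sketch of the flow described above. The helper name run_mapreduce and its parameters map_fn and reduce_fn are made up for illustration; this is not the API of Hadoop or any other real framework, and a real system would run the phases across many machines rather than in one loop.

```python
from collections import defaultdict

def run_mapreduce(chunks, map_fn, reduce_fn):
    """Toy, single-process simulation of the map, shuffle, and reduce phases."""
    # Map phase: each chunk of input is turned into a list of (key, value) pairs.
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_fn(chunk))

    # Shuffle phase: group together all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: collapse each key's list of values into one final value.
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

A concrete job then boils down to supplying a suitable map_fn and reduce_fn pair; everything in between is handled by the framework.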
Before understanding it any further through an example, there are a few things worth noting. The first thing is that when dealing with the MapReduce model, we assume we have a distributed file system, which means we have some large data set that is split up into chunks, and these chunks are replicated and spread out across hundreds or thousands of machines. This distributed file system has some sort of central controller which is aware of everything going on in the MapReduce job, meaning the central controller knows exactly where the chunks of data reside and is able to communicate with all the machines that store or process data, that is, the map machines or reduce machines, or workers. The second thing to note is that since we are dealing with very large data sets, we don't actually want to move them; we let them live where they currently live on their respective machines. What we do is have the map functions operate on the data locally: instead of grabbing and aggregating the data and moving it somewhere else, we send the map programs to the data and avoid moving all this large data.

The third, very important, thing to note about this model is the key-value structure of the data in the intermediary step. When you reduce the data chunks, remember that they are all chunks of the same large data set, so you are typically looking for some sort of commonality or pattern across these various pieces of data. Because you have key-value pairs, some keys will be common across chunks, and that makes it easy to reduce them into one single meaningful value based on that key. This will become very clear when we dive deep into our example.

The fourth thing to note is how the model handles machine failures or network partitions. To handle failures, the MapReduce model simply re-performs the map or reduce operations when a failure occurs. So if a failure occurs, say there was a network partition while generating the key-value pairs, the central controller I mentioned earlier just re-performs that operation and then moves on to the reduce step. This only works if the map and reduce functions are idempotent, that is, the output doesn't change no matter how many times you repeat the map or reduce function. So idempotency is a requirement in the MapReduce model.

When engineers work with the MapReduce model, they just need to make sure they have a good understanding of the inputs and expected outputs: what the output of the map function is, what the intermediary key-value pairs look like, and how those pairs can be reorganized in a meaningful way so that, as the input to the reduce function, they allow it to optimally reduce to a final output. As you can see, this framework simplifies a lot of things for engineers. They only need to worry about the input and output data at the various steps; they don't need to worry about the intricacies and complexities of processing large data sets in a distributed system. In practice, engineers end up using some library implementation such as Hadoop, and they just need to be aware of the inputs and outputs of these various functions.

All right, with all these points in mind, let's understand it through an example, which I am taking from the same white paper presented by the Google engineers. Here I have a set of input files, and typically these input files would be distributed across several machines; for the sake of simplicity, our example has just two input files. Our task is to count the number of occurrences of each unique word across all the files. The first file contains the sentence "this is an apple" and the second file contains "apple is red in color". The mapper counts the number of times each word occurs in its input split, in the form of key-value pairs, where the key is the word and the value is the frequency of that word. So for the first file, the word "this" occurs one time, the word "is" occurs one time, the word "an" occurs one time, and the word "apple" occurs one time. The same happens with the other input file, "apple is red in color": the word "apple" occurs one time, "is" occurs one time, "red" occurs one time, "in" occurs one time, and "color" occurs one time. By the way, both of these mapping operations happen in parallel.

Once the mapping is done, it is followed by a shuffle phase, which is often ignored in MapReduce discussions but is a critical step: here we group the values by key, still in the form of key-value pairs. We end up with a total of seven groups of key-value pairs, and this shuffling, or grouping by key, makes the job of the reducer easy and efficient, because the same reducer now handles all the key-value pairs that share a key. In the reducer phase, all the words present in the data are combined into a single output that shows the frequency of each word. In our example, we get the final output of key-value pairs: "this" occurred one time, "is" occurred two times, "an" occurred one time, "apple" occurred two times, "red" occurred one time, "in" occurred one time, and "color" occurred one time.
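As a rough illustration, the word-count example above can be simulated end to end in plain Python. This is a toy, single-process sketch: in a real cluster the two chunks would sit on different machines and be mapped in parallel, and the function names word_count_map and word_count_reduce are just illustrative, not part of any particular library.

```python
from collections import defaultdict

# The two "files" from the example: each one is a chunk of the data set.
chunks = ["this is an apple", "apple is red in color"]

def word_count_map(chunk):
    # Emit one (word, 1) pair for every word occurrence in the chunk.
    return [(word, 1) for word in chunk.split()]

def word_count_reduce(word, counts):
    # Sum the per-occurrence counts that were grouped under this word.
    return sum(counts)

# Map phase (in a cluster, each chunk would be mapped in parallel on its own machine).
pairs = [pair for chunk in chunks for pair in word_count_map(chunk)]

# Shuffle phase: group the values by key, i.e. one group per unique word.
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce phase: one final (word, total) pair per unique word.
totals = {word: word_count_reduce(word, counts) for word, counts in grouped.items()}
print(totals)
# {'this': 1, 'is': 2, 'an': 1, 'apple': 2, 'red': 1, 'in': 1, 'color': 1}
```

Because the shuffle phase has already grouped every occurrence of a word under the same key, the reducer's job is trivial: it just sums a list of ones, which is exactly why the grouping step makes the reduce step so easy and efficient.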
The trickier part, in the context of system design interviews, is being able to identify this MapReduce pattern: given a problem, can you solve it using MapReduce, and is that really the use case? That is something you need to be able to identify, and it comes from solving more and more problems. For example, say you are given a large data set of millions of YouTube videos and their metadata, such as the likes, the comments, the engagement rate, the length of the video, and so on, and you are asked to find all the videos which have a certain number of likes. For this, you need to pass through the entire metadata of all these videos and then finally produce an output file (a rough sketch of such a job follows below). Likewise, whenever you are given a large data set to analyze, where, say, the files are distributed across many machines and you need to analyze all the files collectively somehow, that is a key hint for applying MapReduce.

With that, I hope this video made sense, and that for any future system design problem you will be able to identify whether to apply MapReduce or not; at the very least, you should be able to have a discussion with your interviewer to check whether applying MapReduce makes sense for a particular problem. Thank you.
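For completeness, the video-filtering task mentioned above could look roughly like the sketch below. Everything in it is hypothetical: the field names video_id and likes, the threshold, and the tiny in-memory chunks stand in for metadata files spread across many machines. The filtering happens in the map step, next to the data, and the reduce step is little more than a pass-through.

```python
from collections import defaultdict

# Hypothetical chunks of video metadata; in practice these would be large
# files distributed across many machines.
chunks = [
    [{"video_id": "a1", "likes": 12_000}, {"video_id": "b2", "likes": 35}],
    [{"video_id": "c3", "likes": 980_000}, {"video_id": "d4", "likes": 50}],
]

LIKE_THRESHOLD = 10_000  # illustrative cutoff

def filter_map(chunk):
    # Emit a (video_id, metadata) pair only for videos above the threshold,
    # so the heavy scanning work happens where the data lives.
    return [(video["video_id"], video) for video in chunk
            if video["likes"] >= LIKE_THRESHOLD]

def filter_reduce(video_id, records):
    # Each video_id appears at most once here, so reduce just passes the
    # record through; it could also deduplicate or merge records.
    return records[0]

pairs = [pair for chunk in chunks for pair in filter_map(chunk)]

grouped = defaultdict(list)
for video_id, record in pairs:
    grouped[video_id].append(record)

matching_videos = {vid: filter_reduce(vid, recs) for vid, recs in grouped.items()}
print(matching_videos)  # only "a1" and "c3" clear the threshold
```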
Info
Channel: ByteMonk
Views: 27,275
Keywords: software engineering, faang, technology, interview, map reduce, system design
Id: cHGaQz0E7AU
Length: 9min 9sec (549 seconds)
Published: Fri Feb 03 2023