MapReduce - Computerphile

Captions
Today we're going to be talking about MapReduce, which is a programming paradigm for doing large-scale computations across a computing cluster. Google originally came up with the MapReduce idea, that way of thinking about doing large-scale computations, and it got popularized through an open-source implementation called Apache Hadoop, which became quite popular for processing very large volumes of data.

So we can do an example of a MapReduce computation. The first thing with MapReduce is that the job you want to do has to be able to be split into a map phase and a reduce phase. In the map phase you want to do the same computation across all your data items at the same time: all your data is distributed across the cluster, stored on the different nodes, and each node does the computation on its own data. The reduce phase then takes the results of the map phase, reduces them down to a single value, and sends that back as the result of the computation.

A classic example would be a word count. Say we've got a massive text file in a distributed file system across a cluster, which means each node of the cluster will be doing the computation on its own data, and we want to count the number of occurrences of each word in that text. For simplicity's sake, we've got a file whose first line is "a b a", then "a b c", then "c d", and this is distributed: let's say "a b a" is on the first node, "a b c" is on the second node, and "c d" is on the last node.

The map phase takes this and puts it into key-value pairs. We take each word as the key, and for each word we emit the word as the key and the number one as the value: (a, 1) (b, 1) (a, 1), then (a, 1) (b, 1) (c, 1), and then (c, 1) (d, 1). At the end of the map phase we've got all the keys and values.

What happens then is a shuffle phase, which the programmer doesn't need to know about, it just happens in between, and that groups these onto nodes based on the key, so that all data items with the same key are stored on the same node. So the first node gets (a, [1, 1, 1]), the second node gets (b, [1, 1]), and the last node might have (c, [1, 1]) and (d, [1]).

The reduce phase is going to return a single value for each of these keys; it's about combining all of the values associated with that key into one single value. For this we want to count the number of occurrences of, say, the word "a", so we reduce it down, and we just use a plus operator in the reduce function. It finally comes to a: 3, b: 2, c: 2 and d: 1.

If you imagine we're doing this on huge volumes of data, over very, very large files, this kind of computation is a lot more efficient if you can distribute it, because the map phase of saying "okay, this is one occurrence of the letter a" is independent of anything else. You can split that across individual nodes, they can each do that part of the computation individually, and then later on you do the shuffle and reduce it all back to the single value.
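To make the steps concrete, here is a minimal, single-machine Python sketch of the word-count example above. The data, variable names, and plain-Python grouping are illustrative stand-ins for what a Hadoop cluster would do across nodes; this is not Hadoop code, just the map, shuffle, and reduce logic in one place.

```python
# A minimal, single-machine sketch of the word-count example, just to
# illustrate the map -> shuffle -> reduce steps; a real Hadoop job would
# distribute each step across cluster nodes.
from collections import defaultdict
from functools import reduce

# The three "lines" from the example, as if each lived on a different node.
data = ["a b a", "a b c", "c d"]

# Map phase: each line is turned into (word, 1) key-value pairs.
mapped = [(word, 1) for line in data for word in line.split()]
# -> [('a', 1), ('b', 1), ('a', 1), ('a', 1), ('b', 1), ('c', 1), ('c', 1), ('d', 1)]

# Shuffle phase: group all values by key, as the framework does automatically.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)
# -> {'a': [1, 1, 1], 'b': [1, 1], 'c': [1, 1], 'd': [1]}

# Reduce phase: combine each key's values with a plus operator.
counts = {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}
print(counts)  # {'a': 3, 'b': 2, 'c': 2, 'd': 1}
```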
So are they physically moving data, or just moving the computation, saying "right, you're responsible for that computation"? They'll be moving the data: this part all happens on one node, this part happens on one node, and then this bit would be on one node, because you'd need to know the key and all the values associated with it in order to do that computation. The other point about MapReduce is data locality, doing the computation close to where the data is stored. You want to minimize the amount you're moving data around, because that obviously takes time, so you want to move data around the cluster as little as possible and do the computations close to where it's stored.

So that is the end of the MapReduce process? Yeah, you've done the reduce, that's it. What you then do with that data is up to you. In a business use case you could be using it to go over millions of customer records and get some kind of statistics out of them. You could then save that back to the distributed file system, which is probably what you'd want to do for later use. After that stage it's up to your use case.

Is this still used? It would still be used in some cases, because it's very good for doing a single batch job: you've got one computation you want to do over the data and you get a single result out of it. But it's not a very flexible way of doing things. For example, you've got to be able to fit your computation into this map phase and this reduce phase, and you've got to be working with key-value pairs, and in the Apache Hadoop version anyway it can be quite painful to fit something into that framework, or just too difficult to do. And secondly, because this basically gets the data, does the map, does the reduce, and then writes the result back to disk, it's not very good at reusing the same data across multiple computations: you're constantly having to write stuff back to disk and reload it. So it's not good for iterative algorithms. A lot of data mining, such as k-means clustering, goes over the data again and again and again, which MapReduce is not very good for. There are more recent big data processing frameworks, such as Apache Spark, that are designed to alleviate those issues.

Having a massive file of text or anything, and then having to move bits of it around, that feels like it's a bit clunky as well. Is that the case, do you think? You'd use a distributed file system, so a lot of this would be sorted out for you. For example, with the Hadoop Distributed File System, if you have a huge file, say terabytes, it splits it up into chunks and puts individual chunks on individual nodes in your cluster. Then when you're doing the processing, the point of these frameworks is that you don't have to think about it, so that bit is hidden from the programmer. For example, in MapReduce the programmer doesn't have to worry too much about the shuffle phase; that's done automatically, they just have to write the map and the reduce. You don't want to be moving data around, is the point, because that takes up time; you want to be keeping everything local on a single node as much as possible.
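As a rough illustration of the Apache Spark contrast mentioned above, here is a hedged PySpark sketch of the same word count. The file path and application name are placeholders, and the caching call is just one way Spark lets you reuse a dataset across several computations without writing it back to disk between them, which is the limitation of classic MapReduce described in the transcript.

```python
# A minimal PySpark sketch (hypothetical file path) showing why Spark suits
# reuse across computations better than classic MapReduce: the computed
# dataset can be cached in memory and queried repeatedly without being
# written back to disk and re-read each time.
from pyspark import SparkContext

sc = SparkContext(appName="WordCountReuse")

# Load the text file from a distributed file system (path is an assumption).
lines = sc.textFile("hdfs:///data/huge_text_file.txt")

# Same word-count logic as the MapReduce example, expressed as transformations.
counts = (lines.flatMap(lambda line: line.split())   # map: split lines into words
               .map(lambda word: (word, 1))          # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # shuffle + reduce: sum per word

counts.cache()  # keep the result in memory for reuse across computations

print(counts.collect())                              # first computation over the data
print(counts.filter(lambda kv: kv[1] > 1).count())   # second computation, no re-read

sc.stop()
```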
Info
Channel: Computerphile
Views: 227,911
Keywords: computers, computerphile, computer, science, University of Nottingham, MapReduce, Google Search, Map Reduce, Apache Hadoop, Apache Spark, Rebecca Tickle, MR, Distributed
Id: cvhKoniK5Uo
Length: 6min 41sec (401 seconds)
Published: Tue Dec 04 2018