No one at Google uses MapReduce anymore - Cloud Dataflow explained for dummies

Captions
Hello everyone, thanks for coming. I'm Martin Görner, I work in Developer Relations at Google, and today I have with me Thomas Park, a software engineer at Google who works out of the Seattle office on Google Cloud Platform. A software engineer with a jacket? Well, you know, that also exists. So Thomas actually works on this product and he will show you a live demo later, but I would like to spend most of the time with you going through the actual workings, the actual algorithm, of Cloud Dataflow. So let's jump right in.

Cloud Dataflow is a tool for crunching data. Google published the MapReduce paper 10 years ago and a lot of time has elapsed since; as you can imagine, the technology has moved forward. We have been challenged in the past for not having a "MapReduce as a service" service. The product that we announced at the last Google I/O, Cloud Dataflow, you can call it a MapReduce-as-a-service service, but it's much more than MapReduce: the paradigm is different and it's much more powerful. Let's try to see what it does with an example.

Here is something I might want to do: I want to autocomplete hashtags, the same thing the browser does when you start typing and it offers auto-completions, but only with hashtags. My input data is tweets; my output data is the autocomplete predictions. What do I need to do? I need to read the tweets, I need to extract the hashtags from the tweets, I need to count them to find the most popular hashtags, and then from those hashtags I will compute prefixes, so Argentina becomes A, AR, ARG, all the prefixes of Argentina. The last step is to take all those prefixes and, for a given prefix, find out the three most frequent completions, I mean hashtags, which become the top completions for that prefix. So if I type A, or if I type ARG, I get different completions based on their frequency.

Now, if you look at this execution: how many of you have already used MapReduce in production, or to play with data? Okay, quite a few. Among those, looking at this graph, can you tell me how many MapReduces I'm running here? A number? Okay, it's not so easy. It's two MapReduces. Now, if I was actually running this on top of MapReduce, I would have to instantiate my first MapReduce cluster, my second MapReduce cluster, shuffle the data between them, take care of the intermediate values. As a data scientist, that's not fun, that's not what I want to do. This is what I want to do: I have high-level constructs in my favorite language, which happens to be, well, we are at Devoxx, what's your favorite language? Java, yeah. Python, thank you. Nobody said JavaScript, thank you for that; I can allow myself this joke, I had a talk on front-end and Polymer before, so people know that I'm okay with JavaScript.

So here is the high-level code that you would use to execute this, and the core of it, apart from reading and writing the data, is that you process it using those primitives. One primitive is ParallelDo with a function that does something: the new ExtractTags is a class that has a do function that actually does what you want to do on your tags. Count, which is an aggregation function. Another ParallelDo which does the expanding of prefixes, the bit that from Argentina extracts A and AR and ARG and so on. And then Top, which is another aggregation function. So that's the promise here.
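To make those four steps concrete, here is a minimal single-machine sketch of the same logic in plain Java (ordinary collections, not the Dataflow SDK; the class and method names are my own): extract tags, count them, expand prefixes, and keep the three most frequent completions per prefix.

```java
import java.util.*;
import java.util.regex.*;
import java.util.stream.*;

public class HashtagAutocomplete {

    // Extract hashtags from one tweet, e.g. "Vamos #Argentina" -> ["argentina"].
    static List<String> extractTags(String tweet) {
        Matcher m = Pattern.compile("#(\\w+)").matcher(tweet);
        List<String> tags = new ArrayList<>();
        while (m.find()) tags.add(m.group(1).toLowerCase());
        return tags;
    }

    // Expand a hashtag into all of its prefixes: "argentina" -> a, ar, arg, ...
    static List<String> expandPrefixes(String tag) {
        List<String> prefixes = new ArrayList<>();
        for (int i = 1; i <= tag.length(); i++) prefixes.add(tag.substring(0, i));
        return prefixes;
    }

    // For every prefix, keep the three most frequent hashtags starting with it.
    static Map<String, List<String>> topCompletions(List<String> tweets) {
        // Count hashtags across all tweets.
        Map<String, Long> counts = tweets.stream()
                .flatMap(t -> extractTags(t).stream())
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        // Group hashtags by each of their prefixes.
        Map<String, List<String>> byPrefix = new HashMap<>();
        counts.keySet().forEach(tag -> expandPrefixes(tag)
                .forEach(p -> byPrefix.computeIfAbsent(p, k -> new ArrayList<>()).add(tag)));
        // Keep only the top three completions per prefix, ordered by count.
        byPrefix.replaceAll((prefix, tags) -> tags.stream()
                .sorted((a, b) -> Long.compare(counts.get(b), counts.get(a)))
                .limit(3)
                .collect(Collectors.toList()));
        return byPrefix;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "Vamos #Argentina", "#argentina #worldcup", "#arduino rocks");
        System.out.println(topCompletions(tweets).get("ar")); // [argentina, arduino]
    }
}
```

On Dataflow the same structure is expressed with ParallelDo, Count and Top transforms, and it is the service that decides how to spread the work over machines.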
The promise is that, as a data scientist, this is the code you are going to write, and from this, Dataflow can infer the actual execution graph, which is made of possibly several MapReduces combined in various ways. It will also deploy that execution graph on actual machines and run it. And then, if you want to change anything, instead of changing your cluster topology by hand, redeploying stuff in different ways and redoing your networking, you change the code here and hit enter, and that deploys a new cluster that does your new thing.

How does it work? Well, before we dive into the exact workings I have to explain how MapReduce works, because in the title, "nobody uses MapReduce anymore", of course I lied: we do use MapReduce, but it's a building block of a bigger thing. So, MapReduce. MapReduce is like making sandwiches: you have ingredients, you do a map step that does stuff on those ingredients, they become processed ingredients, and then you have a shuffle and reduce step that puts those ingredients into sandwiches. Clear for everyone? There is a mathematician in the first row who should be shouting "this is not mathematical enough". Okay, I get it, let's try to put it in a little bit more form.

You have data. You define a map function, that is your code, whatever it does, that filters and transforms this data. Actually, the output is not just transformed data, the output is usually key-value pairs: you do stuff to your data but you also assign a key to each piece of data. Here, for example, my keys are red and blue. Then the shuffle step is about grouping similar keys together, and the reduce step is about taking the data that has the same key and applying some reduce function, again supplied by you. And these are the machines; the graph is maybe a little bit more theoretical, but these are the actual computers this is working on. M and R are the two functions I have supplied, Java functions which have been shipped to those remote machines for execution, using this MapReduce cluster topology.

Okay, what is this for? Well, it was invented at Google, so it wouldn't be useful for anything if it couldn't do web indexing, the typical example. You have the web, web pages; the mappers take a web page and extract keywords, so the output here is a keyword and a URL. Then you put all the similar keywords together with their URLs, and the reduce step actually writes the index from this.

Okay, so now you know what MapReduce is, but that's not all. There is actually something magical that happens when the reduce step is associative. Again, I do my mapping and I have my key-value pairs, processed data with red and blue as keys, but now, if the reduce step is associative, an addition for example, those mappers can already sort my elements by key, put all the similar keys together and start adding them. So here I can actually ship the reduce code into the mapper nodes and they can start computing partial sums, because sum is associative: I can compute partial sums, reapply the reduce function, and reapply it again on later results. And then the shuffle step just shuffles the intermediate results. The benefit here: these are already aggregated results, so there is very little traffic on the network. Those results are already partial sums, and at the end my final reduce simply finishes the last sums; usually one reduce node will do if I have to add 20 numbers from my 20 mappers. We'll see this again and again: in MapReduce, when the reduce step is associative, you can ship the reduce code up into the mapper nodes and magic happens, you have much less traffic on your network and it's much more efficient.
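As a rough single-machine illustration of that partial-sum idea (plain Java rather than real MapReduce worker code; the shard lists stand in for the files each mapper would read), each mapper runs the associative combine locally and only the small partial sums travel to the reducer:

```java
import java.util.*;
import java.util.stream.*;

public class AssociativeReduceSketch {

    // Map step: assign a key to each record and emit (key, 1).
    // Because addition is associative, each "mapper" can pre-aggregate its own
    // shard into partial sums before anything is shuffled.
    static Map<String, Long> mapAndCombine(List<String> shard) {
        return shard.stream()
                .collect(Collectors.groupingBy(nationality -> nationality, Collectors.counting()));
    }

    // Reduce step: only the small partial sums arrive over the network;
    // the reducer just finishes the addition.
    static Map<String, Long> reduce(List<Map<String, Long>> partialSums) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partialSums)
            partial.forEach((key, count) -> totals.merge(key, count, Long::sum));
        return totals;
    }

    public static void main(String[] args) {
        List<String> shard1 = Arrays.asList("brazilian", "chinese", "chinese", "brazilian");
        List<String> shard2 = Arrays.asList("chinese", "brazilian", "chinese");
        System.out.println(reduce(Arrays.asList(mapAndCombine(shard1), mapAndCombine(shard2))));
        // brazilian=3, chinese=4
    }
}
```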
An example, again with the actual machines. Typical example: counting stuff. Let's say I'm counting people, I have a file with the world population in it, and I want to count all Brazilians and all Chinese. Map: I take the data, and if the person is Brazilian or Chinese I output a one. Since I'm just summing them, in the mappers I can already do the partial sums and say: on this mapper I now have three Chinese and three Brazilians, on the next mapper I have three Chinese and four Brazilians, and so on. My shuffle step: it's just these intermediate sums that get shuffled to the reduce machines, which complete the sum. And for those who are following, there is apparently a summing mistake on this slide; someone reported it to me, seven and three is ten and not nine. Indeed, I'll fix it.

All right, so this is MapReduce. This was published in 2004; it's been 10 years. Today, if you want to work with big data, we have published lots of big data papers over those years. I won't go through all of them. One that I want to mention is Dremel, which is known as BigQuery today. There are people running SQL syntax on top of MapReduce; actually, we have this product that is specialized in only running SQL queries on top of very, very big databases for your analytics needs. So if your problem can be expressed as a SQL query over a big database, don't even think about MapReduce or Dataflow or anything, jump straight into BigQuery. BigQuery is a completely hosted service: you upload your data, you type your query, ten seconds later you have your result, and this service has spun up 300 machines during those ten seconds. That's fine; at Google, that's fine.

The one paper that I want to talk about today is FlumeJava, which is actually the basis for Dataflow. This you will need if what you want to do with your data exceeds the box of simple MapReduce, or exceeds the box of what you can do with a SQL query, and it's pretty generic, so here you really shouldn't have any limitations. MillWheel, just quickly, I will not talk about: MillWheel is about the networking nitty-gritty of how to get your data flowing reliably in this complex cluster, to make sure that a piece of data leaves one machine once and once only and arrives at another machine once and once only.

So, FlumeJava. Let's go back to our MapReduce. In FlumeJava you use high-level primitives: the map stage will be called ParallelDo, the shuffle stage will be called GroupByKey, and the reduce stage is actually another ParallelDo. There are no differences between those two; they work on parallel collections, and if you look here at the intermediate stage after GroupByKey, I have a collection that here happens to have two elements, which are groups of elements with the same key. So that's a parallel collection; ParallelDo does something on it, in the same way as ParallelDo does something on the initial data. In the associative case, well, that last ParallelDo we call CombineValues. It's still a ParallelDo, it can still be treated as a ParallelDo, but we give it another name to record the fact that it is associative, because, I hope you get this by now, when the reduce step is associative, magic happens.
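Here is a tiny in-memory model of what those two stages mean, in plain Java rather than FlumeJava itself (the generic signatures are my own simplification): GroupByKey puts values into bags by key, and CombineValues folds each bag with an associative function, which is exactly the property that lets the runner push that work up into the mappers.

```java
import java.util.*;
import java.util.function.BinaryOperator;

public class GroupByKeySketch {

    // GroupByKey: turn a bag of (key, value) pairs into a bag of values per key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> groups = new HashMap<>();
        for (Map.Entry<K, V> pair : pairs)
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        return groups;
    }

    // CombineValues: fold each bag with an associative function; associativity is
    // what allows this step to start early, inside the mappers.
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> groups, BinaryOperator<V> combine) {
        Map<K, V> out = new HashMap<>();
        groups.forEach((key, values) -> out.put(key, values.stream().reduce(combine).get()));
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new AbstractMap.SimpleEntry<>("red", 1));
        pairs.add(new AbstractMap.SimpleEntry<>("blue", 1));
        pairs.add(new AbstractMap.SimpleEntry<>("red", 1));
        System.out.println(combineValues(groupByKey(pairs), Integer::sum)); // {red=2, blue=1}
    }
}
```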
Let's be a little bit more precise. In FlumeJava you have four primitives. First, before we go into the primitives: you work on PCollections. All of your data is represented in a PCollection, a parallel collection, which is an immutable collection; so if you want to process data, you take it from there and you create a new collection for the next step. ParallelDo simply takes a PCollection and outputs another PCollection of elements of a different type. GroupByKey takes key-value pairs and puts them in bags by their key. CombineValues is actually a ParallelDo, but to be more precise, it usually takes the output of the GroupByKey, so key-value pairs where there is one key and already all the values associated to that key, and applies an aggregation function on those values; the output is a PCollection of key-value pairs with one key and one value, where the input was one key and all the values related to that key. And finally Flatten, which we needed for mathematical reasons: it's a set union, you have two bags of elements of the same type and you put them in one bag. We need it in our graph to be able to put data together.

Okay, now we also have derived functions, like Count. In FlumeJava you can actually just do Count, but Count is implemented using those low-level primitives. Let's see how. And just a warning: a PCollection of a pair of key and value, from now on I will be calling it a PTable of key-value; it's just too long to write "PCollection of a pair of key and value", so if it is a PCollection of a pair of key and value, I call it a PTable. So for Count, the ParallelDo takes an element and, for the value of that element, it outputs a one; that's what counting is, you first say "I have seen one of those". The GroupByKey starts with a PTable of T and ones, so a bag of key-value pairs where the key is the value of your element and the value is one, and it puts all the ones related to one value together; you end up with a PTable of the value of your element and a collection of ones, which represents how many of those you had. And the CombineValues is actually just the sum: it sums all those ones together. The point here is that you can do a count-by-key using those primitives.

A little bit more challenging, let's do a join. What is a join? I have one bag of key-value pairs with a certain type of values, another bag of key-value pairs with another type of values, and I want a single bag that, for each key, has all the V values for it and all the W values for it. How do I do that? I should just be able to put those two collections in one bag, but that's not so easy: you can't put them in the same bag, those elements are not of the same type. So let's make them the same type. Yes, this is faking, but let's make them the same type: we use a tagged union, which is just a structure that contains either a V or a W, and a tag that says which one it is. So our ParallelDo simply does this type transformation, and hooray, our elements are now all of the same type, I can put them in the same bag: Flatten, same bag. Now I do my GroupByKey, and I need one last operation, a ParallelDo, to actually get my elements out of those tagged unions and end up with one key, all the Vs related to that key and all the Ws related to that key. So again, the join can be implemented using those low-level primitives.
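A plain-Java sketch of that tagged-union trick (my own class names, not FlumeJava's): tag both sides so they share one type, flatten them into one bag, group by key, then unbox.

```java
import java.util.*;

public class TaggedUnionJoin {

    // A tagged union holds either a V or a W; which field is non-null is the tag.
    static class Tagged<V, W> {
        final V v; final W w;
        private Tagged(V v, W w) { this.v = v; this.w = w; }
        static <V, W> Tagged<V, W> ofV(V v) { return new Tagged<>(v, null); }
        static <V, W> Tagged<V, W> ofW(W w) { return new Tagged<>(null, w); }
    }

    // The join result for one key: all the Vs and all the Ws.
    static class Joined<V, W> {
        final List<V> vs = new ArrayList<>();
        final List<W> ws = new ArrayList<>();
        @Override public String toString() { return vs + "/" + ws; }
    }

    static <K, V, W> Map<K, Joined<V, W>> join(List<Map.Entry<K, V>> left,
                                               List<Map.Entry<K, W>> right) {
        // ParallelDo: wrap every element in a tagged union; Flatten: one single bag.
        List<Map.Entry<K, Tagged<V, W>>> bag = new ArrayList<>();
        for (Map.Entry<K, V> e : left)
            bag.add(new AbstractMap.SimpleEntry<>(e.getKey(), Tagged.<V, W>ofV(e.getValue())));
        for (Map.Entry<K, W> e : right)
            bag.add(new AbstractMap.SimpleEntry<>(e.getKey(), Tagged.<V, W>ofW(e.getValue())));
        // GroupByKey, then a last ParallelDo that unboxes the tagged unions.
        Map<K, Joined<V, W>> out = new HashMap<>();
        for (Map.Entry<K, Tagged<V, W>> e : bag) {
            Joined<V, W> group = out.computeIfAbsent(e.getKey(), k -> new Joined<>());
            if (e.getValue().v != null) group.vs.add(e.getValue().v);
            else group.ws.add(e.getValue().w);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pageCounts = new ArrayList<>();
        pageCounts.add(new AbstractMap.SimpleEntry<>("session42", 3));
        pageCounts.add(new AbstractMap.SimpleEntry<>("session42", 5));
        List<Map.Entry<String, String>> checkouts = new ArrayList<>();
        checkouts.add(new AbstractMap.SimpleEntry<>("session42", "paid"));
        System.out.println(join(pageCounts, checkouts)); // {session42=[3, 5]/[paid]}
    }
}
```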
And now that I'm using those primitives, what is this for? If you define your computation in Java using those primitives, the system is capable of lazy execution, which means that none of those computations will actually be executed until you reach a run command. At that point it takes all your computation instructions and builds a computation graph out of it, a dataflow graph. Once we have this graph, we are able to optimize it, and I will go through the optimization steps. Once it is optimized, we are ready to deploy this on MapReduces in production.

Let's take an example. Here I'm on a shopping website and the marketing guy asked me for a report on shopping sessions. He wants to know, for every session, how many pages have been visited, whether the user clicked on checkout, and if he clicked on checkout, did he actually pay, was the payment successful. Marketing guys are like that. So that's my output. Actually, I have a second output, the checkout logs, which are generated by the code of my website: every time somebody clicks on checkout I record in the database what he has in his cart and so on. The marketing guy also wants a raw extract of that because he has a nifty visualization tool. All right, I don't want to know, give me the format, I'll give that to you.

Now, what do I have as input? First, those checkout logs: records in my database every time somebody hit checkout, with interesting information about what he has in his cart. Then I'm using two payment providers, so I have two payment logs, of course in incompatible, different formats, so I need to transform the first logs using ParallelDo B and the second logs using ParallelDo C before I can put them in one bag and actually apply the transformation I want on them, which is D. And finally, the guy wants to know how many page views there were in one session, so I'm also inputting my normal website logs and counting the pages related to one session ID. I do a join on the session ID and a final formatting operation F, and that's my output. So forget about what this actually does; I just wanted to tell you that this is a real-world example, and the code I have written using the FlumeJava primitives generates this execution graph.

Now, what can I do with this graph? Well, the first and pretty obvious thing is that my ParallelDos, I can merge them together. Remember what a ParallelDo is: it's a piece of code that gets shipped to some remote machine for execution. So if I have this piece of code, with an output that is consumed by this other piece of code, I can ship both to the remote machine, that will work, there is no problem there, and logically I call it a new ParallelDo C+D. The same actually works for sibling ParallelDos. I'm slightly generalizing what my ParallelDo can do: it can have multiple inputs. Again, this is a piece of code that is shipped to a remote machine, and that machine is perfectly allowed to read two files instead of one, that's fine. So in my graph, if I have sibling ParallelDos, I can merge them together and create a ParallelDo that has multiple inputs and multiple outputs.

Sinking flattens: okay, this is totally obvious, the flatten is just a union. Whether I take the outputs of B and C and put them in the bag now, or whether I first apply D and then put the results in a bag, that's exactly the same, which means that my flatten can go down. And when the flatten ends up just before a GroupByKey, it's not necessary anymore: the GroupByKey operation takes all its inputs anyway and then does the grouping, so this is a completely trivial transformation, but the flatten is not needed there anymore. So, two pretty trivial transformations.
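The ParallelDo fusion above is easy to model: a ParallelDo body is just a function from one element to a list of elements, and a producer and its consumer compose into a single function "C+D" that can be shipped to a worker as one unit, so the intermediate collection never has to be materialized between stages. A toy sketch in plain Java (my own interface, not the SDK's):

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.*;

public class ParDoFusion {

    // A ParallelDo body: one input element in, zero or more output elements out.
    interface DoFn<I, O> extends Function<I, List<O>> {}

    // Producer-consumer fusion: C's output feeds D, so ship a single fused
    // function instead of writing C's output anywhere.
    static <I, M, O> DoFn<I, O> fuse(DoFn<I, M> c, DoFn<M, O> d) {
        return input -> c.apply(input).stream()
                .flatMap(mid -> d.apply(mid).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        DoFn<String, String> extractTags = tweet -> Arrays.stream(tweet.split("\\s+"))
                .filter(w -> w.startsWith("#")).collect(Collectors.toList());
        DoFn<String, String> expandPrefixes = tag -> IntStream.rangeClosed(2, tag.length())
                .mapToObj(i -> tag.substring(1, i))   // drop the '#' and emit each prefix
                .collect(Collectors.toList());
        DoFn<String, String> fused = fuse(extractTags, expandPrefixes);
        System.out.println(fused.apply("go #arg"));   // [a, ar, arg]
    }
}
```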
Now the non-trivial part. I'm going to do a kind of pattern matching on my graph, and what I want to extract are those patterns which I call map-shuffle-combine-reduce, MSCR. First of all, how do I extract them? I take one GroupByKey, and this is going to be a totally non-mathematical, non-graph-theory explanation: I take a GroupByKey and I pull on it, and all the other GroupByKeys that come with it, I take them with me. So I take groups of GroupByKeys which are related by parent ParallelDos feeding into both; again, the image is that I take a GroupByKey, I pull on it, and everything that is attached to it comes along. That gives me my middle column of GroupByKeys. The first layer: I try to find one column of ParallelDos before my GroupByKeys. Now, on the output side, to be flexible, if I need the output of a ParallelDo immediately for another MSCR, I'm allowed to do the direct shortcut that is at the top, but most of the time I go through a GroupByKey, and after a GroupByKey I want a CombineValues, so an associative reduce step, and another ParallelDo that does something with this output. If I don't have them, no worries: the do-nothing CombineValues is the identity, I just put it there, and the do-nothing ParallelDo is also an identity, I just put it there. But this is the pattern I'm looking for: one column of ParallelDos with as many inputs as I want, one column of related GroupByKeys, and on the output channels either a straight output or GroupByKey, CombineValues and ParallelDo.

Now, why do I want to do that? Well, if you look at this group, it's not quite a MapReduce, it's a kind of generalized MapReduce, but it can be executed in exactly the same way as a MapReduce. You have a map step; okay, it's not defined with one piece of code, it's defined with three pieces of code, one and two and three, but who cares, you know how it is defined. You have a GroupByKey step; again, it's not just one GroupByKey, there are multiple keys and two GroupByKeys, but same thing, you can ship this to your shufflers, that's fine, you know where the data is coming in and which keys you are grouping, it works. And on the output side as well, you have a CombineValues and a ParallelDo; again, you can ship this to your reducers. So this node, which we will call a map-shuffle-combine-reduce, can be executed using a slightly generalized MapReduce framework.

Okay, and one last interesting decision. Let's go back to our autocomplete. We were counting our hashtags, and then there is this expand-prefixes step. Well, that one is problematic: it's explosive. For Argentina it gives me ten prefixes, A, AR, ARG, and so on; it's explosive. The Count at the top, I said, can be broken into its primitives, and now I have an interesting decision to make: do I group my MSCRs like this, or do I group them like this? Who has an opinion? All right, this is hard. So, like this? Mighty bad idea. Why? The first one is a valid MSCR, okay: I have a ParallelDo, a GroupByKey, a CombineValues and another ParallelDo, but that last ParallelDo step is explosive, it's multiplying the size of my data by a factor of eight, so I'm actually creating big, big network traffic flowing into the second ParallelDo. That's bad. This is much better. But you will tell me: well, now the network traffic is between the map and the shuffle stage, why is that better? Someone knows? Associative reduce, yes: we have this CombineValues step after the GroupByKey, which means that the associative CombineValues will be shipped into the mapper and the reductions will start happening in the mapper. So if I put this into an MSCR, at the output of it I don't have plenty of values, I have already started aggregating them in the mapper, and that fat arrow is not there anymore if I do the grouping like this.
Okay, so a rule of thumb: cut on the thin arrow, don't cut on the fat arrow. We engineers need simple rules of thumb. All right, let's see this in action. So here again is my graph. Flatten, oh sorry, first Count and Join: we have seen that they can be replaced by their low-level primitives. Count is a ParallelDo, a GroupByKey and the actual addition. The join is slightly more complicated: the J1s are the creation of the tagged unions, then you flatten all that, you group by key, and you need to unbox them from their tagged unions. All right, so this is the execution graph I'm going to optimize. First the flattens: they go down, down again. Now I have two flattens in front of the GroupByKey; that's useless, the GroupByKey already puts all of its inputs in the same bag, and flatten is just "put everything in the same bag", so I don't need that. Now my ParallelDos: merge, merge, merge, merge. Oh, here, this merge actually was the interesting decision; I merge it in a certain way. And the last one, merge. Now I'm just going to move the nodes around, I'm not doing anything at this step at all, and I'm inserting some identity nodes just so that I can do my pattern matching. Lo and behold, here are my two MapReduces generated from this graph. So in the end, this is what my execution will look like: two MapReduce steps, with four inputs and this graph topology. And just to make it real, here are the actual machines this will be executed on: over there I have my mappers, I have my reducers, and those are configured not with just one map function, those are actually configured with my map functions A, B, C, G and J1, and the other machines at the end with F and J2, and so on. So here I know exactly how my data flows and exactly to which computers I'm shipping my pieces of code; I know how to run this.

Performance? Well, if you have a look at the performance benchmarks, the only takeaway is that the optimizer, we tested it, is as good as optimizing by hand. People who know MapReduce, who know how to arrange MapReduces together, will not do a better job here. And there is an interesting comparison to Sawzall, which is not completely an apples-to-apples comparison. Sawzall is a DSL for processing logs that we have at Google, so you see that on code size it's actually very efficient: if your problem is processing logs, the processing can be written in a few lines. But Sawzall executes on MapReduce and the performance is actually crap. Why is that? Well, Sawzall does not have this model of deferred execution where you create a graph, optimize the graph and so on, and it's also a high-level language, which means that it doesn't have any low-level primitives either that would allow you to optimize your MapReduces by hand, so you're stuck. With FlumeJava, with Dataflow, we are at the right level of abstraction: it's high-level enough to let you do the work you want to do, but it also has these primitives that let the optimizer do the work the optimizer has to do.

And now I'm sure you want to see a live demo, right, Thomas? Live demo. So, hi everybody, I'm Thomas Park. I am a software engineer working on BigQuery, actually, but in a prior life I did work on the predecessor to Dataflow. Today I'm going to walk you through an example that I wrote to teach people inside Google how to use the programming primitives, so hopefully it'll be useful for you guys as well. Excuse me.
So let's take a quick example. Let's say that you want to extract some data from a set of Apache web logs. And you can tell that I'm a developer, by the way, because my slides look like crap compared to Martin's. But that means I'm not a developer! Thank you, Thomas. No worries. So the timeline in the middle here is basically a graphical representation of a click stream: you have clicks represented as dots, and they proceed over time. Let's say that we wanted to get some information about our web logs, such as: what query terms, what search terms, are people using to access our website, and which of those are really sticky? In other words, what query terms actually end up leading to people spending a lot of time on our website? That might inform our SEO decisions, or where we want to monetize the keywords. So here we have clicks with search terms. Now, just getting the search terms out is pretty straightforward; you could do that with grep. Let's say that we want to further our example by looking at the clicks that were made by different users: here we have the user in red up top and the user in blue on the bottom, and we separate out their web clicks. Further, we want to group these clicks into sessions, so we say: I want to get the sessions, the clicks that are within a relatively tight window of time of each other. This is starting to look a little bit less trivial than just using grep. So we say, okay, these clicks happened relatively close to each other, so we'll count those as one session, and then the user came back later and did a few more clicks, so we'll separate that out into a separate session. Now, to blend in the search term information, let's say that we take a look at those sessions that have a search query term in them, like "genomics" or "dataflow" for example, and we want to roll up the time that was spent in the sessions according to which sessions have these terms in them, and then add that up across our entire global user base. So in this example we see that people are spending 75 seconds on the site when they came in with the "genomics" search term and 100 seconds with "dataflow". So this all of a sudden is looking like a pretty non-trivial bit of code, right? Now, those of you who are experienced MapReduce users might say: okay, in this situation we can turn to a parallel processing framework such as MapReduce. But the problem with MapReduce is that it's not always obvious exactly how to decompose the problem into a series of map and reduce operations. Can anybody here tell me, in 30 seconds or less, what your maps and reduces would look like to generate this? I will buy you a beer. Yeah, it's non-trivial.

So one of the really beautiful things about Dataflow is that it lets us work at a higher level of abstraction. In an ideal world we would start with the algorithm, and my algorithm would look something like this. First I want to parse the log files: I've got these flat text files, I need to extract meaning from them. Then I need to get the clicks and group them by client, so we can identify individual users by, for example, taking a look at the IP address they came in on and their user agent. Then we calculate the session times by grouping those clicks that happen within a relatively short period of time of each other. We extract the query terms, so we look at the referrer string and say "aha, this one had a query term in it" and we grab that information. And then we merge this information together, so we can say: these are the sessions that had query terms, and these are the session times associated with them.
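The roll-up at the end of that list is simple to picture in code. A plain-Java sketch (my own types, not the demo's actual classes), assuming the sessions have already been assembled with their durations and referrer query terms:

```java
import java.util.*;

public class QueryTermStickiness {

    // One user session, already assembled: how long it lasted and which search
    // terms (if any) appeared in the referrers of its clicks.
    static class Session {
        final long seconds; final Set<String> queryTerms;
        Session(long seconds, String... terms) {
            this.seconds = seconds;
            this.queryTerms = new HashSet<>(Arrays.asList(terms));
        }
    }

    // Roll up total on-site time per search term across all sessions.
    static Map<String, Long> timePerTerm(List<Session> sessions) {
        Map<String, Long> totals = new HashMap<>();
        for (Session s : sessions)
            for (String term : s.queryTerms)
                totals.merge(term, s.seconds, Long::sum);
        return totals;
    }

    public static void main(String[] args) {
        List<Session> sessions = Arrays.asList(
                new Session(45, "genomics"),
                new Session(30, "genomics"),
                new Session(100, "dataflow"));
        System.out.println(timePerTerm(sessions)); // genomics=75, dataflow=100
    }
}
```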
Now, I'm not a MapReduce expert; it would take me a little while to figure out how to do this in MapReduce. But the really beautiful thing about Dataflow is that it allows us to think about things in a much more functional manner. So watch this: there's almost a one-to-one, well, actually there is a one-to-one relationship between these steps in the algorithm and the things that you would do in Dataflow. Parse log files: boom, it's a ParallelDo where you just give it a function that extracts that information from your log files. Awesome. Extract click times: another ParallelDo that extracts the clicks, and then a GroupByKey so we get all the clicks from a single user together. Same thing with calculating session times, no problem. Extracting the query terms, exactly the same thing. And then this last one is a little more complicated: basically we have to take these two parallel collections and do a CoGroupByKey. One thing I'll note is that Martin has been explaining the FlumeJava paper to you; FlumeJava has been in use for several years at Google, and we've learned from some of the operational difficulties with it, so to speak. FlumeJava can be very verbose, so the programming primitives that are available in Dataflow are a little bit different, and actually quite a bit more concise. Having gone from writing FlumeJava to writing Dataflow, it's a really nice experience, so I think you guys are all going to be pretty excited when you get it.

So, would anybody like to see this in action? Yeah? All right, let's do it. Can everybody read the terminal okay? All right. What I'm going to do here is start a build. This basically requires that you download the FlumeJava, sorry, the Dataflow SDK, but it builds with simple Maven commands, so it's pretty easy to embed in your current build environment. I'm going to build a jar file that has all my code dependencies, there it is, and I'm going to start a quick runner script that I wrote that just executes the jar. So we see this starting up here; it's uploading my jar to Google Cloud Storage, so that the worker nodes can get it. Now, this takes a little bit of time to get all fired up, so I'll run you through some other interesting little widgets while we're waiting.

First, let me take a quick digression: has anybody here ever heard of Star Wars Kid? Okay, only a couple of you. This is a video that went absolutely viral about 10 years ago; I'll just show you a little bit of it. You might be asking yourself why the hell I am telling you this; I just want to show you this so that when we see the results of the Dataflow pipeline, it makes sense. Star Wars Kid is basically a video of a high school kid who is spinning a staff around, and somebody remixed it and turned it into this, which is kind of awesome. It went absolutely off the hook; you can see this video on YouTube got almost 30 million views. Now, the cool thing about it is that this was originally surfaced on a website called Waxy. This is a blog that mostly deals with video games and anime and things like that, but this person decided to open-source the web logs from the time that Star Wars Kid was going absolutely viral, so that's what we're working with here. Let's take a look: the logs themselves are 1.5 gigs, it's non-trivial, it's almost 10 million lines, a fairly chunky amount of data.
Here's the Compute Engine dashboard; this should show us our MapReduce workers, or sorry, our Dataflow workers, spinning up. (Maybe you should say that the graph is executed on Compute Engine nodes.) Oh yeah, thanks. So what we have here are two different types of nodes: there are worker nodes and there are shuffle nodes, and you see we have quite a lot of them here to process this data. Right now your workloads are only constrained by the amount of Compute Engine quota you have, so you can scale up as much as you need to, as long as you have Compute Engine capacity, and the Dataflow back-end will turn this into Dataflow workers and shuffle workers.

So let's take a look at the Dataflow UI. Now, this should be our Dataflow job. I'm afraid there's been a problem with Dataflow today; for some reason the execution graph is not showing at the moment. What it should look like is something like this; of course the program that we're running would have a much more interesting, much more branched-out graph, but unfortunately that's not rendering at the moment. We do see, though, that the logs from your Dataflow execution are persisted in the Developers Console, which is pretty cool. Let's go back to our terminal for a sec. We see the system running. Now, the neat thing about this type of parallel computation is that we are currently running over 1.5 gigs of logs; if we had a terabyte of logs, it should take about the same amount of time to run the computation, assuming that you had enough compute capacity for it.

So let's take a look at the code. Here's my SessionTime class. If we look at the main, we see that we have a pipeline; we're building the pipeline here, and from that pipeline we get a parallel collection with the input, which we then pass into this function here, which is calculate session times. I'll jump into that, and this should look very familiar to all of you; this is basically exactly the same code that I showed you on the previous slide. If we jump into one of these functions, calculate session times for example, it's a pretty simple functor; a lot of you are probably thinking "oh my god, lambdas would be awesome for this", and you're totally right. Basically, calculate session times is just a simple DoFn that takes some of this data; in this case we have to sort it, and I think there's a new primitive coming, a key-value order-by-value, that will allow us to avoid the sort. But basically all of these DoFns are just really simple snippets of code, and in fact you can define them inline as anonymous classes and that works just fine as well. I'm showing you the entire file here, including imports, copious whitespace and comments, and we've written this entire program in about 200 lines. If I was taking some shortcuts and defining these things inline, we could probably get it down to about 150. Does anybody have any idea how long this program would be if you wrote it in Hadoop? I don't know either, but it must be quite a bit longer.

All right, let's check back on this guy. A question? Yeah, go for it. Right, so the question is: do we benchmark it against Spark? I actually don't work on the Dataflow team, so I don't have a lot of information about what competitive benchmarking they've done, but I'd be shocked if there wasn't that data somewhere in Google; unfortunately I can't answer that question. So, let's see where we're at with this. Aha, you see now we are turning down the worker pool, so we should have some results. Let's take a look.
Here's my output directory, and aha, you see, now we have our session time data. What we're looking at here are the session times in milliseconds and the search terms that came in. I apologize, I did not sanitize this data at all, so if there's anything offensive, please forgive me. But what we see is that we can start to pick out the query terms that have a really, really sticky presence to them, the ones that actually caused users to stay on the site for a long time. Now, what I'm seeing here, though, is that these results are not sorted. In MapReduce or in Hadoop you would have to go and do some potentially nasty hacks to your code to get this output sorted. I'm not going to do that as live coding, because I'm not a huge fan of live coding, but what I can tell you is that here we have the code: right before we write out the results, we have a parallel collection that has the query terms and the session times. All I would have to do to sort this would be to add another apply step that took the top N; there are primitives in Dataflow that allow you to get the top or bottom N by key, and there's another one forthcoming that can take an arbitrary comparator, so I'd be able to write a comparison function that looked at the value of the key-value pair and sorted against other key-value pairs, rather than looking at just the key. Does that make sense? Great.

So let's just jump back here really quick, and we see that I'm now on the default Compute Engine dashboard. This means that Dataflow has torn down all my workers; we no longer have anything running. It basically starts up your Dataflow workers and then tears them down when you're done, so it's very, very simple. And that is it.

Now, there's one more thing. Let's go back to the slides. Up to now we have only spoken about processing files with Dataflow. What would it take to process real-time data? And actually, for this hashtag autocomplete feature, that's what you want: you don't want to do the autocomplete on the hashtags that were popular in 2011, you want the hashtags that were popular in the last hour. Dataflow gives you that as well. You have to change your read and write to reading and writing from a stream, and then there is just one line of code that changes: instead of working on all the data, you work on a time window. And that's it. Now let's step back and understand what just happened: I changed this one line of code in my processing code, I deployed it, and the processing cluster is doing a completely different thing, with actual real-time stuff going on, the data flowing in. It's on autopilot, it can automatically adjust to the volume of the stream that is coming in, and what you did is just one line of code. I think that's the right abstraction layer for data scientists: you don't want to worry about networking, deploying stuff and managing the life cycles of your machines.
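That one-line change is the windowing step. As a rough plain-Java illustration of the idea (not the SDK's actual Window transform; the window size and key format here are my own), "working on a time window" just means keying every element by the fixed window its timestamp falls into, so counts are kept per window instead of over the whole data set:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.*;

public class FixedWindowSketch {

    // Round an event timestamp down to the start of its fixed window
    // (e.g. one-hour windows).
    static Instant windowStart(Instant eventTime, Duration windowSize) {
        long size = windowSize.toMillis();
        return Instant.ofEpochMilli(eventTime.toEpochMilli() / size * size);
    }

    public static void main(String[] args) {
        // Count hashtags per (window, tag) instead of globally.
        Map<String, Long> countsPerWindowAndTag = new HashMap<>();
        Instant eventTime = Instant.parse("2014-11-13T10:42:00Z");
        String key = windowStart(eventTime, Duration.ofHours(1)) + "|#devoxx";
        countsPerWindowAndTag.merge(key, 1L, Long::sum);
        System.out.println(countsPerWindowAndTag); // {2014-11-13T10:00:00Z|#devoxx=1}
    }
}
```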
That's it, thank you. If you want to tell us something about this session, you have the link there, you can rate it, and we still have time for questions. Yes? So the question is: do we plan to support Java 8, so that you can write your DoFns using lambdas? That would be a mighty cool idea, I think, so I'm pretty sure this will happen; I don't have an idea of the time frame, and I'm pretty sure it will not happen in the first alpha. Other questions? Yes, please. Well, one place where you can run this is on your local machine. That's actually one of the beauties of this: you write the code, it's just Java, and you run it on your local machine and it does everything it's supposed to do, without you having to instantiate a virtual cluster of MapReduces or whatever on your local machine. So that's one thing. I'll show you an example really quick, actually: here's the pipeline that I'm creating in my SessionTime class. To enable this to be unit tested easily, we actually have a test pipeline; this runs everything in-process, sequentially, which lets you do things like attach debuggers, which is extremely difficult if you're running in a distributed environment and shipping your classes off to remote worker nodes. So you can execute this locally, and there's also the ability to execute the pipeline on a single node in a similar manner, if you don't have large-scale computational needs. And to answer your question about where you can run this: the goal of Dataflow is to be a managed service, you provide the code and we take care of the operations for you, so it's a managed service. Now, we also publish all of those algorithms, the FlumeJava paper is public and so on, so I think the same thing that happened with the MapReduce paper and Hadoop will happen again. Someone told me that the Apache project that implements Flume is called Crunch; I haven't looked it up myself, but you have a specialist over there.

Was there a question over here? So the question is: when we calculate the session times, can we do it using a sliding window? Absolutely. The function that I have here, it's just code that you write. In this situation, calculate session times, this code here gets the click times by client and then we calculate the session times using this function. If we jump in here, we see that basically all we're doing is sorting the collection and then applying this function, which I'll jump into really quick, and it's pretty straightforward computer science: we're basically just iterating through the clicks that exist, and what I'm saying is, if the clicks happen within a short enough period of time of each other, then we count them as part of the same session, and then we add a little bit of time for individual clicks. So you could do whatever you want in here; if you have the data available in an array or a list, then you can implement whatever function you want that does calculations on it. You're not convinced? Other questions? Last question, someone? You have to shout, we don't see you very well. All right then, thank you very much.
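A plain-Java sketch of the gap-based sessionization described in that last answer (hypothetical names and thresholds, not the demo's actual code): one user's sorted click times are split into sessions whenever the gap between consecutive clicks exceeds a threshold, with a little time credited for each click.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.*;

public class SessionSplitter {

    static final Duration MAX_GAP = Duration.ofMinutes(30);   // gap that ends a session
    static final Duration PER_CLICK = Duration.ofSeconds(10); // time credited per click

    // Split one client's sorted click times into session durations.
    static List<Duration> sessionTimes(List<Instant> sortedClicks) {
        List<Duration> sessions = new ArrayList<>();
        if (sortedClicks.isEmpty()) return sessions;
        Instant start = sortedClicks.get(0);
        Instant previous = start;
        for (Instant click : sortedClicks.subList(1, sortedClicks.size())) {
            if (Duration.between(previous, click).compareTo(MAX_GAP) > 0) {
                // Too long since the last click: close the current session.
                sessions.add(Duration.between(start, previous).plus(PER_CLICK));
                start = click;
            }
            previous = click;
        }
        sessions.add(Duration.between(start, previous).plus(PER_CLICK));
        return sessions;
    }

    public static void main(String[] args) {
        List<Instant> clicks = Arrays.asList(
                Instant.parse("2003-04-29T10:00:00Z"),
                Instant.parse("2003-04-29T10:01:00Z"),
                Instant.parse("2003-04-29T12:00:00Z")); // a second session after a long gap
        System.out.println(sessionTimes(clicks)); // [PT1M10S, PT10S]
    }
}
```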
Info
Channel: Parleys
Views: 13,245
Keywords: Devoxx 2014, Devoxx, 2014, Google, MapReduce, Cloud, Dataflow, tutorial, training, course
Id: AZht1rkHIxk
Length: 55min 32sec (3332 seconds)
Published: Mon Jan 04 2016