Big Data Processing with Apache Beam Python | SciPy 2017 | Robert Bradshaw

Captions
So, I'm Robert Bradshaw, I work for Google, and I'm going to talk today a little bit about a project I've been working on called Apache Beam.

Let's start with a little bit about data. Obviously, if you're here, you're interested in data, and in Python we have a lot of tools for manipulating data: for reading it in, for munging it, for analyzing it, for plotting it. This great ecosystem has evolved around how you understand your data, do what you want with it, and get the information out of it. One of the interesting things about data is that it can be really big, and when I say big I mean really, really big: it doesn't fit on one machine. Not only can you not load it into memory, maybe you can't even fit it onto a machine with a stack of 10-terabyte hard drives. It's often distributed, and you still want to understand and make use of it. There are a lot of reasons data is getting big these days. With the Internet of Things, processors are cheap and internet is cheap; you have mobile devices pinging you about user actions, and you have sensors out all over some desert somewhere looking for seismic activity, and you want to be able to gather that data. Another reason you want big data is machine learning. An analogy: if machine learning is a rocket, the data is the fuel, and if you don't have enough data you're just not going to get the results you want. Not only can data be big, it might be infinitely big, meaning you never see the end of it; it just keeps coming and coming. This is the streaming scenario, where you're always getting more data, and you can't just say "OK, I'll wait until my data comes in and then I'll process it"; you really have to process it as it goes.

At Google, from the company's inception, we had data that was too big to put on one machine, and one of the tools we developed early on was MapReduce. We published a paper on MapReduce in 2004, and it changed the way you think about big data processing: instead of thinking about machines, what they're doing, and how to schedule them, you step up a level and think, OK, I just want to define what I want to compute and have an infrastructure that distributes the work and gives me my results. Since we published that paper there's been a lot of development in the open source ecosystem: there's Hadoop, kind of the foundational thing, and there are things like Flink and Spark and Drill and Hive. During this time we've also been iterating internally, so we have things like Bigtable and Flume and MillWheel and Dremel. Some of these technologies we've published about, and that's inspired open source projects that take their own ideas and mix things in. So we've been looking at this and saying: OK, we develop something, we use it for a while, then we publish a paper; someone else reads our paper and some other papers and develops something; and we end up developing these two parallel worlds. A couple of years ago we decided that instead of publishing a paper, let's publish code that people can actually use, and this is Dataflow, a system that lets you do big data processing. Then we had the idea: instead of just publishing the code, let's actually integrate with all these other great tools.
That's where Apache Beam arose. It's the combination of a lot of lessons learned from doing big data processing over the decades at Google and all the open source tools that you know and love and are familiar with.

The Beam model is a step way beyond MapReduce. The idea is that it lets you focus on your application logic as the abstraction, and you can swap out the backend for different backends. Why do we care about this? At Google we were trying to clean up some old uses of MapReduce, and MapReduce had developed, over the years, hundreds of tuning parameters, like every big system does: you can tweak this buffer, you can tweak the timing on that. So we were trying to clean these up, and we had teams coming back to us saying, no, you can't delete this flag, we need it for our pipeline to complete on time; and we'd tell them that flag has been a no-op for years. This is not an isolated occurrence. What happens is that when you tie yourself to a system, you tune for the system you have: some expert comes in and says, oh, my pipeline is too slow, I know how to tweak that, and you get it running really well on that system. As the months and years go by, your data profiles change, your pipeline itself changes, the underlying system changes, but the tuning parameters never change. Worse than that, someone says, oh, this is the magic go-faster flag, I'm going to use that on my pipeline too. You end up with complicated pipelines with tuning parameters built into them that were maybe relevant at the time, and a lot of expertise went into them, but 90% of tuning parameters are obsolete, and if they aren't now, they will be in six months. So we want to completely abstract the backend away from the front end. The other thing we've realized is that we want batch and stream processing to both be first-class citizens. Instead of thinking, OK, I want to do a batch job, so I'm going to write a program against a fixed set of data files, and then someone says, OK, now we want a streaming job, so we write a completely separate pipeline that hopefully, fingers crossed, computes exactly the same thing but is completely rewritten, and then we evolve these two pipelines in parallel; instead, we want one framework that does both. And one key concept we've introduced for streaming is the notion of separating event time from processing time, to be able to have strong consistency guarantees even in a streaming system.

So without further ado, here is an example pipeline. Like Dask, everything in Beam is deferred. We create a pipeline object; you just use a with-context, "with Pipeline() as p", and there's our pipeline. Now you want to read some data in, so here we might read several text files that are sitting on maybe GCP or something like that, into a PCollection. Again, this is very fast, because nothing has actually happened yet, but we now have a virtual collection of all the strings representing all the lines in all those files. Now we want to split those lines into words; this is going to be a word-counting example. We apply a FlatMap and pass it a lambda that's just a regular expression looking for things that look like words.
We take every line and emit all the different words in it. This pipe operator here is kind of a new syntax; we played around and never found syntax that was perfect, but this one's pretty good. If you put your mind in the mode of being at a bash prompt: I take the thing to the left of the pipe operator, stream it into the operation on the right, and get something out. In this case I get words, which are all the words in all my data files. The nice thing about this pipe operator is that I can chain it: now that I have words, I can map each one to a pair of the word and a one, and then do a CombinePerKey where I say I want to sum all of these ones corresponding to each word; the first element of the pair is the key and the second element is the value. When this executes, it automatically knows, because of the CombinePerKey, that the sum operation is commutative and associative, so when it's doing a distributed job it will lift the summation to happen on the partitions and then only sum the final results in a separate reduction step. And depending on the backend, we have a lot of ability to dynamically decide the partitioning: if some files are big and some files are small, it will start reading all of them, and for the ones that turn out to be too big it will say, oh, this file is too big, but I've got some idle workers over there, let's give the second half of this file to someone else, so everything completes at about the same time, and of course the results aren't any different.

Oh yeah, and one more thing: you can define your own composite operations. This is another reason for the pipe operator: it means that all operations, whether built in, written inline here, or coming from some third-party library, are all on the same footing. One of the inspirations for Beam was something called FlumeJava (we have Flume C++ as well at Google), and we had this PCollection class and we kept adding methods to it and adding methods to it, and it became this five-thousand-line monstrosity, and that's only because we pushed back on 80% of the requests: oh, well, I want to add my special compute-quantiles-ignoring-even-numbers operation to your PCollection. So now every single thing is on the same level, and this Count is something you can define as a composite operation. The other nice thing about composite operations is that they give structure to your graph: when you're visualizing your graph you can look at the high-level pieces, kind of like functions, and then open up an operation to see what's inside it, and what's inside that. So you really have this hierarchical tree of operations. When you're visualizing or debugging your graphs and asking which step takes a long time, you can say, OK, it's this Count step, then go inside the Count step and ask what in there is taking so long; oh, it's the shuffle step, and so on. Counters and things like that are also hierarchically organized, so you can build massive pipelines (we have pipelines with thousands of stages), but they're comprehensible because maybe there are ten top-level stages and you can drill down into each of those, or ignore the ones you don't care about.
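A minimal sketch of the word-count pipeline just described, written against the Beam Python SDK; the file paths, transform labels, and the exact regular expression are illustrative placeholders rather than what was on the slides.

```python
import re

import apache_beam as beam

with beam.Pipeline() as p:
    # A deferred, virtual collection of every line in every matched file.
    lines = p | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input-*.txt')

    counts = (
        lines
        # Emit every word-like token from every line.
        | 'Split' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
        # Pair each word with a one ...
        | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
        # ... and sum the ones per word. sum is commutative and associative,
        # so the runner can pre-combine on each partition before the shuffle.
        | 'Sum' >> beam.CombinePerKey(sum))

    # Format and write the results; nothing runs until the with-block exits.
    (counts
     | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/counts'))
```

The split/pair/sum steps could also be packaged as one composite transform (for example the built-in beam.combiners.Count.PerElement()), which is the kind of grouping that gives the pipeline graph its hierarchical structure.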
Of course, it's no good if you don't actually write your results out somewhere, so the final step is to write them. This also shows that the computation is generally a DAG, not just linear. Linear is nice because you can chain things together, but you may have a DAG: you may join things together, you may split things apart. And then, when you exit this context, the pipeline actually executes.

So we've shown how to create the DAG; what about streaming? This is one of the key concepts: we introduced streaming in a way that's minimally invasive to the batch pipeline. Here's the batch pipeline we constructed, and here is the streaming version. We have to read from a streaming source if we want to do streaming (I guess it is possible to read from a batch source and do streaming, but that's not interesting, because then you just finish), and we also have to specify a windowing operation that says how our aggregations occur.

Let's look at that windowing in a little more detail. One thing in API design is that you want a separation of concerns. The first concern is: what are you trying to compute? Then, in a streaming pipeline: where in event time are you computing it? When in processing time are you materializing it? And the last question: how do the results relate to each other? For processing time versus event time, I have to credit the data Artisans folks we work with on Flink for this analogy. It's kind of like Star Wars, where you have the direction of causality in the universe and then the direction of causality in our observations, and these two are not necessarily in the same order; in fact, they could be decades out of order, and you have to reason about these two things separately to be able to have a notion of completeness. What we can do is put these on separate axes: "a long time ago" is the direction of causality in the Star Wars universe, and then we have our universe's time, when we observed the events. You can see it's not quite linear: there's a light cone that goes up here, because we can't observe events before they happen, but we may observe them long, long after they happened. We're going to use this diagram to understand how aggregation happens in windowing.

The first question is what we're computing. This is a tiny snippet of a pipeline where we say we want to compute sums per key; remember, this is a reduction that happens partly on the mappers, partly before the shuffle, partly after the shuffle. The typical batch operation is that a bunch of elements come in, we observe them sequentially, and the final result here is 51: we publish a total sum of 51 over all the elements we ever saw. That's the typical batch pipeline. Now, if we want to do something in streaming, we want to group these things into windows; here we're grouping into two-minute windows, and we're going to publish a result per window. In this first window we saw 14, and in the second window we saw 22 as the sum. This is the "where in event time" part, and notice that even though the elements themselves arrived in a different order, the window from 2:02 to 2:04 now has the correct sum. It doesn't matter when the elements came in, when we observed them; all that matters is when they really happened.
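A rough sketch of that streaming variant, assuming a reasonably recent Beam Python SDK; the Pub/Sub topic name is a made-up placeholder, and only the unbounded read, the decode step, and the WindowInto line differ from the batch pipeline.

```python
import re

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    windowed_counts = (
        p
        # An unbounded source: data keeps arriving, so we can never just
        # wait for the end of the input.
        | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/lines')
        | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
        # All downstream aggregations now happen per two-minute window of
        # event time, regardless of when the elements were observed.
        | 'Window' >> beam.WindowInto(window.FixedWindows(2 * 60))
        | 'Split' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
        | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
        | 'Sum' >> beam.CombinePerKey(sum))
```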
We're summing with respect to event time; that's our robustness. But to publish results sooner... oh, and sorry, one more thing about windows: we not only support fixed windows, you can also do sliding windows, where an element accumulates into multiple overlapping windows. Windows are also UDFs, so you can specify your own windowing functions, and they need not be the same per key. You can define, say, session windows, which say I want my aggregation to happen with respect to a session, defined by elements that are separated by a gap of no more than twenty seconds or something like that. Everything happens per key, so different keys may have different sessions; for example, your keys might be users and your sessions visits to a site.

The other thing about windowing is that it applies to the aggregation, but you actually tie it to the PCollection. This is key to letting windowing work well with composite operations: if I window by, say, fixed windows, then do a sum per key, and follow that up by asking for the largest 100 things, I don't have to re-specify my windowing; it's the same windowing I had before. And if some deeply nested composite operation does a sum, it does the windowing according to the windowing specified up above. Your pipeline can have multiple aggregations and multiple windowings throughout, and each windowing covers a portion of the DAG of your pipeline.

Now we want to understand when these results are actually computed, because in a streaming pipeline you can't wait for all the data; at some point you have to emit something before you've seen everything. This is the notion of a watermark, where this green line gives us a heuristic for when the system says: I think I've seen all the data I'm ever going to get for this window. Once it thinks it has seen all the data for a window, it emits that window so downstream processes can consume it, but it may hold on to partially complete windows until it's sure it has seen all their data. In a distributed system, if we have a correct watermark, this works great, but sometimes the upstream doesn't always know. For instance, if I'm collecting data from mobile devices, someone might have been sitting on an airplane, and if this is a mobile gaming application, when they land and reconnect, suddenly some data comes in: oh, they popped nine bubbles on my fancy new gaming app, and this might arrive hours later. So we need a notion of what to do when I was wrong about my estimate of having seen all the data. The simple model says: I just reject anything that's too old; I have enough data. But if you want to be more complete, the model also extends to emitting results before I've even seen all the data, and specifying how I want to accumulate and how those results relate to each other. This is the model that allows us to reason about streaming in a concrete way without affecting what I'm computing; notice that the yellow line here has not changed.
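To make those what/where/when/how knobs concrete, here is a sketch in the Python SDK, assuming an upstream PCollection called events of timestamped (key, value) pairs; the 20-second gap, the 30-second early-firing delay, and the late-firing count are made-up illustrative values.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

windowed_sums = (
    events  # assumed: a timestamped PCollection of (key, value) pairs
    | 'SessionWindow' >> beam.WindowInto(
        # Where in event time: per-key sessions closed by a 20-second gap.
        window.Sessions(gap_size=20),
        # When in processing time: speculative early results every 30 seconds,
        # an on-time result when the watermark passes the end of the window,
        # and a refinement for each late element that still shows up.
        trigger=trigger.AfterWatermark(
            early=trigger.AfterProcessingTime(30),
            late=trigger.AfterCount(1)),
        # How refinements relate: each firing includes everything seen so far.
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
    # What to compute: the same sum per key as before, unchanged.
    | 'Sum' >> beam.CombinePerKey(sum))
```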
That yellow line could be some ginormous pipeline, and all I have to specify is how I want my aggregations performed; this percolates all the way through. Here's an example of this "how" refinement: notice that when I hit this 9 up here, I emit a revised value and say it was actually 14, not 5. Likewise for this 7 in the second window: I say, so far I've seen 7, but it's not everything, and when I finally emit the window I say the final result was actually 22. If you're going to persist this, and maybe bill someone based on it, that's what you would use.

So there's this notion of dividing things into: what you're trying to compute; where in event time, that is, what your windows are; when you're going to compute it, whether you wait for everything or emit once you think you've seen all the data with some probability of success; and how the results relate to each other. This really lets you separate these concerns, and you can mix and match these pieces to get the results you need. One of the important things is that you can trade off latency versus correctness versus cost by varying these parameters, and different applications have different uses for them. This is just a short overview, but there's more information at the Beam site. Another thing to know is that all of this works in batch too; it isn't just for streaming. If I want to group by session windows, I can do that in batch and it works just fine. The advantage of batch is that you usually know you've seen all the data, because you've read all your input files, so you don't have to worry about late data coming in, but all of these concepts work just the same.

So what does Apache Beam provide? It provides this model for dealing with streaming data and batch data in a consistent way. It also provides a set of APIs for actually writing these pipelines; as we saw, there's a Python API, and there's a Java API that actually came before the Python one. And the other thing Beam provides is a swappable set of runners for executing on distributed backends. We have backends for Apex, Flink, Spark, Google Cloud Dataflow, and in-process testing, where if you have a bug it just drops you into pdb and you can debug it right there, or run your unit tests. You should be able to swap out these backends and the computation you do remains entirely the same, though you might see different performance characteristics, or you might have different preferences about running on the cloud versus your local cluster versus a local machine. This is the vision we have for Apache Beam: we have multiple SDKs, possibly in different languages, and a single common representation of what a Beam pipeline is. We take this representation and send it off to the runner of your choice, and that runner goes off and executes it. Then there's a portion we call the Fn API; it's spelled F-N, but it is kind of fun to work with. It calls back into the SDK in an efficient manner to invoke your UDFs, with appropriate batching and fusing, so that you don't pay an enormous serialization overhead.
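Picking a backend is just a pipeline option, so the pipeline code itself does not change; a minimal sketch, where the Dataflow project and bucket flags are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The in-process DirectRunner: handy for unit tests and for dropping into pdb.
local_options = PipelineOptions(['--runner=DirectRunner'])

# The same pipeline code, unchanged, can instead be handed to a distributed
# backend such as Cloud Dataflow; Flink, Spark, and Apex runners follow the
# same pattern, differing only in their flags.
dataflow_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                # placeholder project id
    '--temp_location=gs://my-bucket/tmp',  # placeholder staging bucket
])

# Swap in options=dataflow_options to run the identical pipeline on the cloud.
with beam.Pipeline(options=local_options) as p:
    p | beam.Create(['hello', 'beam']) | beam.Map(print)
```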
Of course, if I show a vision slide, it's only fair to say what the current status is. Things aren't all there yet, but batch support for Python actually exited beta just this year, in May, which is really exciting. Java has been out for a little while, and we just recently got Python; streaming for Python is still in alpha, but we're trying it out now and it works. We're also adding support for more runners: right now the local runner works and the Dataflow runner works, and we're working with the Flink team to get that runner working. As part of that work we're fleshing out the Fn API, so it will pretty much be plug and play: match whatever runner you want with whatever SDK you want, and things should just work.

Another point is that we're actually used as part of tf.Transform, TensorFlow's preprocessing solution. Like I mentioned before, if data is the rocket fuel, one of the important things with training is that you've got to have good data, and usually when you're doing machine learning you end up spending upwards of eighty percent of your time doing feature analysis, scrubbing, and getting your data into good shape before you feed it into the model; otherwise you're going to get garbage out. TensorFlow Transform is a library, a suite of tools, that uses Beam to do this preprocessing: you basically write a Beam pipeline, and as part of your pipeline you define these preprocessing functions. The nice thing about TensorFlow Transform is that, in addition to your preprocessed data that's ready to be trained on, it spits out artifacts that you can use at serving time to guarantee that there is zero skew between the preprocessing you execute at serving time and the preprocessing you executed prior to training. This may sound trivial, but it's surprisingly easy to get wrong: if your serving infrastructure is different from your preprocessing infrastructure and there's skew, it can create enormous problems for your model.

You can learn more: Beam is a top-level Apache project, there are some good posts on the O'Reilly blog about this streaming idea that really go into detail about how you can tease apart these notions of windowing and triggering and when things happen, and of course we've got user lists and mailing lists and a Twitter account and all that kind of stuff. So, thank you.

[Applause]
Info
Channel: Enthought
Views: 7,498
Rating: 4.96 out of 5
Keywords: Apache Beam, scipy, scipy 2017, python
Id: sle8QBPtLt4
Length: 21min 15sec (1275 seconds)
Published: Mon Jul 17 2017