A Whirlwind Overview of Apache Beam

Captions
Hi everybody, I'm Eugene. I work on the Google Cloud Dataflow and Apache Beam teams. I've been doing that for a few years, and I think it's a really good piece of work, so now I'm going to give you a really quick overview of the core things that one needs to know about Beam.

First, a little bit of history. It originated back in 2004, when Google created MapReduce, a distributed data processing programming model. At its core it's really just a huge select and a huge group-by and nothing more; it just does that really well, in a distributed and fault-tolerant fashion. It evolved in two directions. Pretty quickly, people realized that one select and one group-by is not that useful, and that it's better to build a high-level API where you can build whole graphs made of those. That's what FlumeJava did, and that is also the programming model probably familiar to people who use Spark or Flink, so pretty much all distributed data processing frameworks are in some way based on these two primitives: parallel apply and parallel group-by. The other direction in which it evolved was MillWheel, a stream processing system also created inside Google, whose focus was to disrupt the idea that streaming data processing has to be approximate. MillWheel aimed to make it deterministic and exact, and the core idea behind that was to do computations based on the absolute time when certain events happened, as opposed to when this particular processing pipeline saw them. Then, in 2014, these things were unified into Cloud Dataflow, which brought batch and streaming under one programming model that I will explain soon, because Beam uses that programming model. Dataflow from the beginning aimed to be portable between different runners, so that you can run these pipelines on more than just Dataflow, and in 2016 this culminated in the creation of Apache Beam, which is a top-level Apache project with an open ecosystem. It's not critically dependent on any particular vendor; for example, less than half of the committers are even from Google, and it's community driven.

So that is just high-level context. This is example code with Beam; it should be pretty readable for anybody who has used any data processing framework, so I'm not going to go into too much detail on it. There are two concepts: collections, which represent logical sets of data but don't really contain the data, and transforms, which take collections as input and produce collections as output. Simple enough. All Beam pipelines are composed of three primitives. There is ParDo, which simply applies a function in parallel to every element; there is GroupByKey, which takes a collection of elements, groups them by a key, and produces a collection of groups that share the same key; and these can be combined into higher-level units by composite transforms.
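(For readers following along without the slides, here is a minimal sketch of a pipeline built from those primitives, using the Beam Python SDK; the input words and the word-length key are invented for this illustration, not taken from the talk.)

```python
# Minimal sketch: ParDo (parallel apply) followed by GroupByKey (parallel group-by).
# The input words and the word-length key are invented for illustration.
import apache_beam as beam


class KeyByLength(beam.DoFn):
    def process(self, word):
        # Emit a (key, value) pair so elements can be grouped downstream.
        yield (len(word), word)


with beam.Pipeline() as pipeline:  # defaults to the local DirectRunner
    (pipeline
     | 'Create' >> beam.Create(['apache', 'beam', 'unified', 'model'])
     | 'KeyByLength' >> beam.ParDo(KeyByLength())  # apply a function to every element
     | 'GroupByKey' >> beam.GroupByKey()           # group values that share a key
     | 'Print' >> beam.Map(print))
```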
The three pillars of Beam that I would like to talk about in this talk are, first, its unified programming model, which erases the distinction between batch and streaming data processing; second, its portability, which is perhaps even more disruptive: the idea that you can program pipelines in a mixture of languages and run them on a mixture of backends; and third, Beam's role in the data processing ecosystem as a whole.

First, the unified model. The premise behind the unified model is that batch processing doesn't really exist, meaning that when people do batch processing, it's practically always part of a higher-level streaming workflow, so we might as well model it as such. I'm going to talk about just one aspect of the unified model that really needs clarification. The idea that batch processing doesn't exist comes from the fact that people almost always have some sort of continuously growing input data set, and they want an output data set that represents a function of the input. The input grows, the output evolves, and the pipeline's job is to compute updates to the output data set. The key point is that new data can always arrive. As a remark, data that grows is always naturally temporal: new elements appeared at some point. So all data in Beam has a timestamp, and this is the timestamp of when the data was logically created, for example when the user clicked the link, as opposed to when this pipeline saw the record that says the user clicked a link. This is really important because it lets you define sensible semantics independent of the execution details.

Let's see how this programming model applies to the two primitives. For ParDo it's pretty trivial: when new data arrives, you just apply the function to it and you're done. Aggregation is a little more challenging, because normally you would say that you group all elements with a given key and say, here's the group; but in this setting there is never such a thing as seeing all elements for a particular key, so we need to define it in a more general way. Let's talk about that, because this is really the only thing that presents a problem for people learning Beam. How does Beam do it? The idea is that Beam buffers data per key: the GroupByKey operation takes a collection of key-value pairs and emits a collection of pairs of a key and multiple values. If you mentally partition this by key, you can think of it as a collection of buffers where, for each key, values arrive, and sometimes we emit groups of values to the output when we think that we've seen enough. We will never see all of them, but we might at certain moments decide that we've seen enough to emit some intermediate aggregation result. Think of it this way: this arrow is the axis of event time. This is not the wall-clock time that passes; it is a logical axis of when things happened. New events sort of rain onto this axis, arriving at different times and at different points on the axis, and occasionally results come out. A more complete formalism for this is called streams and tables, and I highly recommend the presentation, I think by Tyler, about modeling Beam in this streams-and-tables theory. At a high level, we have a stream of inputs, a stream of outputs, and some controlled way to decide when to emit the outputs.

Let's look at this in still more detail. This is slightly convoluted, but these are the exact mechanics of what happens with aggregation in Beam. A data point arrives for a certain key, with a certain value and a certain timestamp. When you're dealing with temporal data and aggregating it, you almost always want to aggregate it into time buckets; you rarely need an aggregation over everything since the beginning of time. If you want that, it is possible too, with the so-called global window, but generally you want every element to contribute to aggregations in multiple windows. For example, you might use fixed windows, overlapping sliding windows, or session windows, and so on. The key point is that every element contributes to one or more windows.
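(As a sketch of what this looks like in code, again in the Python SDK: the click events and their timestamps below are invented, and a real source would usually attach event timestamps itself.)

```python
# Sketch of event-time windowing: attach logical timestamps, then aggregate
# per key into fixed windows of event time. All values are illustrative.
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.window import TimestampedValue

# (user, event time in seconds) -- invented click events.
clicks = [('alice', 10), ('bob', 12), ('alice', 130)]

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create(clicks)
     # Timestamp = when the click logically happened, not when the pipeline saw it.
     | beam.Map(lambda kv: TimestampedValue(kv, kv[1]))
     # One-minute fixed windows; window.SlidingWindows or window.Sessions
     # could be used here instead.
     | beam.WindowInto(window.FixedWindows(60))
     | beam.GroupByKey()
     | beam.Map(print))
```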
Each of these windows internally has a buffer of the elements that have been contributed to it, the elements that belong to this window, and inside each window there is a little machine called a trigger that decides what to do with new elements. When a new element arrives, we perform one of three actions. We can drop the element if it is way behind, for example if we receive data that maps to a window that closed a year ago and we decide that this is the kind of data we just want to ignore; this is called late data. We can add it to the buffer and do nothing else yet. Or we can decide that we have accumulated enough data for this key and window and say, okay, here's the next group for this key and window, and the pipeline will do something with that group. That is pretty much it, except for one more detail: the so-called watermark. The watermark is a continuously increasing estimate of how old the data we still expect to see can be. It has to be provided by the data source, because in general there is no way to estimate what data is going to come, but some data sources are able to provide this sort of estimate, and Beam can make use of it. For example, if the watermark says that it's quite likely that no data with a timestamp older than an hour ago is still going to arrive, then we can emit the results of all the windows older than that. So together we have these mechanics going on: as new data arrives, it either gets buffered or discarded, or aggregation results get computed and emitted. This is the core of the Beam model. It is the subject of a lot of misunderstanding, but I'm hoping that this explanation will clarify it.

Okay, so the key idea behind Beam's unified model is that there is no distinction between batch and streaming; these terms do not even exist in Beam's programming model. There is no such thing as a batch data set or a streaming data set, there is just a collection. The only thing that we perceive as the distinction between batch and streaming is different ways to control the aggregation of data, different ways to decide when we think that we've seen enough.
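(A hedged sketch of how the trigger and watermark machinery is spelled out in the Python SDK. The window size, the early/late firing choices, and the one-hour allowed lateness are arbitrary illustrative values, not recommendations from the talk.)

```python
# Sketch of triggering: fire speculatively every 30 s of processing time before
# the watermark passes the end of the window, fire on time when it does, and
# fire once per late element for up to an hour; anything older is dropped.
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger
from apache_beam.transforms.window import TimestampedValue

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create([('alice', 10), ('bob', 12)])   # invented (key, event time) pairs
     | beam.Map(lambda kv: TimestampedValue(kv, kv[1]))
     | beam.WindowInto(
         window.FixedWindows(60),
         trigger=trigger.AfterWatermark(
             early=trigger.AfterProcessingTime(30),   # speculative early results
             late=trigger.AfterCount(1)),             # one firing per late element
         allowed_lateness=60 * 60,                    # keep windows open one hour
         accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
     | beam.GroupByKey()
     | beam.Map(print))
```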
Okay, the next core pillar of Beam is its portability, and here I'm going to present Beam's vision. It is not 100% there yet; the work is on track to be completed by the end of 2018, but I'd like to give a sense of where we're going. We're going to a world where you can program pipelines in any language that has a Beam SDK, currently Java, Python, and Go, and new languages can be added in the future; there is also a Scala SDK created by Spotify. All of that gets translated to a portable pipeline representation, and that representation can be run by any Beam runner, of which there are currently, I think, nine; some examples are the Spark runner, the Flink runner, Google Cloud Dataflow, and the local runner. So the point is that you can program in any of these languages, or a mix, and run it on any runner.

This has a number of obvious and less obvious advantages. First, there's the obvious advantage that you're not locked into using any particular runner. For example, suppose you have a Flink cluster and you run some Beam pipelines on it, and then you decide that you don't really like how your Flink cluster is handling them, so you check out Spark; maybe you like Spark, or maybe you check out Dataflow and like it even more, all without modifying your pipeline. The other benefit is that there is no language lock-in, and by that I mean that you don't have to commit to a particular language before authoring your pipeline, because you can use transforms written in any language from any other language; for example, you can be primarily a Python programmer and... (are we completely out of time, or almost? Okay.) So there is no language lock-in, and there is no language lock-in for library authors either. This also gives a linear, rather than quadratic, rate of growth for combinations of runner and language: for example, adding the capability to run Go on Flink is free once you have a Flink runner and a Go SDK. I think that is really important. That's pretty much it about portability.

This is just a map of the Beam ecosystem, from the lowest levels to the highest. There are a number of runners, there is the model, there are various languages and families of libraries, and there are various ways to program Beam, like SQL and third-party SDKs. On top of that there is user code, and there are whole products, businesses, and services built on top of Beam; Tyler is going to talk about one of them, and there are a number of others, for example for data preparation. And on top of all that there is, of course, the Beam community, which I have found to be incredibly friendly and welcoming, so I encourage you to join it. That is pretty much it, so with that I invite you to listen to Tyler's talk that comes after mine; he will talk about Beam applied to machine learning. Thank you.
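(To make the runner-portability point above concrete, here is a rough sketch, again in the Python SDK, of handing the same pipeline to different runners purely through pipeline options; the Flink master address and the Dataflow project/region values are placeholders, not details from the talk.)

```python
# Sketch of runner portability: the pipeline definition never mentions a runner;
# only the options change. Addresses and project names below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def build(pipeline):
    return (pipeline
            | beam.Create(['a', 'b', 'a'])
            | beam.Map(lambda x: (x, 1))
            | beam.CombinePerKey(sum)
            | beam.Map(print))


# Local development with the bundled DirectRunner.
with beam.Pipeline(options=PipelineOptions(['--runner=DirectRunner'])) as p:
    build(p)

# The same pipeline on other runners, changing only the options, e.g.:
#   --runner=FlinkRunner --flink_master=localhost:8081
#   --runner=DataflowRunner --project=<your-project> --region=<region> \
#       --temp_location=gs://<bucket>/tmp
```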
Info
Channel: InfoQ
Views: 23,351
Keywords: Apache Beam, Data Processing, Google, Artificial Intelligence, Machine Learning, Deep Learning, InfoQ, QCon, QCon.ai
Id: buXqe0YQjMY
Length: 13min 9sec (789 seconds)
Published: Fri Jul 20 2018