Apache Beam and Google Cloud DataFlow – Mete Atamel

Video Statistics and Information

Captions
[Music] Hello everyone, my name is Mete Atamel. I'm a developer advocate at Google in London, and this is my second time in Lviv. I came here last year: one of my co-workers got an email from the organizers about the DevFest in Ukraine, but she couldn't come, so she forwarded the email to me and asked, "Do you want to go to this DevFest in a place called Lviv?" At the time I wasn't really sure where Lviv was, and I wasn't sure what DevFests were, but I accepted. I came, and I was really amazed by the event; it turned out to be one of the best DevFests I've been to, and I loved the people I met here and the connections that I made with everyone. One thing I really like about DevFest Ukraine is that it stays with you. After DevFest Ukraine I went to the DevFest in Pilsen and saw a couple of the organizers from here there; I went to Tel Aviv, and there was a speaker from Germany there whom I had met here; I even went to a DevFest in Russia, and there too I met a speaker I had met here. So DevFest Ukraine is great. I just want to thank the organizers for putting on such a great event, and it's great to be back.

Today I'm going to talk about Apache Beam and Google Cloud Dataflow. If you want the slides, this is my Twitter right here; you can just follow me and I'll post the slides. I also have my email, so if you have any questions, even after today, feel free to reach out to me.

All right, so what is... let's see if this works. All right, it doesn't work. Does it have to be turned on? All right. So what's the premise of my talk? Sometimes I go to talks where the speaker talks and talks and you only get to hear the big premise at the end. I like to do it the other way around, so here is the premise: what is Apache Beam? Apache Beam is a single, unified model for batch and stream processing. What that means is that when you do big data processing, you normally have to choose what kind of processing you want to do. You either do batch processing, which means you wait for your data to come in and, once you have all the data, you process it; or you do stream processing, which means you process the data as it comes in. It's really difficult to do both at the same time, because the tools and frameworks around big data processing do either batch or stream. Apache Beam is trying to change this: it's trying to give you a single model where you can do both batch and stream processing, and that's what I'm going to talk about today.

Before I do that, I want to give you an overview of big data processing at Google and talk about how it evolved there, because Beam came from that lineage as well. From there we'll get to Dataflow, so I'll talk about the Dataflow model and the Dataflow service, and finally we'll take a look at Apache Beam and see how Dataflow evolved into Apache Beam.

Big data processing at Google started with what's called MapReduce. I'm sure most of you know what MapReduce is: there was a paper on MapReduce back in 2004 that explained how to do big data processing on multiple machines. First of all, if you can process your data on a single machine, then you don't need MapReduce, you don't need Dataflow, you don't need anything; you can just do everything on one machine, and it's nice and simple.
But when you have lots of data, it's not feasible to process it on a single machine. In MapReduce, what you do is first chunk your data into smaller pieces and then map those chunks onto different machines. Those machines apply some kind of mapping function, so they transform the data; from there you send the data to other machines, so you shuffle it, and then the reduce machines apply an aggregation to your data and you get your result. So it's basically mapping your data, shuffling it to the reduce machines, and then applying an aggregation there to get your result. That's how big data processing started at Google; it started with a paper in 2004.

After MapReduce there was a lot of innovation at Google. If you look at what came after MapReduce, there's Bigtable, there's MillWheel, there's Pub/Sub, there's Spanner, there's Flume. These were all papers or internal implementations that we had at Google, so they weren't really available to people. Just to let you know, some of these things are now available: for example, Pub/Sub is the publish/subscribe messaging we used internally at Google, and now it's available to anyone who uses Google Cloud; similarly, Spanner is another technology we used at Google for a long time and are now making available as well. But at the time, these were either papers or internal implementations that only Google had access to. At the same time, people at Apache looked at these papers and started creating open-source implementations. Hadoop came along, which is the open-source MapReduce implementation, and there were other things like Spark and Pig and Hive; basically, an ecosystem of open-source projects came into being after MapReduce.

Because of this split in innovation, we actually have two products at Google to support it. All the big data innovation inside Google ended up in something called Cloud Dataflow, which I'm going to talk about. At the same time, if you are using open-source projects like Hadoop and Spark and you want to run them on Google Cloud, we have something else called Cloud Dataproc. Cloud Dataproc gives you a Hadoop or Spark cluster in about 90 seconds; you run your job on that cluster, and once you're done, the cluster goes away and you only pay for what you use. So because of this split in innovation, we have two products to support two different ecosystems.

MapReduce is great, but it also has a lot of problems. You usually don't use a single map and a single reduce; because your data is complicated, you usually chain multiple MapReduces together. Sometimes you do a map, then a reduce, then maybe another map afterwards, so what you're really building is a pipeline. People within Google realized this and created something called FlumeJava. The idea of FlumeJava is that instead of thinking in terms of map and reduce, you start thinking in terms of a data pipeline. FlumeJava provides pretty high-level APIs for you to define what your pipeline is and what kind of transformations you want to apply, and under the covers it uses MapReduce: it looks at your pipeline, tries to optimize the MapReduces, and runs them for you in an efficient way. But as a developer, you're not really thinking in terms of map and reduce anymore.
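To make the map, shuffle and reduce flow described at the start of this section concrete, here is a toy, single-process word count in plain Java. This is only a sketch of the shape of the computation (a real MapReduce distributes each phase across many machines), and the input chunks are made up for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy, single-process illustration of the map -> shuffle -> reduce flow. */
public class ToyMapReduce {

  public static void main(String[] args) {
    List<String> chunks = Arrays.asList("to be or not to be", "to see or not to see");

    // Map: each "machine" turns its chunk into (word, 1) pairs.
    List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
    for (String chunk : chunks) {
      for (String word : chunk.split(" ")) {
        mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
      }
    }

    // Shuffle: group all pairs with the same key together.
    Map<String, List<Integer>> shuffled = new HashMap<>();
    for (Map.Entry<String, Integer> pair : mapped) {
      shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }

    // Reduce: aggregate the grouped values for each key, here with a sum.
    Map<String, Integer> reduced = new HashMap<>();
    for (Map.Entry<String, List<Integer>> entry : shuffled.entrySet()) {
      reduced.put(entry.getKey(), entry.getValue().stream().mapToInt(Integer::intValue).sum());
    }

    System.out.println(reduced); // e.g. {not=2, be=2, or=2, to=4, see=2}
  }
}
```

Chaining several of these stages by hand is exactly the boilerplate that FlumeJava's pipeline view was created to hide.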
Let's take a look at an example. Imagine you want to compute the mean temperature per location from some sensor data. First, there's this notion of a PCollection in FlumeJava. A PCollection is basically just a collection of data; in this case it's sensor events, so you have some sensor data coming into your system. Now let's say you want to parse this data into location/temperature pairs, because you want to be able to calculate the mean temperature per location. You have a parsing function that takes your raw data and parses it into a location and a temperature, and you wrap this into something called ParDo. ParDo stands for "parallel do", and what it does is take your parsing function and apply it in a parallel way, so you're doing this parsing on multiple machines in parallel. At this point you have your raw data extracted into a structured input. To get the mean out of that, you apply a mean function, and you do that per key: you take each location and apply the mean function to it, so you get the mean temperature for each location. In the end you want to write your results, so you just apply an output; there are many sources and sinks in FlumeJava, and one of them writes to Bigtable, but you can write to many different places as well.

That's what the FlumeJava API looks like. One thing you'll notice is that we're not thinking about MapReduce anymore; we're thinking about PCollections and how we can apply different transformations to a PCollection, and under the covers FlumeJava transforms this into multiple, optimized maps and reduces.
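FlumeJava itself is Google-internal, but since the Beam Java SDK inherits its API (as the talk explains later), we can sketch roughly what that pipeline looks like in code. Treat this as a hedged sketch, not the talk's actual code: the CSV field layout, the bucket paths, and the use of a text sink instead of Bigtable are assumptions made up for the example.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Mean;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MeanTemperature {

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // PCollection of raw sensor events, one CSV line per event (path assumed).
    PCollection<String> events = p.apply(TextIO.read().from("gs://my-bucket/sensor-events*"));

    // ParDo ("parallel do") runs the parsing function on many workers in parallel.
    PCollection<KV<String, Double>> locationTemps = events.apply(
        ParDo.of(new DoFn<String, KV<String, Double>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String[] fields = c.element().split(","); // assumed format: location,temperature
            c.output(KV.of(fields[0], Double.parseDouble(fields[1])));
          }
        }));

    // Mean.perKey() is the "apply a mean function per key" step: one mean per location.
    PCollection<KV<String, Double>> meanPerLocation = locationTemps.apply(Mean.perKey());

    // Format and write the results (a text sink here; the talk's example writes to Bigtable).
    meanPerLocation
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Double> kv) -> kv.getKey() + "," + kv.getValue()))
        .apply(TextIO.write().to("gs://my-bucket/mean-temperatures"));

    p.run().waitUntilFinish();
  }
}
```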
That's FlumeJava. After FlumeJava, people started using it within Google and processing a lot of data, but at some point, when data gets really big, you need to divide it into chunks. One way of doing that is to divide it per hour, or per day, or per week. For example, if you have some kind of log data, you can divide your logs per day, so you have logs for Tuesday, Wednesday and Thursday, and you process each day separately. That's one way of dealing with big data, and it's called batch processing: you have lots of data, you divide it into smaller chunks, and you run some processing on each chunk to get your result.

Batch processing works in some cases. For example, if you're doing payroll processing, payroll usually happens once a month because people usually get paid once a month, so it's perfect for batch processing. In other use cases it doesn't work at all. Imagine you want to do anomaly detection: you want to be able to detect when someone is trying to break into your system. If you do batch processing and only look at your data at the end of the day, it's too late, because the person already broke into your system; you want to detect it right away so you can take action. So batch processing doesn't always work, because of latency. The second scenario where batch processing doesn't work well is sessions. At Google we have users, and with users we have sessions. Sessions have gaps, because people come online, do some stuff, and go away; and secondly, sessions don't break down into days in a nice way. If you look at the users, there are gaps here, and maybe no gap there, so trying to do batch processing on session data becomes really complicated.

So what we have is continuous data coming into our system, and we want to be able to process that data as it comes in; that's the reality we live in today. What can we do? One way of dealing with this problem is to create two separate streams from your data. In one stream you do stream processing: as data comes in you process it and you get some results, but these results will be approximate; they won't be complete, because you don't have the full data yet, so you can only give an estimate. Data comes in, you produce some estimates, and as more data comes in you update your estimates. In the other stream you do batch processing: you feed the data into a batch pipeline, store it for, say, one day, and at the end of the day you calculate the full result for the whole day. This way you have estimated data from the stream processing pipeline and accurate, final data from the batch processing pipeline. This kind of works, but as a software engineer I don't like this model, because instead of dealing with one flow of data you're now dealing with two separate streams; you're doing batch processing and stream processing on probably different kinds of machines, probably with different teams, so you're basically duplicating your work.

At Google we also realized we had to do stream processing, so people created something called MillWheel; there's a big data paper on this as well that you can read. Basically, just like FlumeJava, MillWheel provided high-level constructs to define a streaming system and took care of all the details of running that streaming system for you, so you don't have to worry about that yourself. When you look at streaming, certain things are really easy to do. For example, if you want to apply some filtering, an element-wise transformation, you can easily do that: the data comes in, you have a function to run, you run it; if the element needs to be filtered out, it's filtered out, and if not, it goes through. That's very easy. But if you want to do some kind of aggregation, if you want to calculate something, then you have to chunk the data, because in streaming systems the data is continuous and unbounded; if you don't chunk it, you'll wait forever to calculate anything. At some point you have to decide how you want to chunk the data so that you can compute something over each individual chunk. For example, we can chunk the data per hour: every hour we have a window, and we apply the aggregation function to that window and produce some results.
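As a hedged aside, the easy element-wise case really is a one-liner in this model. Building on the locationTemps collection from the earlier sketch, with a made-up 30-degree threshold (Filter comes from org.apache.beam.sdk.transforms):

```java
// Element-wise filtering runs element by element as data arrives; no windowing
// or state is needed. Keep only readings above an arbitrary 30-degree threshold.
PCollection<KV<String, Double>> hotReadings =
    locationTemps.apply(Filter.by((KV<String, Double> kv) -> kv.getValue() > 30.0));
```

The aggregation case is the one that needs windows, which is where the talk goes next.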
Now, you could do this chunking as you see the data, using the time each event arrives at your system, but usually that's not what you want. Usually you want to use the event data, the time the actual event happened, in your aggregations. Imagine you have your mobile phone and you're playing a game, and you send an event but you're offline; you can only send that event later, and by the time it gets to the server it can be hours or even days afterwards. So you want to use the time the event actually happened rather than the time it took to get to your system. There's this notion of processing time and event time: event time is when the event happened, and processing time is when the event got to your system. You usually want event time in your aggregations, but you receive these events out of order; there's no guarantee that events arrive at your system in order. So what you have to do is reshuffle your events so that they end up in the correct windows, and then you can do aggregations within those windows. Sessions are still problematic, because with sessions you have users that come online and go offline, so there are gaps, and you have to do this reshuffling for sessions as well: a user might be offline, and when they come back online they send an event, but it might be an event for yesterday, for example. So you have these gaps to deal with.

At this point we need to talk about what's called the watermark. There's the event time I talked about and the processing time I talked about. If all the events arrived at your big data system right away, if there were no latency, then everything would lie on an ideal line: the event happens on your phone, it reaches your big data system immediately, and every point sits on that ideal line. In reality we never have this, because in reality we have networks and we have latency, so between the time the event happens and the time it gets to your system there's a time difference. In reality the line looks different: sometimes it's close to the ideal, and sometimes there's a skew between the ideal and the time you actually receive the event. That line is called the watermark in streaming systems. The watermark is basically an estimate the system keeps to track how late your events are. Because in streaming the data is unbounded, you need some notion of how late your data is, so that you can use that information to decide when to emit results. So in big data processing, and in Dataflow in general, there is this notion of a watermark that the system keeps track of and uses as an estimate of how late things are.
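To make the event-time idea concrete, here is a hedged fragment, assuming a Pipeline p as in the earlier sketch, a bounded input, and a made-up CSV layout of user,team,score,eventTimeMillis. The parse step attaches the timestamp carried inside each event, so downstream windowing groups elements by when they happened rather than when they arrived:

```java
// Read raw game events (path assumed) and emit each one with its event-time timestamp.
PCollection<String> rawEvents = p.apply(TextIO.read().from("gs://my-bucket/game-events*"));

PCollection<KV<String, Integer>> teamScores = rawEvents.apply(
    ParDo.of(new DoFn<String, KV<String, Integer>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String[] f = c.element().split(",");
        // outputWithTimestamp sets the element's event time; windows and
        // watermarks downstream are computed against this timestamp.
        c.outputWithTimestamp(
            KV.of(f[1], Integer.parseInt(f[2])),
            new org.joda.time.Instant(Long.parseLong(f[3])));
      }
    }));
```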
So we have different trade-offs at this point. If you want completeness, if you want full and accurate results, you need to do batch processing; there's no way around it, because you need to see all the data to calculate the complete result. If you want low latency, if you want to see results right away, then you need to do streaming, because in streaming you process things as they come in, but then you give up a bit of completeness, because you don't have all the data yet. You also need to think about cost: if you want to process things as fast as possible, you need a lot of machines, which becomes expensive; with fewer machines it's cheaper, but things will be delayed. And then there's this fundamental question: streaming or batch? It seems you need to decide that up front.

But why do we need to decide that up front? Can't we have both at the same time? The answer is yes, and that's why we have Dataflow. The whole point of Dataflow is that it's one unified model that tries to unify stream and batch processing, so that you don't have to make that decision up front; you can use the same model for both, and we'll take a look at that. In Dataflow there are four basic questions we're trying to answer. The first is: what are you computing? Are you doing sums, are you doing aggregations, what are you trying to compute? The second is: where in event time are results calculated, in other words how does event time affect the results? The third is: when in processing time are results materialized, in other words when does your pipeline actually emit results? And the last one is: how do refinements relate? If some data comes in late, what do you do with it: do you add it to your result, or do you discard it? Let's go through these in order and see what they mean.

In terms of what you're computing, Dataflow provides some primitives. The first is the element-wise transformation: you take some elements and transform them into other elements in some way, and you can do this in parallel; you wrap it in a ParDo and it runs in parallel on multiple machines. The second thing you can do is aggregation: you take a lot of elements and aggregate them into fewer elements, using GroupByKey or Combine, and there are some others as well. There are also composite transformations, where you combine element-wise transformations and ParDo with things like Count, so you can compose things together. We'll take a look at some examples of this.

Imagine we have a game that we're playing on our mobile phones, it produces some scores, and we want to calculate the total score of our team using Dataflow. Let's take a look at how that looks in the code. First we read the events, the log lines of our game, from somewhere, and we get a PCollection. Remember that FlumeJava also had PCollections; Dataflow uses the same notion, and actually most of Dataflow's APIs are either influenced by or directly copied from FlumeJava. So you get a PCollection of your data. Next you need to parse the data: just like before, you have a parsing function that takes the data and extracts team/score pairs from it, and you wrap that in a ParDo so it runs in parallel. Then you do an aggregation: in this case we're aggregating per key, and the key is the team, so we're combining all the scores for a team and doing a sum over them. That's what we're calculating here, and this way we end up with the teams and their total scores. As you can see, this is very similar to what we had in FlumeJava, and that's not an accident; we reused a lot of that in Dataflow.
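As a hedged sketch of that pipeline in Beam Java terms, reusing the timestamp-aware teamScores collection from the earlier fragment (the output path is a placeholder):

```java
// Per-key sum of all scores: one total per team (classic batch over the whole input).
PCollection<KV<String, Integer>> teamTotals = teamScores.apply(Sum.integersPerKey());

// Format and write the totals; a text sink is assumed here.
teamTotals
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<String, Integer> kv) -> kv.getKey() + "," + kv.getValue()))
    .apply(TextIO.write().to("gs://my-bucket/team-scores"));

p.run().waitUntilFinish();
```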
Let's look at an example. This axis is event time, this is processing time, and this is our imaginary ideal line, where ideally all the events would live. In reality that doesn't happen, so you can see events all over the place. If you look at this point, the event happened around 12:06 but got to our system around 12:07, which is not bad; it's close to the ideal. But if you look at this other point, it happened at 12:01 and arrived at our system at 12:08, so it's quite delayed.

If you want to calculate the sum for our team, in typical batch processing you basically go bottom-up: as you see points you add them up, and when you've seen all the points you have the result, which you emit at the end. So we go from the bottom all the way up, and at the end, when we've seen all the data, we emit one result. This works, but the problem is that you have to see all the data before you can emit a result; you have to wait for all of it to arrive. That's fine with bounded data, but with unbounded data it's not going to work, because there will always be more data coming in; you can't wait forever.

In that case what you can do is windowing: you take your data and break it into smaller chunks so that you can do aggregations over them. Windowing is optional in batch processing, on bounded data, but it's not optional on unbounded data, because otherwise you'd never produce any results. There are different kinds of windowing you can do. There are fixed windows, which are just windows of a fixed size; sliding windows, which are fixed-size windows that overlap each other; and session-based windows, which are per-user windows that can have gaps in them. To apply windowing in Dataflow, you take the same code but apply a windowing function before you produce your result. You say "window into", and in this case I'm using fixed windows and specifying a duration of two minutes, so I'm dividing my data into two-minute chunks.

With this, we get something like the following: we're dividing our data into two-minute chunks and producing four results instead of one, one result per window. This is good: instead of one number we have four, so the results are more refined. But we still have the problem of not emitting anything until we've seen the last of the data, so it doesn't really solve all our problems yet. To control how to emit data sooner, we can use the watermark. Remember, the watermark is the heuristic the system keeps to track how delayed the events are, so we can use it as a guide to figure out when to emit results. To do that, we change our code: we're still doing windowing, but now we trigger as we pass the watermark. By the way, this is the default in Dataflow, so even if you don't specify it, it will trigger at the watermark. With this, our pipeline looks like this: forget about the first diagram, because it assumes a perfect watermark, which we never have. If you look here, you'll see that, for example, as soon as we pass the watermark we emit 5, and as soon as we pass the watermark here we emit 22; in the last window we wait longer because the data arrives later, but we still emit the result as we pass the watermark. This is much better, because 5, for example, is emitted much earlier than the others, so we don't have to wait for all the data before emitting results.
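In code, the windowed version only adds one step before the per-key sum. A hedged fragment that slots into the team-score sketch above, using the talk's two-minute fixed windows (windowing classes come from org.apache.beam.sdk.transforms.windowing, durations from Joda-Time):

```java
// Divide the scores into two-minute fixed windows based on their event time.
// With no explicit trigger, each window's result is emitted when the watermark
// passes the end of that window, which is the default described in the talk.
PCollection<KV<String, Integer>> windowedScores = teamScores.apply(
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2))));

// Now the per-key sum produces one total per team per two-minute window.
PCollection<KV<String, Integer>> perWindowTotals = windowedScores.apply(Sum.integersPerKey());
```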
We don't have to wait for all the data to emit results now, so that's good. But the problem in this case is that we're dropping this one, the 9: we passed the watermark, and then this data arrived much later, so we just dropped it. That's not so great. Thankfully, Dataflow has something else: early and late firings. You can specify that you want to trigger a result at the watermark, but also trigger some early firings, which are speculative results, so you can see results as they come in; and you can also specify late firings, which say what happens if some data arrives late. In this case we're triggering at the watermark, but we're saying we'll do early firings with a period of one standard minute, so every minute we emit a result, basically as a progress indicator to tell people how we're progressing towards the final result. We're also adding a late firing, this time using an element count: if one piece of data arrives late, do a firing for that as well. With this, we get something like the following: we do an on-time firing of 5, and for this window we first do an early firing of 7, then another early firing of 14, and finally, as we pass the watermark, an on-time firing of 22. This way you can see how your windows are progressing through the early firings, and at the same time, when the 9 arrives late, you don't drop it; there's a late firing, so it gets included in your result as well.

The last question is how refinements relate: if something arrives late, what do we do, do we drop it, do we add it to the existing sum, that kind of thing. In this case we add something called accumulating fired panes, which means that when there's a late firing, it's included in my results rather than dropped. There are many other options as well; this is just one example. Once we add this, when the 9 arrives late it gets added to our result: there will be a firing, the pane will trigger the sum, and the 9 is added to the result.

If you think about it, we started from classic batch, then batch with fixed windows, then streaming with a watermark, then streaming with early and late results, and then streaming with accumulation. Usually you'd implement these with different kinds of systems, but with Dataflow, by just changing a couple of things here and there in the model, you can do these very different kinds of data processing in many different ways. That's what Dataflow is.
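Putting the triggering and accumulation answers together, here is a hedged fragment with the behaviour described above. The 30-minute allowed lateness is an assumption (some bound has to be chosen once you configure triggers), the trigger classes come from org.apache.beam.sdk.transforms.windowing, and as before this slots into the team-score sketch in place of the plain windowing step:

```java
PCollection<KV<String, Integer>> triggeredScores = teamScores.apply(
    Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(
            AfterWatermark.pastEndOfWindow()
                // Speculative (early) results roughly every minute while the window is open.
                .withEarlyFirings(
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(1)))
                // Fire again for every element that shows up after the watermark.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // How long to keep a window around for late data (assumed value).
        .withAllowedLateness(Duration.standardMinutes(30))
        // Accumulating panes: a late 9 is added to the window's running sum
        // instead of being emitted on its own or dropped.
        .accumulatingFiredPanes());

PCollection<KV<String, Integer>> refinedTotals = triggeredScores.apply(Sum.integersPerKey());
```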
Now let's talk about the transformation of Dataflow into Apache Beam. So far I've described the Dataflow model. There's also the Dataflow SDK, a Java SDK and a Python SDK; that's roughly the client side of Dataflow. And there's something called the Google Cloud Dataflow service: this is the service that runs in the cloud, it's fully managed, and what you do is create your pipeline, give it to the Dataflow service, and it runs it for you. It creates the machines, runs things in parallel, and once it's done it writes the results and shuts down your cluster. It's a service where you can run these models.

What we did at the beginning of 2016 was donate the model and the SDK to the Apache Software Foundation, and that created what's called Apache Beam. Everything I talked about, the Dataflow model and the Dataflow SDK, is now Apache Beam; Dataflow is only a service now, and you write your pipelines with Apache Beam and run them on Dataflow. The great thing about Apache Beam is, first of all, that it has everything Dataflow had, now called the Beam model. It has SDKs: a Java SDK and also a Python SDK that's almost as good as the Java one; it's not a hundred percent there yet, but it's getting there. For me, though, the most exciting part of Beam is that it has runners for different engines: you can write a Beam model and run it against Dataflow like before, but you can also run it against Flink and Spark, and you can run it locally as well. It gives you one model that you can run against different engines, whether that's Apache Flink, Apache Spark, or Dataflow.

The vision we have for Beam is that you write your model in any language you want, that gives you a pipeline, and you run it against any runner you want and get your result. The reality today is that you have a Java SDK, and a Python SDK that's about 80% there; it still needs some work on streaming. Once you write your model, you can pass it to Cloud Dataflow, of course, but you can also pass it to Spark, Flink, and Apache Apex, and there's another runner coming up. The runners themselves are only in Java right now, but a Python runner is also being written. This is a community project now: Google is contributing to it, of course, but there's also a bunch of companies contributing the different runners, and if you want to see an SDK supported in some other language, you're welcome to contribute as well.

All right, let me quickly show you how this looks. What I want to do is take a really simple example, run it locally, and then run the same example on Google Cloud Dataflow, to show you how portable things are. Let me make this full screen. If you go to Beam's website, there are quickstarts for Java and Python. If you look at the Java quickstart, there's a WordCount: it looks at some files, counts the words in those files, and saves the result, so it basically produces words and the number of times each word occurred. There's a Maven target you can run, and this is some of the code. It's a little hard to see, but this is the main method: in the main method you create a pipeline with some options, and then you define what happens in your pipeline. There are certain steps: you read the lines, you count the words, you map elements with a formatting function, and you write the counts. So it's basically read, do some processing, then write. If you look at counting the words, it does what I showed you before: it takes the lines and extracts the words from them in a parallel way using ParDo, and then we format the results, putting each word and its count into a "word: count" text format.
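For reference, this is roughly what the quickstart's WordCount boils down to; a simplified, hedged sketch rather than the exact example code, with the sample input path and transform names chosen for illustration. Note that nothing in the code picks an engine; that comes from the pipeline options.

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCountSketch {

  public static void main(String[] args) {
    // Runner, input, and output can all be overridden with command-line flags.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"))
        .apply("ExtractWords", FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply(Filter.by((String word) -> !word.isEmpty()))
        .apply(Count.perElement())
        .apply("FormatResults", MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply("WriteCounts", TextIO.write().to("counts"));

    p.run().waitUntilFinish();
  }
}
```

With no runner flag, the quickstart runs this on the local DirectRunner; passing --runner=DataflowRunner together with a project and Cloud Storage paths sends the same pipeline to the Dataflow service, which is what the rest of the demo does.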
That's what the code looks like, and you can run it locally. There's a local runner you can use to run your app, so I just run the Maven command: WordCount is what I'm running, and the runner is the DirectRunner, meaning a local runner. Now the pipeline is running locally on my machine, and if we take a look at the folder, we'll see that we have some counts. If we look at these counts, these are the counts we just calculated from a single file: words and how many times each occurred.

So this is running locally, and you can take the same pipeline and run it against Spark or Flink or Dataflow. If you go to the quickstart, it shows you all the different ways of running it against the different runners. For Dataflow, we just copy and paste the Maven target, and when we run it there are a couple of things to note: it's the same WordCount, but I'm running the DataflowRunner, I'm pointing it at my project, and I'm pointing it at the location of the files I'm processing and the output location where I'm going to save the results. On Google Cloud you save results to Cloud Storage, so I have my files in Cloud Storage, I read them from there, and once I'm done processing I save the results back to Cloud Storage. I run this, and now it's submitting a job to Google Cloud so that it can run there. If you go to the Google Cloud console and look at Dataflow, you'll see the jobs that have been submitted, and this is the one I just submitted, running my job. If you click on it, you see your pipeline: read the lines, count the words, format them, and write them. You can also get some stats about your job, like how many elements you're processing and so on. And if you go to Google Cloud Storage, there's a bucket for my app, and under this bucket you see some counts; if you click on them, you'll see the counts of the words. The beauty of this is that it all happens in the cloud, on multiple machines in parallel, but I don't have to deal with machines, I don't have to deal with clusters; Google Cloud does the work for me. All I do is define my pipeline and my model and let it run, and in the end I have my results. So that was just a quick example to show you.
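For completeness, here is a hedged sketch of what pointing the same WordCount at Dataflow amounts to when done programmatically rather than through the Maven command line. The project ID and bucket are placeholders, and the Dataflow runner dependency is assumed to be on the classpath.

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunOnDataflow {

  public static void main(String[] args) {
    // Same effect as passing --runner=DataflowRunner --project=... --tempLocation=gs://...
    // on the command line, as the Maven-based demo does.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-gcp-project");            // placeholder project ID
    options.setTempLocation("gs://my-bucket/temp");  // staging/temp files go to Cloud Storage

    Pipeline p = Pipeline.create(options);
    // ... attach the same ReadLines / ExtractWords / Count / WriteCounts steps as before ...
    p.run();                                         // submits the job to the Dataflow service
  }
}
```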
The last thing I want to talk about is this: when you think about it, we're basically trying to build an abstraction layer on top of Spark, Flink, Dataflow, and other engines; we're trying to combine all of them into this single unified model. How do we do that? There are a couple of ways. One is to look at Spark and Flink and Dataflow and say, OK, here's what all of them can do, let's just take the intersection of their functionality, call that Beam, and be done with it. That works, but it would be very limited, and we want Beam to be much more than just the intersection of these products. Another way is to look at all of them and combine everything into Beam, so that Beam becomes a kitchen sink of all these products. That's not good either, because there are certain things in Spark that don't make sense in Dataflow, and certain things in Dataflow that don't make sense in Flink; if you combine all of them into a single model, it won't feel natural, so that's not going to work either.

What we're trying to do is something in between, where these different products and the Beam model influence each other: the Beam model influences Spark, and Spark influences the Beam model as well. We're trying to shape them so that, in the end, people will have no reason to use just Dataflow or just Spark or just Flink; they'll just use the Beam model, because it's rich enough to support all these different engines, and then you have everything you need in Beam: you just pick your runner and decide where you want to run it. That's the thing we want to achieve. There's also a page with a capability matrix that tells you, for the features in the Beam model, which ones are supported in Flink, Dataflow, Spark, Apex, and so on. If you want to write a pipeline, it's good to check what is supported where, and it's always being updated as people work on things.

So, to wrap up: I showed you at the beginning that we had two sets of innovation, one that led to Dataflow, which I talked about, and one that led to things like Hadoop and Spark, which you can run with Cloud Dataproc. With Apache Beam we're basically trying to combine these two worlds into a single model, and that's what I like about Beam: it tries to unify things rather than divide them. Hopefully in the future it will be a single model that people use no matter where they're coming from. All right, that's all I have. Thanks for your attention. If you want the slides, this is my Twitter; if you want to know more about Dataflow, this is the link; this is the Beam page, and there's also an Apache Beam Twitter account. [Music]
Info
Channel: GDG Lviv
Views: 8,324
Keywords: DevFest Ukraine 2017, GDG, DFUA, DevFest, Cloud, apache beam, dataflow google, google dataflow tutorial, big data processing google, apache spark google, apache flink google
Id: RxHijHZd0oM
Length: 38min 54sec (2334 seconds)
Published: Wed Oct 25 2017