Apache Beam: Portable and Parallel Data Processing (Google Cloud Next '17)

Captions
[MUSIC PLAYING] FRANCES PERRY: So good afternoon. My name is Frances Perry. I'm a software engineer at Google, and I'm also on the Project Management Committee for Apache Beam. Now today I'm going to give you an introduction to Apache Beam, which is a new open source project focused on both batch and streaming use cases. Now when you're using Beam you're really focusing on your logic and your data without letting details from the runtime abstractions leak through into your code. And what that does is give you a very clean separation. That means that your Beam pipeline, once you've written it, can be run in multiple execution environments. Everything that you know and love, whether it's Apache Spark, Apache Flink, or Google Cloud Dataflow.

Now to put Beam in the context of the surrounding big data ecosystem, let's take a quick look at the evolution. So what you see here is that original MapReduce paper in 2004. And this paper fundamentally changed the way we do distributed data processing. Now inside Google we took that technology and we continued to innovate. Publishing papers, building systems, but really not sharing the code broadly. In 2014, we created Google Cloud Dataflow, which is a part of Google Cloud Platform and based on all this history internally. And this is basically both a programming model and SDK for writing data parallel processing pipelines, and also a fully managed service for executing them. Now meanwhile, the open source ecosystem took that MapReduce paper and created Hadoop. And an entire ecosystem flourished around this. Beam is essentially bringing these two streams of work back together. It's based on that same programming model, but generalized and integrated with the broader big data ecosystem.

So today I'm going to go into more detail on two key features of Beam. The first is the programming model. These are the abstractions at the heart of Beam that let you express these data parallel operations in a way that works over both batch and streaming data. In addition, I'm going to talk about the portability infrastructure in the Apache Beam project. Now this is what lets you execute the same Beam pipeline across multiple runtimes. After that we're going to get a little more concrete, and I'll show you how these concepts work in practice. There will be a demo of the same Beam pipeline running across Apache Spark, Apache Flink, and Cloud Dataflow. And then I'll talk a little bit about Talend and how they're using this flexibility that Beam provides in production. And finally at the end, some pointers for getting started with using Apache Beam.

All right, so we'll start with that overview of the Beam model. Now if you've been to some of the other talks over the last few days, you may have heard this example used a little bit. I think having a clear example for explaining difficult concepts is really important. So we love this example. You'll hear about mobile gaming quite a bit when you're in a Dataflow talk. So I'll keep this version brief. If you want to learn about this in more detail, you can go to some of the other talks and watch the videos. So what we're going to be doing here is pretending that we've just launched a mobile game. So we've got users all around the globe, on their mobile phones, doing some sort of task to earn points for their team. Every time they score a point, this sends information back to our server logs and we're aggregating that data. We want to look through those server logs and learn some patterns about our users. 
So here we have a graph of some sample data that we're going to be working with. On the x-axis you have event time. This is basically the time a user scored some points. And on the y-axis you have processing time. This is the time that score-- that event-- arrived in our system for processing. Now, what you'd expect is that everything appears along this diagonal line. If distributed systems weren't so gosh darn distributed, things would arrive in our system immediately. We'd be able to process them and move on. But in reality of course things don't work that way.

So let's take a look at the score of three here. This is pretty decent. This score is only slightly delayed. It happens just before 12:07, arrives in our system just after 12:07 for processing. So this user's information was able to get into the system pretty quickly. However, over here, the score of nine looks about seven minutes delayed. It happened just after 12:01, but doesn't arrive in our system for processing until about 12:08. So perhaps this user was playing our game on the elevator, in the subway, somewhere where they had a temporary lack of network connectivity. So it took a little bit for their phone to reconnect to the network, and for this information about the points that they'd scored to be able to make it back to our system. So this graph of course isn't even big enough to handle what would happen if we had a real offline mode for our game, where we've got users on the middle seat of some transatlantic flight playing our game for hours over the ocean. There we'd have to wait for that plane to land, that user to come out of airplane mode, and reconnect to a network before we get events, potentially hours and hours delayed.

So these types of infinite and out-of-order datasets can be really tricky to reason about unless you know the questions to ask. So the Beam model really revolves around four key questions. The first one is, what results are calculated? This is your business logic. These are the algorithms that you're performing over your data, whether it's sums, joins, histograms, machine learning models. Whatever you're actually doing. Where in event time are results calculated? This talks about how the time the event actually occurred affects the results. Are we aggregating things over all event time? Or are we trying to do it in slices like hourly windows, daily windows, or even bursts of user activity sessions? When in processing time are results materialized? This talks about how the time the elements actually make it into the system affects results. So how do we know when we've seen all the data for a given time period, and we can go ahead and emit a result? What do we do about data that comes in late, when those transatlantic flights land and we get more data way after we thought we were done? And a final question. How do refinements relate? This talks about what we do when we choose to emit multiple results as we get higher and higher fidelity. Do those different versions of the result build on each other, or are they distinct?

So I'm going to take these questions, and we're going to build up a pretty simple data processing pipeline to use them to calculate hourly scores for the teams playing our mobile game. So here we have a code snippet from a pipeline that's processing those scoring results. So we've already gone ahead and parsed the data into some sort of structured key value form, where we've got the team name and the number of points that were just scored for that team. 
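The snippet on the slide isn't reproduced in the captions, but the idea is roughly the following in the Beam Java SDK. GameActionInfo and its getTeam() and getScore() accessors are stand-ins for the parsed form being described; this is a sketch, not the code shown on screen.

    // "input" is assumed to be a PCollection<GameActionInfo> of parsed score events.
    PCollection<KV<String, Integer>> teamScores = input
        // Turn each event into a (team, points) pair.
        .apply("ExtractTeamScore", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via((GameActionInfo info) -> KV.of(info.getTeam(), info.getScore())))
        // The "what": sum the points per team.
        .apply("SumPerTeam", Sum.<String>integersPerKey());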
So in yellow you can see the what question being answered. This is what we're actually computing. And here we're taking those key value pairs and we're summing the integers per key, or per team, to get the total score for that team. So let's take this code snippet and go ahead and run it as a standard batch algorithm. Now in this looping animation, you can see the gray line is representing processing time. As the pipeline's executing and encountering elements, it's accumulating those elements into that intermediate state that's trailing the line. And when processing completes, the system emits the result in yellow. So this is our final result. And you can see here that we're aggregating things regardless of event time. We're just sort of taking all our data in one chunk and processing it. This is basically pretty standard batch processing. But as we dive into the remaining questions we'll see this start to change.

So we'll start by playing with event time. Here we're specifying a windowing function that says we'd like that same algorithm calculated independently for different slices of event time. So for example we could do this every minute, every hour, every day. And in this case, we're doing it in two-minute windows. And so what we're going to see when we go and run this is that now we're calculating four distinct results for our sample data, based on the event time of when the event actually occurred. Now at this point in this diagram, though, we're still waiting for the entire computation to complete before we emit any results. So that works fine for a bounded data set, when you have all the data that you're ever going to have and you know that your processing will eventually complete. But it's not going to work if we're trying to process an infinite amount of data. In that case, what we want to do is start reducing the latency of individual results. And we do that by asking for a result to be triggered based on the system's best estimate of when it's pretty much seen all the data. We call this estimate of completion the watermark.

So now we have the same graph we've been using before, showing the watermark in green. And it's basically estimating when the system thinks it has all the data. And you can see that as soon as that watermark passes the end of the window, we go ahead and give you the result for that window. But again, the watermark in a distributed system is really often going to be just a heuristic. So here you can see that the watermark is too fast. We think we've seen all the data for that first window. We go ahead and emit the results. And then that late score that happened in the elevator or the subway comes in. And that user doesn't get to contribute that really awesome score of nine points to their team's total. But we don't want to be too slow either. On the other hand, if we were too conservative with the watermark, we'd find that we'd be introducing unnecessary latency. It's really hard to tell when that last plane is going to land, and that last user who was playing at one o'clock this morning is going to turn off airplane mode and send us their data. So if we're trying to be overly conservative, we'll find we introduce unnecessary latency. So what we can do about that is use a more sophisticated trigger. And so in this case, we're asking for both speculative early firings as data is still trickling in, and also updates to our result if late elements arrive. 
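In code, those windowing and triggering answers are just a few lines wrapped around that same sum. A rough sketch with the Beam Java SDK follows; the two-minute windows match the diagrams, while the one-minute early-firing delay and the allowed-lateness value are made up for illustration.

    PCollection<GameActionInfo> windowed = input
        // "Where" in event time: two-minute fixed windows.
        .apply("WindowScores", Window.<GameActionInfo>into(
                FixedWindows.of(Duration.standardMinutes(2)))
            // "When" in processing time: emit on the watermark, plus speculative
            // early firings, plus an updated result for every late element.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(1))
            // "How do refinements relate": accumulate, so each firing refines the last.
            .accumulatingFiredPanes());
    // The "what" stays exactly the same: extract team/score, then Sum.integersPerKey().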
So once we do this and we've asked for multiple results for a given window, now we have to answer the question about how those results relate to each other. And here we've chosen just to accumulate. So basically each time we emit a new result for this window, we'll get an even more precise answer. So if you look here now, you'll see we get those multiple results for each of the windows. Some windows, like that second one, are introducing those early speculative firings with the incomplete results just as the data arrives, to give us an idea of what's going on. This would be particularly useful if you were windowing daily, for example, and you wanted to check in halfway through the day and see how things are looking. There's one on-time result for each window, when the watermark passes the end of that window and we think, OK, this should be about it. And in that first window we're also getting a late firing when we encounter that late element of nine, to update to a more refined result.

So what I want you to take away from this section is that we had an algorithm, and we had a lot of flexibility with how we ran that algorithm. So here it was a very, very simple algorithm. It was just integer summation, but the same thing holds for much more complex chunks of user logic. You take that user logic and then you tweak just a couple of other lines around it to answer those other three questions. And what that gives you is the ability to cover the spectrum between simple traditional batch and some pretty complex streaming use cases, all in a single framework. So just like MapReduce and the abstractions of map and reduce that it implemented completely changed the way that we as a community think about doing distributed parallel data processing, we hope that the Beam model is going to change the way that people think about batch and streaming going forward into the future.

All right, so with that brief introduction to the Beam model behind us, let's talk about how this model can be portable across multiple runtimes. So the heart of Beam really is this model. These are the abstractions that we've introduced for reasoning about this style of programming. Now on top of this model, the Apache Beam project aims to have multiple language-specific SDKs. Developers have very strong opinions about their language of choice. That's not a battle I want to fight. We just want to be able to give you the options that meet your needs. Next, we have multiple runners for executing that Beam pipeline in the environment that fits your needs. This might be on premise. It might be in the cloud. Might be open source. Might be fully managed. Whatever your current needs are. And in fact those needs are going to change over time as your data evolves, your company evolves, the technologies that you want to use evolve. So we want to give you the freedom to change that, and take that same logic that you've invested so much time in building and be able to run it in those different environments.

Now, this diagram gets a little more complicated, because what each of those runners is doing when it executes your pipeline is calling back into those language-specific SDKs to execute your per-element processing-- your map functions, basically. So what we've done as a project is build out a second layer of internal APIs that really cleanly separate the runners, and how they're organizing and distributing the processing, from the per-element processing that's written in those custom languages. 
And by doing that we make it very easy for the community to evolve these components in parallel. So you can add a new runner without knowing details about each of those SDKs, and you can add a new SDK without knowing the guts of how those runners work. And getting this layer correct is incredibly important, because this is the data-intensive path. This is the path that's happening essentially per element, or per bundle, at a very fine grain. And so you need to be able to get this in the right place and be able to get the performance that people need out of these big data systems.

All right, so this is the Beam vision. This is what Apache Beam is aiming to be. We're not there yet. This is where we are. So currently we have the Java SDK. It's quite mature, and it runs across a number of different runners. So Spark, Flink, Dataflow and Apex are on the main master branch, and we have Apache Gearpump under development. Now the Python SDK has also just merged into the master branch. But we don't yet have that bottom level of the function portability layer in place to allow the Python SDK to run across all the other runners. So currently we only have support for the Python SDK to run at scale in batch mode on Cloud Dataflow. So we're actively working towards building that out.

So let's go into a little more detail about some of these runners. So these three were the original runners that were part of the Apache Beam project. And they're also the three I'm going to be demoing in just a minute. So many of you are probably familiar with Apache Spark. It's a very popular choice right now in the big data world. And it really excels at in-memory and interactive-style computations. Apache Flink is a bit more of a newcomer to the broader big data ecosystem. But it has some really clean and excellent semantics for doing stream processing. And, finally, Cloud Dataflow. This is Google Cloud Platform's fully managed service for data processing. And this evolved along with sort of the technology at the heart of Beam in those years of internal work at Google.

So each of these runners does parallel data processing. But they each do it in a slightly different way. So the question is, how are we going to build an abstraction layer that takes these very different things and generalizes them into one thing? So one option would be to just take the intersection of what all the different runners can do. But that's not going to be nearly interesting enough. That's just going to be a small slice of what data processing is capable of. On the other hand, if we tried to take the union and include all the functionality that all the runners possibly have, we'd end up with a real big kitchen sink of chaos. It'd be really complicated, hard to understand. It wouldn't be this clean, crisp programming model that makes these complex tasks easy. And that's really what we're after. So instead what we're trying to do with Beam is be at the forefront of where data processing is going. Both pushing functionality into, and pulling patterns out of, those underlying runtime engines.

So keyed state is a great example of a feature that existed in some of these engines for a while. It was very useful for stream processing use cases. But it didn't exist in Beam originally. So we've worked with the Beam community to understand how keyed state is implemented across these different runners. And really pull up an abstraction and a notion of this feature that fits cleanly in the Beam model and integrates nicely with our other semantics. 
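For a flavor of what that abstraction ended up looking like, the Java SDK exposes keyed state through annotations on a DoFn. This is a minimal sketch, not the design discussed in the talk; the running-total logic is purely illustrative, and the exact API surface has evolved since this era of the SDK.

    // Keeps a running total per key; the state cell is partitioned by key and window.
    class RunningTotalFn extends DoFn<KV<String, Integer>, KV<String, Integer>> {
      @StateId("total")
      private final StateSpec<ValueState<Integer>> totalSpec =
          StateSpecs.value(VarIntCoder.of());

      @ProcessElement
      public void processElement(ProcessContext c,
          @StateId("total") ValueState<Integer> total) {
        Integer current = total.read();
        int updated = (current == null ? 0 : current) + c.element().getValue();
        total.write(updated);
        c.output(KV.of(c.element().getKey(), updated));
      }
    }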
On the other hand, we also hope that Beam will influence the roadmaps of each of these individual engines. So for example, the semantics that Flink uses right now for their data streams were heavily influenced by a lot of the early work we did on the Beam model. This also means, though, that at times there may be a divergence between the general power of the Beam model and what different runners are currently capable of implementing. Because we're going to pull something into the model when we are convinced it's the right way for data processing to move forward. Not all of the runners may be able to build on that yet. So what we're going to do in the Beam project is track that very clearly on the website, in the capability matrix. And I don't think you should take this as a sign that some of these runners just aren't worth anything. I think what it means is that there are significant use cases that don't need some of these features. Batch processing is something we've all been doing for years. All of these runners are great at that. Some of the more semantic differences between event time and processing time are still really filtering out into the broader big data ecosystem as a whole. And you're seeing these engines make significant improvements towards those things over time. So our goal as a project is really to be establishing the high watermark of what data processing should be, and helping everyone else implement that well.

All right, so with those concepts behind us-- a brief overview of the model and the discussion of portability-- I'd like to get hands on and dive into a demo. So what we're going to do now is swap to my laptop and actually see some Beam pipelines in progress. So what we have here is a very simple batch pipeline that's taking a bunch of those server logs, where each line represents a user scoring points for their team, and we're trying to calculate the hourly team scores. So per hour, windowed by hour, what did each team score across all the users on that team? So you can see here we're starting. We're going to parse some arguments. Then tell the system that we're going to start creating a pipeline. Right here. And then once we do that, we're building the pipeline structure by adding transformations to it. So we'll start by reading newline-delimited text from an input file. And then we'll take each of those lines of newline-delimited text and go off and parse it.

So this is basically the map functions that you know and love. I happen to be using Java 7 here. If I was using Java 8, there are lambdas and things we could be doing as well. But if you dive into the body of this user processing function, you'll see that it's element-wise processing. It's taking a string-- in this case, that would be that line of log text-- and turning it into a more structured format-- a game action info object-- so that we're able to more easily interact with it. So for each element we're going to grab the element, we're going to split it [INAUDIBLE] and try to extract those pieces, and you'll see that at the end there we're going to build that more structured format. Now as we're going, we might encounter a parse error. In fact we will almost certainly encounter a parse error, since I sneakily injected some into the input data. And so what we want to do there is go ahead and catch that exception, increment a counter, log it, and keep going, because we want to build a pipeline that is resilient to these kinds of things-- you can never trust your input data to be as clean as you'd like. 
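A rough sketch of what a parsing function like that can look like is below, assuming a comma-separated log format and a hypothetical GameActionInfo class; this is not the code on screen. The talk's 0.6-era SDK counted errors with Aggregators, while the Metrics counter shown here is the newer SDK's equivalent.

    class ParseEventFn extends DoFn<String, GameActionInfo> {
      private static final Logger LOG = LoggerFactory.getLogger(ParseEventFn.class);
      // Counts malformed lines so they show up in the runner's monitoring UI.
      private final Counter parseErrors = Metrics.counter(ParseEventFn.class, "ParseErrors");

      @ProcessElement
      public void processElement(ProcessContext c) {
        // Hypothetical line format: user,team,score,timestampMillis
        String[] parts = c.element().split(",");
        try {
          c.output(new GameActionInfo(
              parts[0].trim(),                    // user
              parts[1].trim(),                    // team
              Integer.parseInt(parts[2].trim()),  // score
              Long.parseLong(parts[3].trim())));  // event timestamp
        } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
          // Bad input: count it, log it, and keep going rather than failing the pipeline.
          parseErrors.inc();
          LOG.info("Parse error on line: {}", c.element());
        }
      }
    }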
So with that parsing function in hand, let's go back down to the main method. So now you can see we're going to set the time stamps. And what this will do is take that time stamp that we parsed out of the log line and set it as the metadata time stamp on that element. And this allows the system to interact with the time stamp of that element. And it will use that information to do things like adjust the way windowing and triggering behave. Next, we'll go ahead and ask to calculate one-hour fixed windows of our answer. And then we'll ask to run our calculate team scores transform. Now this is a composite transform that we wrote. That basically means it's a subgraph of more operations. And this subgraph is going to take a [INAUDIBLE] collection of those game action infos and return an empty collection-- a collection of Void-- because it's actually going to go ahead and write them straight out to files. And the body here is that we're going to extract the team and the score. And then we're going to sum those integers per team, so that we have the score for that given team. And then we're going to go ahead and write it out to a file, with one file per hour. So that's basically the structure of this pipeline.

So let's pop over for a minute and take a look at the input data. So I'm going to run over a handful of CSV files, each of them 12 to 30 gigs. So a decent amount of data, not a huge amount. And then when we go and execute that pipeline-- here I am running this pipeline on Cloud Dataflow. So what you can see here is the structure of that code I built. So there's our read, our parse, our set time stamps. Here's our composite operation down here, the sum team scores. And you can see it gives us a way to pop in and see that internal structure that we built. This way of expressing the composite structure of your pipeline to your monitoring UI is hugely useful. If you've ever seen systems that are letting you monitor these pipelines, the DAG that you're looking at is like this big and it's completely impossible to make any sense of it. So this basically lets you as a programmer-- you're used to thinking modularly-- this lets you add that structure into your pipeline in a way that the system is able to speak it back to you.

So you can see that this job is running with the Apache Beam SDK for Java 0.6. This is actually the release candidate of next week's release. It ran for nine minutes. You'll see that some of the stages reported that they ran for longer than that. That's because this is summed up over all the workers that were actually executing. So down here you can see Dataflow chose to autoscale to 15 workers for this size of data. If we scroll down, you can also see those parse errors that we were counting are tracked here. So we can see, as the pipeline's executing-- this one has already completed, but while it was running you could see these ticking up as we were encountering those parse errors. So that's this just running on Dataflow. We can go check our output directory for the Dataflow run. You can look and you can see basically we're getting one output file per hour here. We can pop in and see there's the team names. There's the scores for those teams. I made up the team names. So there you go. So that's what the pipeline looked like running on Dataflow. 
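Pulling those pieces together, the main method being walked through is roughly the following in a recent Beam Java SDK. The file paths, the getTimestampMillis() accessor, and the CalculateTeamScores composite are placeholders for the code on screen, and the 0.6-era SDK spelled some of these calls slightly differently.

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read newline-delimited text from the input logs.
        .apply("ReadLogs", TextIO.read().from("gs://my-bucket/gaming/*.csv"))
        // Element-wise parse into the structured GameActionInfo form (see ParseEventFn above).
        .apply("ParseGameEvent", ParDo.of(new ParseEventFn()))
        // Promote the parsed event time to the element's metadata timestamp,
        // so that windowing and triggering can see it.
        .apply("AddEventTimestamps", WithTimestamps.of(
            (GameActionInfo info) -> new Instant(info.getTimestampMillis())))
        // One-hour fixed windows in event time.
        .apply("WindowIntoHours", Window.<GameActionInfo>into(
            FixedWindows.of(Duration.standardHours(1))))
        // Composite transform: extract team/score, sum per team, write one file per window.
        .apply("CalculateTeamScores", new CalculateTeamScores("gs://my-bucket/gaming/output/"));

    pipeline.run();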
But I also want to take that exact same pipeline, and I'm going to run it on two other runners. So over here I have the Spark runner. So I've got a Spark cluster up and I've run that exact same pipeline. You can see that as we dive in here, basically this is Spark's definition of that high-level DAG. So what you're seeing here is basically that original map and parsing phase. Then there's that first combine that comes from the summing across the keys. And this was that second group by key that we're using to organize the output files. So if we go down, of those three phases really the one with all the guts in it is this first one. Because any distributed processing engine worth its salt does as much pre-combining as it can before it starts shuffling. So when you go down and look at the comparative size of these three stages, all the goodness is in here. So now you're starting to see the DAG for this stage that looks more like the code that we were writing. So there's our read, we've got our parse game events here, setting our time stamps, and down to that pre-combine. And if you dig in here, again you'll see the shuffle of data that's being written out is relatively small. If we look into the event timeline a bit, you can see each of the parallel workers setting up their task and going ahead and processing their data. It takes a somewhat variable amount of time. And so on.

So one thing that's a little bit difficult to see in the Spark UI is the amount of input data. So if you look, it looks like none of these stages are actually reading any data, because this is only tracking data read to and from Hadoop and Spark storage. In this case, I'm using GCS to store my data. It doesn't matter which of these runners you're running on. We have a unified I/O API. That means that you can use that same I/O connector across all of the Beam runners. The Beam runners just implement support for the generic Beam I/O, and then they're able to get all of this functionality.

If we swap over to the Flink runner-- so here's the same job again running on Apache Flink. If I pop in, like that, again there's a DAG. Starting to feel familiar. So you can see the DAG is a little bit different. Here we have the read in Flink. There's our parse events. Our combine is coming up over there. Here you can really see the data sizes show up a little better, as the transitions between stages. So there's 109 gigs sent from that first stage in Flink to the second one. You can see the number of records that are involved here a little more clearly. And if you go up and look in the accumulators, again you can see, in that parse events, there's the number of errors that we're encountering. So you're really able to see that this same pipeline is running at scale across all of these different engines.

So at this point, what I've shown you is just that batch pipeline running. I want to also show you a streaming pipeline. So here I have a second main method that's built up. This time it's a streaming pipeline, and it's going to read from Google Cloud Pub/Sub. So it's basically reading an infinite amount of data that will just continue coming and coming. We're going to use that exact same parsing logic that we had before. This time we don't have to set the time stamps by hand. The system is able to gather them directly out of Pub/Sub. And we're going to use a slightly more advanced windowing algorithm, because we want to take this and transform it into streaming mode. We don't just want to wait for everything to complete to get those results. We want that low latency. So here we're going to ask to trigger the windows when the watermark has passed the end of the window, and also ask for some early speculative firings, and that late firing every time we see a late element. 
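A sketch of how that streaming variant differs is below, again with placeholder names and a recent SDK's PubsubIO; the topic path, early-firing delay, and allowed lateness are illustrative, not taken from the demo.

    pipeline
        // Unbounded source: Beam assigns each element's event timestamp from Pub/Sub,
        // so there's no separate WithTimestamps step this time.
        .apply("ReadScores", PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/game-scores"))
        // Same parsing logic as the batch pipeline.
        .apply("ParseGameEvent", ParDo.of(new ParseEventFn()))
        .apply("WindowedTeamScores", Window.<GameActionInfo>into(
                FixedWindows.of(Duration.standardHours(1)))
            // The trigger just described: on-time results at the watermark,
            // speculative early firings, and an update for every late element.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(5)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(1))
            .accumulatingFiredPanes())
        // Exactly the same composite transform as the batch pipeline.
        .apply("CalculateTeamScores", new CalculateTeamScores("gs://my-bucket/gaming/streaming/"));

    pipeline.run();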
But then after that slightly different initial setup, we're going to run that same calculate team scores transform that we've written before. And again, this is a decently simple one so that I could explain it to you. But it doesn't matter what kinds of complex logic you have. You can reuse that same logic across these two modes-- across both bounded, fixed-size data sets and unbounded data sets.

So if we take this, you can pop over and we're going to see it running here in Dataflow. So here's that same pipeline with that same composite transform running down there. And you can see now it's a streaming job. It's been running since yesterday. It's only chosen to use three workers. The autoscaling just stopped at three here, because I'm injecting the data with a single-threaded machine. So there's not a huge load of data. But it's processing nearly 2,000 elements per second. You can see since yesterday it's processed 43 gigs of data. And one of the things that's most important in a streaming pipeline is to keep up with the amount of data that's coming in. As you can see here, we're tracking the data watermark at 15:07. It's 15:08, so that's pretty decent. We're keeping up with real time as we're processing this. All right, so that's the demo that I've got today. So let's go ahead and pop back to the slides.

So now what I'd like to talk about a little bit is how a customer is actually able to use this in production. So Talend is a Google Cloud partner, and they build open source big data integration solutions. So Talend Data Preparation is a self-service solution that enables their customers to access, cleanse, and analyze their data. And it's available as both an open source release and also an enterprise version. They also have a new product in development called Talend Data Streams. And this is a UI that makes it easy to build both batch and streaming data processing pipelines. And it includes functionality like interactive data analysis. So as you're constructing the pipeline, you can see samples-- sort of states of the intermediate collections as you're building them. And it includes other things like schema management and versioning. So Talend is building these products on top of Apache Beam, because they believe that building on Beam will allow them and their customers to follow the evolution of technologies. They'll be able to leverage their data in the same way over time, while still having the flexibility to execute on multiple runtimes as their needs change.

So I'd like to talk a little bit about getting started with Apache Beam. So since you're at GCP Next, I think there's a decent chance that you'd be interested in running this technology on Google Cloud. So what that means is that Cloud Dataflow is likely going to be your choice of runner. Now we've been bootstrapping Beam over the last year or so. So you'll find that when you come to Cloud Dataflow you have a few different options for running. We have the original Dataflow SDK for Java, and this is what seeded the Apache Beam project. So in some sense this is pre-Beam, and it uses the older com.google.cloud.dataflow namespaces. But it's still fully supported. It works great on Cloud Dataflow. You're welcome to use any of the Apache Beam releases on Cloud Dataflow. 
So for Java, that would be the most recent release, which is 0.5-- at least until next week-ish, when 0.6 comes out. And for Python, you'll have to wait for 0.6. It just merged into the master branch, so it hasn't made a release yet. [INAUDIBLE] if you need it. And finally, we encourage you to take a look at the next generation of Dataflow SDKs. These are based on Apache Beam. And what they do is redistribute the portion of the Apache Beam ecosystem that's most commonly used within GCP, all in a single package, all in a way that's been tested to work well together. So this is just a great way of getting started with that broader ecosystem when you're running on GCP.

Now all of these Beam versions-- you'll notice those version numbers all come with some sort of caveat. The Beam project has not finalized its API surface yet. It's been a huge amount of churn to come from the original Dataflow SDK to this. We had to rename the packages. Every file moved from Google Cloud Dataflow over to Apache Beam. So as a user, the kinds of changes are not that intense, but they're still pretty far reaching. So we've been getting all of those front loaded. But the community has not yet finalized that they've completed that. So that's actively under discussion in the community. I hope that by maybe early April we'll be able to declare a stable release and guarantee that major versions will not be making backwards-incompatible API changes. But other than that, these SDKs are fully ready to run on Google Cloud Dataflow. So if you're interested, you can go and check them out. There you go.

So this talk is at the end of a wonderful conference. There were a number of other sessions during this conference that were related to Apache Beam. If you didn't catch these in person, I really encourage you to go back and find the videos. Tyler's talk from yesterday goes into more details on the Beam model-- sort of the evolution and how it unifies batch and streaming. The insights talk is really about how to build a data processing pipeline across all of the technology in Google Cloud Platform. And it includes the use of Beam and Cloud Dataflow to do that. And then finally there was the talk this morning that was really about the building of the Beam community. So we took something that was a single-company proprietary project, and we built a thriving open source ecosystem around it, and a whole new project under the umbrella of the Apache Software Foundation. So really that talk is about the journey of both the code and the developers involved, and how we kick-started that community.

So of course you should go to the Beam website. We've got a lot of information there. You can use it to get started with any of those runners that I showed you today. There's also some great podcast videos, other links for learning more about the Beam model, and also of course information on how to get involved-- how to come join the developer mailing list, how to start contributing to the project. Tyler wrote some articles for O'Reilly called "The world beyond batch." These are also a great way to learn more about the Beam model if you need some airplane reading for your way home. And then finally, if you want to start using Apache Beam on Cloud Dataflow today, you can come and check out the documentation. We'll have pointers there. You'll see the site is in transition. So we still fully support the original pre-Beam releases, but we also have the new Beam ones coming up alongside those. All right, so with that I'd love to take some questions. 
AUDIENCE: Hi. My team, we do a lot of stuff in BigQuery. But we're hitting a couple of limitations here and there, and looking to merge, sort of, stream and batch directly against big storage files. But for equivalent operations, have you done any performance benchmarking of BigQuery versus Beam on Dataflow?

FRANCES PERRY: You know, I don't have the data on that directly, but Sam McVeety, who's in the back corner there, might be able to help you out on those performance questions.

AUDIENCE: Thank you.

AUDIENCE: So on my team we heavily build on Spark using Scala because of the REPL. We liked actually going through the building process and seeing how the data is being transformed at every stage. And that's something why we have stuck with Spark. Does Beam have any of those sorts of capabilities in the pipeline, where you can actually just interact with your pipeline as you're building it, similar to Spark?

FRANCES PERRY: So I think interactive is one of those places where, as we work on sort of pulling the best patterns out of the underlying runners-- interactive is something that's not easily done in the Beam model right now, but I think there's definitely been a lot of interest in it. So I'd love to see the community moving in that direction. There is, though, a Scala DSL on top of Cloud Dataflow that Spotify has written, called Scio-- S-C-I-O-- which is available on GitHub, and you can use it on top of Dataflow. I think it may still build on the current Cloud Dataflow release, 1.9, but it's moving towards being built on top of Apache Beam going forward. So you can definitely check that out. They do have some sort of REPL-based interface as well that they build on top of Beam.

AUDIENCE: OK. Thank you.

AUDIENCE: I would like to understand if the Beam community currently develops a variety of connectors or sinks for storing the [INAUDIBLE]?

FRANCES PERRY: So there are plenty of I/O connectors out there already in the ecosystem-- basically everything can be read as a Hadoop input. So we're building a compatibility layer into Beam right now-- I think it might still be in pull request, but it's coming very soon-- that will allow you to basically unlock any of those Hadoop inputs. So those will just run. However, Beam actually has its own I/O API that you can use to teach it about new data sources and sinks. And it's a little more flexible than the Hadoop input API. So the reason we developed a new API is that we really wanted to give runners the right hooks that they needed to do some more efficient execution. So if you look at Beam's I/O APIs, our source API, what it allows you to do is not just declare how to divide your input data up into x splits and then read a split. It also lets you talk about the progress through reading one of those splits, and how to further subdivide. So what that lets the service do is basically decide that-- because a machine is misbehaving and slow, or because some data is complex-- if it sees a straggler shard, which is always a huge problem in these batch systems, it can go ahead and subdivide some of that work, steal it from that worker and give it to somebody else who's done. But building that additional flexibility did require some additional API hooks. So basically where Beam's at now is you can unlock anything at sort of the level it is elsewhere in the ecosystem. And also we're building out more and more support for these more efficient APIs.
All right, well, wonderful. Please feel free to come up and [? grab ?] questions if you'd like. I've got a handful of Beam stickers left that are there on the podium. And please come join us in the Beam community.
Info
Channel: Google Cloud Tech
Views: 37,091
Rating: 4.8757763 out of 5
Keywords: Apache Beam, Beam, data processing, Beam model, batch data processing, stream data processing, Beam pipeline, Cloud NEXT, Google Cloud, GCP, Cloud, #GoogleNext17
Id: owTuuVt6Oro
Length: 37min 37sec (2257 seconds)
Published: Fri Mar 10 2017