[MUSIC PLAYING] FRANCES PERRY: So
good afternoon. My name is Frances Perry. I'm a software
engineer at Google, and I'm also on the
Project Management committee for Apache Beam. Now today I'm going to give
you an introduction to Apache Beam, which is a new open source
project focused on both batch and streaming use cases. Now when you're
using Beam you're really focusing on your
logic and your data without letting details from
the runtime abstractions leak through into your code. And what that does is give
you a very clean separation. That means that your Beam
pipeline, once you've written it once, can be
run in multiple execution environments. Everything that
you know and love, whether it's Apache Spark,
Apache Flink, or Google Cloud Dataflow. Now to put Beam in the context
of the surrounding big data ecosystem, let's take a
quick look at the evolution. So what you see here is that
original MapReduce paper in 2004. And this paper
fundamentally changed the way we do distributed
data processing. Now inside Google we
took that technology and we continued to innovate. Publishing papers, building
systems, but really not sharing the code broadly. In 2014, we created
Google Cloud Dataflow, which is a part of
Google Cloud Platform and based on all this
history internally. And this is basically both
a programming model and SDK for writing data parallel
processing pipelines, and also a fully managed
service for executing them. Now meanwhile, the
open source ecosystem took that MapReduce
paper and created Hadoop. And an entire ecosystem
flourished around this. Beam is essentially bringing
these two streams of work back together. It's based on the
programming model, but generalized and integrated
with the broader big data ecosystem. So today I'm going to
go into more detail on two key features of Beam. The first is the
programming model. These are the abstractions
at the heart of Beam that let you express
these data parallel operations in a way that works
over both batch and streaming data. In addition, I'm
going to talk about the portability infrastructure
in the Apache Beam project. Now this is what lets you
execute the same Beam pipeline across multiple runtimes. After that we're going to
get a little more concrete, and I'll show you how these
concepts work in practice. There will be a demo
of the same Beam pipeline running across Apache
Spark, Apache Flink, and Cloud Dataflow. And then I'll talk a
little bit about Talend and how they're using this
flexibility that Beam provides in production. And finally at the end,
some pointers for getting started with using Apache Beam. All right so we'll start with
that overview of the Beam model. Now if you've been to
some of the other talks over the last few days,
you may have heard this example used a little bit. I think having a clear
example for explaining difficult concepts
is really important. So we love this example. You'll hear about mobile
gaming quite a bit when you're in
the Dataflow talk. So I'll keep this version brief. If you want to learn
about this in more detail, you can go to some of the other
talks and watch the videos. So what we're going
to be doing here is pretending that we've
just launched a mobile game. So we've got users
all around the globe, on their mobile phones,
doing some sort of task to earn points for their team. Every time they
score a point this is sending this information
back to our server logs and we're
aggregating that data. We want to look through
those server logs and learn some patterns
about our users. So here we have a graph
of some sample data that we're going
to be working with. On the x-axis you
have event time. This is basically the time
a user scored some points. And on the y-axis you
have processing time. This is the time that score-- that event-- arrived in
our system for processing. Now, what you'd expect is
that everything appears along this diagonal line. If distributed systems weren't
so gosh darn distributed things would arrive in our
system immediately. We'd be able to process
them and move on. But in reality of course
things don't work that way. So let's take a look at
the score of three here. This is pretty decent. This score is only
slightly delayed. It happens just before
12:07, arrives in our system just after 12:07 for processing. So this user's information was
able to get into the system pretty quickly. However, over here,
the score of nine looks about seven
minutes delayed. It happened just after
12:01, but doesn't arrive in our system for
processing until about 12:08. So perhaps this user was playing
our game on the elevator, in the subway, somewhere where
they had a temporary lack of network connectivity. So it took a little
bit for their phone to reconnect to the network,
and this information about the points that they'd
scored to be able to make it back to our system. So this graph of
course isn't even big enough to handle what would
happen if we had a real offline mode for our game. Where we've got users
on the middle seat of some transatlantic
flight playing our game for hours over the ocean. There we'd have to
wait for that plane to land, that user to
come out of airplane mode, and reconnect to a network
before we get events, potentially hours
and hours delayed. So these types of infinite
and out of order datasets can be really tricky
to reason about unless you know the
questions to ask. So the Beam model
really revolves around four key questions. The first one is, what
results are calculated? This is your business logic. This is the
algorithms that you're performing over your data,
whether it's sums, joins, histograms, machine
learning models. Whatever you're actually doing. Where in event time
are results calculated? This talks about how the
time the event actually occurred affects the results. Are we aggregating things
over all event time? Or are we trying
to do it in slices like hourly windows, daily
windows, or even bursts of user activity sessions? When in processing time
are results materialized? This talks about how
the time the elements actually make it into the
system affects the results. So how do we know when we've
seen all the data for a given time period, and we can go
ahead and emit a result? What do we do about
data that comes in late when those flights-- those transatlantic
flights land? And we get more data way
after we thought we were done. And a final question: how do refinements relate? This talks about what we
do when we choose to emit multiple results as we
get higher and higher fidelity. Do those different versions of
the result build on each other, or are they distinct? So I'm going to take
these questions, and we're going to build up a
pretty simple data processing pipeline to use
them to calculate hourly scores for the teams
playing our mobile game. So here we have a code
snippet from a pipeline that's processing those
scoring results. So we've already gone
ahead and parsed the data into some sort of structured
key value form, where we've got the team name and the
number of points that were just scored for that team. So in yellow you can see the
what question being answered. This is what we're
actually computing. And here we're taking
those key value pairs and we're summing the
integers per key or per team to get the total
score for that team.
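A rough sketch of what that slide code looks like with the Beam Java SDK-- the variable names here are illustrative, not the literal code on screen:

```java
// "teamPoints" is assumed to be a PCollection<KV<String, Integer>> of
// (team name, points) pairs already parsed out of the server logs.
// The "what" question: sum the integer scores per key, i.e. per team.
PCollection<KV<String, Integer>> teamTotals =
    teamPoints.apply(Sum.integersPerKey());
```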
So let's take this code snippet and go ahead and run it in a standard batch algorithm. Now in this looping
animation, you can see the gray line is
representing processing time. As the pipeline's executing
and processing elements and encountering them, they're
accumulating those elements into that intermediate state
that's trailing the line. And when processing
completes, the system emits the result in yellow. So this is our final
result. And you can see here that we're
aggregating things regardless of event time. We're just sort of taking
all our data in one chunk and processing it. This is basically pretty
standard batch processing. But as we dive into
the remaining questions we'll see this start to change. So we'll start by
playing with event time. Here we're specifying
a windowing function that says we'd like that
same algorithm calculated independently for different
slices of event time. So for example we could do
this every minute, every hour, every day. And in this case, we're doing
it in two-minute windows.
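In code, that "where in event time" answer is roughly one extra line in front of the same per-team sum-- again a sketch with illustrative names:

```java
// Slice event time into fixed two-minute windows, then run the exact same
// per-team sum independently within each window.
PCollection<KV<String, Integer>> windowedTotals =
    teamPoints
        .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2))))
        .apply(Sum.integersPerKey());
```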
And so what we're going to see when we go and run this is now we're calculating four distinct
results for our sample data, based on the event time of when
the event actually occurred. Now at this point
in this diagram though we're still waiting
for the entire computation to complete before
we emit any results. So that works fine
for a bounded data set when you have all the data
that you're ever going to have, and you know that
your processing will eventually complete. But it's not going
to work if we're trying to process an
infinite amount of data. In that case, what
we want to do is start reducing the latency
of individual results. And we do that by
asking for a result to be triggered based on the
system's best estimate of when it's pretty much
seen all the data. We call this estimate of
completion the watermark. So now here's the same graph we've
been using before, showing the watermark in green. And it's basically
estimating when the system thinks it has all the data. And you can see that as soon
as that watermark passes the end of the
window, we go ahead and give you the
result for that window. But again, the watermark
in a distributed system is really often going
to be just a heuristic. So here you can see that
the watermark is too fast. We think we've seen all the
data for that first window. We go ahead and
emit the results. And then that late score
that happened in the elevator or the subway comes in. And that user doesn't
get to contribute that really awesome
score of nine points to their team's total. So we don't want the watermark
to be too fast. But we don't want it to be too slow either. On the other hand, if
we were too conservative with the watermark
we were finding that we'd be introducing
unnecessary latency. It's really hard to tell
when that last plane is going to land, and that last user
who is playing at one o'clock this morning is going to
turn off airplane mode and send us their data. So if we're trying to
be overly conservative, we'll find we introduce
unnecessary latency. So what we can do about that
is use a more sophisticated trigger. And so in this
case, we're asking for both speculative
early firings as data is still trickling in. And also, updates to our
result if late elements arrive. So once we do this and we
ask for multiple results for a given window, now we
have to answer the question about how those results
relate to each other. And here we've chosen
just to accumulate. So basically each time we emit
a new result for this window, we'll get an even
more precise answer.
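As a sketch of how those "when" and "how" answers look in the Beam Java SDK-- the particular delays and the allowed lateness here are made up for illustration:

```java
// Fire when the watermark passes the end of the window, plus speculative
// early firings while data is still trickling in and a late firing for
// every late element; accumulate across firings so each new result for a
// window refines the previous one instead of replacing it.
PCollection<KV<String, Integer>> refinedTotals =
    teamPoints
        .apply(Window.<KV<String, Integer>>into(
                FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardHours(2))
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());
```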
So if you look here now you'll see we get those multiple results
for each of the windows. Some windows, like
that second one, are introducing those
early speculative firings with the incomplete results just
as the data arrives to give us an idea of what's going on. This would be
particularly useful if you are windowing
daily for example, and you wanted to check
in halfway through the day and see how things are looking. There's one on time result
for each window when the watermark passes
the end of that window and we think OK, this
should be about it. And in that first window we're
also getting a late firing when we encounter that
late element of nine to update that we have
a more refined result. So what I want you to take
away from this section is that we had an algorithm,
and we had a lot of flexibility with how we ran an algorithm. So here it was a very,
very simple algorithm. It was just integer
summation, but the same thing holds for much more complex
chunks of user logic. You take that user
logic and then you tweak just a couple
of other lines around it to answer those other
three questions. And what that gives
you is the ability to cover the spectrum between
simple traditional batch and some pretty
complex streaming use cases all in a
single framework. So just like MapReduce and
the abstractions of map and reduce that it
implemented completely changed the way that we as a community
think about doing distributed parallel data
processing, we hope that the Beam model is going
to change the way that people think about batch and streaming
going forward into the future. All right so with that
brief introduction to the Beam model
behind us, let's talk about how this model
can be portable across multiple runtimes. So the heart of Beam
really is this model. These abstractions that we've
introduced for reasoning about the style of programming. Now on top of this model,
the Apache Beam project aims to have multiple
language specific SDKs. Developers have
very strong opinions about their language of choice. That's not a battle
I want to fight. We just want to be able to
give you the options that meet your needs. Next, we have multiple runners
for executing that Beam pipeline in the environment
that fits your needs. This might be on premise. It might be in the cloud. Might be open source. Might be fully managed. Whatever your current needs are. And in fact those
needs are going to change over time as your
data evolves, your company evolves, the technologies
that you want to use evolve. So if we want to
give you the freedom to change that and take that
same logic that you've invested so much time in building,
and be able to run it in those different environments. Now, this diagram gets a
little more complicated, because what each
of those runners is doing when it executes
your pipeline is calling back into those language
specific SDKs to execute your per element
processing-- your map functions basically. So what we've done as
a project is build out a second layer of internal APIs
that really cleanly separate the runners and how they're
organizing and distributing the processing from the per
element processing that's written in those
custom languages. And by doing that we make it
very easy for the community to evolve these
components in parallel. So you can add a new runner
without knowing details about each of
those SDKs, and you can add a new SDK without
knowing the guts of how those runners work. And getting this layer correct
is incredibly important because this is the
data intensive path. This is the path that's
happening essentially per element or per bundle
at a very fine grain. And so you need to be able to
get this in the right place and be able to get the
performance that people need out of these big data systems. All right so this
is the Beam vision. This is what Apache
Beam is aiming to be. We're not there yet. This is where we are. So currently we
have the Java SDK. It's quite mature, and
it runs across a number of different runners. So Spark, Flink,
Dataflow and Apex are on the main master branch,
and we have Apache Gearpump under development. Now the Python SDK has also just
merged into the master branch. But we don't yet have that
bottom level of the function portability layer in place to
allow the Python SDK to run across all the other runners. So currently we only have
support for the Python SDK to run at scale in batch
mode on Cloud Dataflow. So we're actively working
towards building that out. So let's go into a
little more detail about some of these runners. So these three were
the original runners that were part of the
Apache Beam project. And they're also
the three I'm going to be demoing in just a minute. So many of you are probably
familiar with Apache Spark. It's a very popular choice
right now in the big data world. And it really excels in
memory and interactive style computations. Apache Flink is a bit more
of a newcomer to the broader big data ecosystem. But it has some really clean
and excellent semantics for doing stream processing. And, finally, Cloud
Dataflow, this is Google Cloud platform's
fully managed service for data processing. And this evolved along
with sort of the technology at the heart of Beam in
those years of internal work at Google. So each of these runners does
parallel data processing. But they each do them in
a slightly different way. So the question is,
how are we going to build an
abstraction layer that takes these very
different things and generalizes
them into one thing? So one option would
be to just take the intersection of what all
the different runners can do. But that's not going to be
nearly interesting enough. That's just going to be
a small slice of what data processing is capable of. On the other hand if we
tried to take the union and include all the
functionality that all the runners possibly have, we'd
end up with a real big kitchen sink of chaos. It'd be really complicated,
hard to understand. It wouldn't be this clean
crisp programming model that makes these complex tasks easy. And that's really
what we're after. So instead what we're
trying to do with Beam is be at the forefront of
where data processing is going. Both pushing functionality
into and pulling patterns out of those
underlying runtime engines. So keyed state is
a great example of a feature that existed
in some of these engines for a while. It was very useful for
stream processing use cases. But it didn't exist
in Beam originally. So we've worked with
the Beam community to understand how keyed
state is implemented across these different runners. And really pull
up an abstraction and a notion of
this feature that fits cleanly in the Beam
model and integrates nicely with our other semantics.
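To give a flavor of it, here's a hypothetical DoFn using Beam's keyed state API in the Java SDK-- a running count per key, not code from any slide in this talk:

```java
// The runner maintains one "count" cell independently for every key
// (and window), which is what keyed state means in the Beam model.
// You'd apply it with ParDo.of(new CountPerKeyFn()) on a keyed PCollection.
class CountPerKeyFn extends DoFn<KV<String, Integer>, KV<String, Integer>> {
  @StateId("count")
  private final StateSpec<ValueState<Integer>> countSpec =
      StateSpecs.value(VarIntCoder.of());

  @ProcessElement
  public void processElement(ProcessContext c,
      @StateId("count") ValueState<Integer> count) {
    Integer current = count.read();                 // read this key's state
    int seenSoFar = (current == null ? 0 : current) + 1;
    count.write(seenSoFar);                         // persist it back
    c.output(KV.of(c.element().getKey(), seenSoFar));
  }
}
```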
On the other hand, we also hope that Beam will influence the
roadmaps of each of these individual engines. So for example, the
semantics that Flink uses right now for
their data streams was heavily influenced by
a lot of the early work we did on the Beam model. This also means though
at times that there may be a divergence between
the general power of the Beam model, and what different
runners are currently capable of implementing. Because we're going to pull
something into the model when we are convinced
it's the right way for data processing
to move forward. Not all of the runners may
be able to build on that yet. So what we're going to
do in the Beam project is track that very clearly on
the website and the capability matrix. And I don't think you should
take this as an example that some of these runners
just aren't worth anything. I think what it
means is that there are significant use
cases that don't need some of these features. Batch processing is something
we've all been doing for years. All of these runners
are great at that. Some of the more
semantic differences between event time
and processing time are still really filtering
out into the broader big data ecosystem as a whole. And you're seeing these engines
make significant improvements towards those things over time. So our goal as a
project is really to be establishing the
high watermark of what data processing should be,
and helping everyone else implement that well. All right, so with those
concepts behind us-- a brief overview of the
model and a discussion of portability-- I'd like to get hands
on and dive into a demo. So what we're going to do
now is swap to my laptop and actually see some
Beam pipeline in progress. So what we have here is a very
simple batch pipeline that's taking a bunch of those server
logs where each line represents a user scoring points
for their team, and we're trying to calculate
the hourly team scores. So per hour, windowed
by hour, what did each team score across
all the users on that team. So you can see here
we're starting. We're going to parse
some arguments. Then tell the system
that we're going to start creating a pipeline. Right here.
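That opening boilerplate looks roughly like this-- a sketch rather than the literal code on screen, with the usual Beam option flags:

```java
// Parse the command line into PipelineOptions. This is also where a flag
// like --runner=DataflowRunner (or SparkRunner, FlinkRunner) comes in,
// which is what lets the same program target different engines.
PipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().create();

// Create the still-empty pipeline that we'll add transformations to.
Pipeline pipeline = Pipeline.create(options);
```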
And then once we do that, we're building the pipeline structure by adding transformations to it. So we'll start by reading
new line delimited text from an input file. And then we'll take each
of those lines of new line delimited text and
go off and parse it. So this is basically
the map functions that you know and love. I happen to be
using Java 7 here. If I was using Java 8
there are lambdas and things we could be doing as well. But if you dive into the body of
this user processing function, you'll see that it's an
element wise processing. It's taking a string
in this case, that would be that line of
log text, and turning it into a more structured format-- a game action info object-- so that we're able to more
easily interact with it. So for each element we're
going to grab the element, we're going to
split it [INAUDIBLE] and try to extract
those pieces, and you'll see that at the end there
we're going to build that more structured format. Now as we're going, we might
encounter a parse error. In fact we will almost certainly
encounter a parse error since I sneakily injected
some into the input data. And so what we want to
do there is go ahead and catch that exception. And we're going to increment
a counter, log the message, and keep going. Because we're going to
want to build a pipeline that is resilient to
these kinds of things, because you can never
trust your input data to be as clean as you'd like.
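A condensed sketch of that parse function-- the GameActionInfo constructor and the exact log format here are stand-ins for the real demo code:

```java
// Element-wise parsing: one log line in, at most one GameActionInfo out.
// Bad records are counted and skipped instead of failing the pipeline.
class ParseEventFn extends DoFn<String, GameActionInfo> {
  private final Counter parseErrors =
      Metrics.counter(ParseEventFn.class, "ParseErrors");

  @ProcessElement
  public void processElement(ProcessContext c) {
    try {
      String[] parts = c.element().split(",");
      c.output(new GameActionInfo(
          parts[0].trim(),                    // user
          parts[1].trim(),                    // team
          Integer.parseInt(parts[2].trim()),  // score
          Long.parseLong(parts[3].trim())));  // event-time timestamp
    } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
      parseErrors.inc();  // increment the counter, note the error, keep going
    }
  }
}
```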
So with that parse structure, let's go back down to the main method. So now you can see we're
going to set the time stamps. And what this will do
is take that time stamp that we parsed out
of the log line and set it as the metadata
time stamp on that element. And this allows the system
to interact with the time stamp of that element. And it will use that
information to do things like adjust the way windowing
and triggering behave. Next, we'll go ahead and ask
to calculate one hour fixed windows of our answer. And then we'll ask to
run our calculate team scores transform. Now this is a composite
transform that we wrote. That basically means it's a
subgraph of more transforms. And this subgraph is going to
take a PCollection of those game action infos and
return an empty PCollection-- a PCollection of Void-- because it's
actually going to go ahead and write them
straight out to files. And the body here is that
we're going to extract the team and the score. And then we're going to sum
those integers per team, so that we have the score
for that given team. And then we're going to
go ahead and write it out to a file with
one file per hour.
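Put together, the main method wires those pieces up roughly like this-- a sketch with assumed names and made-up paths, using a Java 8 lambda for brevity where the demo used Java 7, and treating CalculateTeamScores as the composite just described:

```java
pipeline
    // Read newline-delimited text from the input files.
    .apply(TextIO.read().from("gs://my-bucket/logs/*.csv"))
    // Element-wise parse into structured GameActionInfo records.
    .apply(ParDo.of(new ParseEventFn()))
    // Set each element's metadata timestamp from the parsed event time,
    // so windowing and triggering can work off event time.
    .apply(WithTimestamps.of(
        (GameActionInfo info) -> new Instant(info.getTimestamp())))
    // Slice event time into one-hour fixed windows.
    .apply(Window.<GameActionInfo>into(
        FixedWindows.of(Duration.standardHours(1))))
    // Composite: extract (team, score), sum per team, write one file per hour.
    .apply(new CalculateTeamScores("gs://my-bucket/output/hourly-team-scores"));

pipeline.run();
```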
So that's basically the structure of this pipeline. So let's pop over for a
minute and take a look at the input data. So I'm going to run over
a handful of CSV files. Each of them 12 to 30 gigs. So a decent amount of
data, not a huge amount. And then when we go and
execute that pipeline, here I am running this
pipeline on Cloud Dataflow. So what you can see here is this
is the structure of that code I built. So there's our read,
our parse, our set time stamps. Here's our composite operation
down here the sum team scores. And you can see
it gives us a way to pop in and see that internal
structure that we built. This way of sort of expressing
that composite structure of your pipeline to your
monitoring UI is hugely useful. If you've ever seen systems
that are letting you monitor these pipelines, the DAG
that you're looking at is like this big and it's
completely impossible to make any sense of it. So this basically lets
you as a programmer-- you're used to thinking
modularly-- this lets you add that structure
into your pipeline in a way that the system is able
to speak it back to you. So you can see that this job
is running with the Apache Beam SDK for Java 0.6. This is actually the
release candidate of next week's release-- ran for nine minutes. You'll see that
some of the stages are reported as having
run for longer than that. That's because this is summed
up over all the workers that were actually executing. So down here you
can see Dataflow chose to autoscale to 15 workers
for this size of data. If we scroll down, you can
also see those parse errors that we were counting
are tracked here. So as the
pipeline was executing-- this one's already
completed now-- we could see these ticking up
as we were encountering those parse errors. So that's this pipeline
running on Dataflow. We can go check
the output directory for the Dataflow run. You can look and you
can see basically we're getting one output
file per hour here. We can pop in and see
there's the team names. There's the scores
for those teams. I made up the team names. So there you go. So that's what the
pipeline looked like running on Dataflow. But I also want to take
that exact same pipeline, and I'm going to run it
on two other runners. So over here I have
the Spark runner. So I've got a Spark
cluster up and I've run that exact
same pipeline. You can see that as we
dive into here, basically this is Spark's definition
of that high level DAG. So what you're seeing here is
basically that original map and parsing phase. Then there's that
first combine that comes from the summing
across the keys. And this was that
second group by key that we're using to
organize the output files. So if we go down into
those three phases really the one with all the guts
in it is this first one. Because any distributed
processing engine worth its salt does as
much pre-combining as it can before it
starts shuffling. So when you go down and
look at the comparative size of these three stages all
the goodness is in here. So now you're starting
to see the DAG for this stage, which
looks more like the code that we were writing. So there's our read, we
got our parse game events here, setting our time stamps, and
down to that pre-combine. And if you dig in
here again you'll see the shuffle of data
that's being written out is relatively small. If we look into the
event timeline a bit you can see each of the parallel
workers setting up their task-- going ahead and
processing their data. It takes a slightly
variable amount of time. And so on. So one thing that's a little bit
difficult to see in the Spark UI is the amount of input data. So if you look, it looks
like none of these stages are actually reading any
data because this is only tracking data read to and
from Hadoop and Spark storage. In this case, I'm using
GCS to store my data. It doesn't matter which of
these runners you're running on. We have a unified I/O API. That means that you can
use that same I/O connector across all of the Beam runners. The Beam runners just implement
support for the generic Beam I/O, and then they're able to
get all of this functionality. If we swap over to
the Flink runner, so here's the same job again
running on Apache Flink. I pop in like that--
again there's a DAG. Starting to feel familiar. So you can see the DAG
a little bit different. Here we have the read in Flink.
There's our parse events. Our combines coming
up over there. Here you can really see
the data sizes show up a little better at the
transitions between stages. So there's 109 gigs sent from
that first stage in Flink to the second one. You can see the number of
records that are involved here a little more clearly. And if you go up and
look in the accumulators, again you can see, in
that parse events step, the number of errors
that we're encountering. So you're really able to see
that this same pipeline is running at scale across all
of these different engines. So at this point,
what I've shown you is just that batch
pipeline running. I want to also show you
a streaming pipeline. So here I have a second
main method that's built up. This time it's a
streaming pipeline, and it's going to read
from Google Cloud Pub/Sub. So it's basically reading
an infinite amount of data that will just
continue coming and coming. We're going to use
that exact same parsing logic that we had before. This time we don't have to
set the time stamps by hand. The system is able to gather
them directly out of Pub/Sub. And we're going to use
a slightly more advanced windowing algorithm, because we
want to take this and transform it into streaming mode. We don't just want to wait
for everything to complete to get those results. We want that low latency. So here we're going to ask
to trigger the windows when the watermark is past
the end of the window. And also ask for some
early speculative firings. And that late firing every
time we see a late element. But then after that, after
that initial slightly different initial setup, we're going to
run that same calculate team scores transform that
we've written before.
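Sketched out, that streaming main method has roughly this shape-- the topic, the paths, and the firing delays are made up for illustration:

```java
pipeline
    // Read an unbounded stream of log lines from a Pub/Sub topic. Pub/Sub
    // supplies the element timestamps, so no WithTimestamps step is needed.
    .apply(PubsubIO.readStrings()
        .fromTopic("projects/my-project/topics/game-events"))
    // Exactly the same element-wise parsing logic as the batch pipeline.
    .apply(ParDo.of(new ParseEventFn()))
    // Hourly windows again, but now with the watermark trigger plus early
    // speculative firings and a late firing per late element.
    .apply(Window.<GameActionInfo>into(
            FixedWindows.of(Duration.standardHours(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(5)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardHours(2))
        .accumulatingFiredPanes())
    // And then the very same composite transform as the batch version.
    .apply(new CalculateTeamScores("gs://my-bucket/output/streaming-team-scores"));

pipeline.run();
```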
And again, this is a decently simple one so that I could explain it to you. But it doesn't matter the kinds
of complex logic you have. You can reuse that same
logic across these two modes. Across both bounded
fixed size data sets and unbounded data sets. So if we take this
you can pop over and we're going to see it
running here in Dataflow. So here's that
same pipeline with that same composite
transform running down there. And you can see now
it's a streaming job. It's been running
since yesterday. It's only chosen to
use three workers. The auto scaling just
stopped at three here, because I'm injecting the data
with a single threaded machine. So there's not a
huge load of data. But it's processing nearly
2000 elements per second. You can see since yesterday
it's processed 43 gigs of data. And one of the things that's
most important in the streaming pipeline is to keep up
with the amount of data that's coming in. As you can see here we're
tracking the data watermark at 15:07. It's 15:08, so
that's pretty decent. We're keeping up with real
time as we're processing this. All right, so that's the
demo that I've got today. So let's go ahead and
pop back to the slides. So now what I'd like to show you,
or talk about a little bit, is how a customer is actually able
to use this in production. So Talend is a
Google Cloud partner and they build open source big
data integration solutions. So Talend Data Preparation
is a self-service solution that enables their customers
to access, cleanse, and analyze their data. And it's available as both an
open source release and also an enterprise version. They also have a new product in
development called Talend Data Streams. And this is a UI that makes
it easy to build both batch and streaming data
processing pipelines. And it includes functionality
like interactive data analysis. So as you're
constructing the pipeline you can see samples,
sort of states of the intermediate collections
as you're building them. And it includes other things
like schema management and versioning. So Talend is building these
products on top of Apache Beam, because they believe
that building on Beam will allow them
and their customers to follow the evolution
of technologies. They'll be able to
leverage their data in the same way over
time while still having the flexibility to
execute on multiple runtimes as their needs change. So I'd like to talk a
little bit about getting started with Apache Beam. So since you're at
GCP Next I think there's a decent
chance that you'd be interested in running this
technology on Google Cloud. So what that means is that
Cloud Dataflow is likely going to be your choice of runner. Now we've been bootstrapping
Beam over the last year or so. So you'll find that when
you come to Cloud Dataflow you have a few different
options for running. We have the original
Dataflow SDK for Java, and this is what seeded
the Apache Beam project. So in some sense
this is pre-Beam and it uses the older com.google.cloud.dataflow
namespaces. But it's still fully supported. It works great on
Cloud Dataflow. You're welcome to use any
of the Apache Beam releases on Cloud Dataflow. So for Java, that would be the
most recent release, which is 0.5. At least until next
week-ish, when 0.6 comes out. And for Python, you'll
have to wait for 0.6. It just merged into
the master branch. So it hasn't made a release yet. [INAUDIBLE] if you need it. And finally, we encourage
you to take a look at the next generation
of Dataflow SDKs. These are based on Apache Beam. And what they do is redistribute
the portion of the Apache Beam ecosystem that's most
commonly used within GCP. All in a single package. All in a way that's been
tested to work well together. So this is just a
great way for getting started with that
broader ecosystem when you're running on GCP. Now all of these
Beam versions you'll notice those version numbers all
come with some sort of caveat. The Beam project has not
finalized its API surface yet. There's been a huge
amount of churn to come from the original
Dataflow SDK to this. We had to rename the packages. Every file moved from
Google Cloud Dataflow over to Apache Beam. So as a user, the kinds of
changes are not that intense, but they're still
pretty far reaching. So we've been getting all
of those front loaded. But the community
has not yet declared that they've completed that. So that's actively under
discussion in the community. I hope by maybe
early April we'll be able to declare
a stable release and guarantee that
major versions will not be making backwards
incompatible API changes. But other than
that, these Dataflow SDKs are fully ready to run
on Google Cloud Dataflow. So if you're interested, you
can go and check them out. There you go. So this talk is the end
of a wonderful conference. There were a number
of other sessions during this conference that
were related to Apache Beam. If you didn't catch
these in person, I really encourage you to
go back and find the videos. Tyler's talk from yesterday
goes into more details on the Beam model. Sort of the evolution and how
it unifies batch and streaming. The insights talk
is really about how to build a data
processing pipeline across all of the technology
in Google Cloud Platform. And it includes the use of Beam
and Cloud Dataflow to do that. And then finally there
was the talk this morning that was really about the
building of the Beam community. So we took something
that was a single company proprietary project, and we
built a thriving open source ecosystem around it. And a whole new project under
the umbrella of the Apache Software Foundation. So really that talk is about
the journey of both the code and developers involved, and how
we kick started that community. So of course you should
go to the Beam website. We've got a lot of
information there. You can use it to get started
with any of those runners that I showed you today. There's also some
great podcasts and videos, other links for learning
more about the Beam model, and also of course information
on how to get involved. How to come join the
developer mailing list. How to start contributing
to the project. Tyler wrote some articles for
O'Reilly called "The world beyond batch." These are also a great way to
learn more about the Beam model if you need some airplane
reading for your way home. And then finally,
if you want to start using Apache Beam on
Cloud Dataflow today, you can come and check
out the documentation. We'll have pointers there. You'll see the site
is in transition. So we still fully support the
original pre-Beam releases, but we also have the new Beam
ones coming up alongside those. All right so with that I'd
love to take some questions. AUDIENCE: Hi. My team, we do a lot
of stuff in BigQuery. But we're hitting a couple of
limitations here and there, and looking to merge, sort
of stream and batch directly against big storage files. But for equivalent
operations, have you done any performance
benchmarking versus BigQuery and Beam on Dataflow? FRANCES PERRY: You know I don't
have the data on that directly, but Sam McVeety who's
in the back corner there might be able
to help you out on those performance questions. AUDIENCE: Thank you. AUDIENCE: So on
my team we heavily build on Spark using Scala
because of the REPL. We like actually
going through the building process, seeing how the
data is being transformed at every stage. And that's part of why
we have stuck with Spark. Does Beam have any of
those sort of capabilities in the pipeline where
you can actually just interact with
your pipeline as you're building it similar to Spark? FRANCES PERRY: So I think
interactive use is one of those
places for the Beam model, as we work on sort of pulling
the best patterns out of the underlying runners. Interactive is something
that's not easily done in the Beam
model right now, but I think there's definitely
been a lot of interest in it. So I'd love to see the community
moving in that direction. There is, though, a Scala DSL
on top of Cloud Dataflow that Spotify
has written called Scio-- S-C-I-O-- which is
available on GitHub, and you can use it on top of that. I think it may still build
on the current Cloud Dataflow release 1.9, but it's moving
towards being built on top of Apache Beam going forward. So you can definitely
check that out. They do have some sort
of REPL-based interface as well that they
build on top of Beam. AUDIENCE: OK. Thank you. AUDIENCE: I would like to
understand whether the Beam community is currently
developing a variety of connectors or sinks
for storing the [INAUDIBLE]? FRANCES PERRY: So
the thing in Beam-- so we have an I/O API-- so there
are plenty of I/O connectors out there already
in the ecosystem. Take Hadoop InputFormat--
basically everything can be read as a Hadoop input. So we're building a
compatibility layer into Beam right now. I think it might still
be in a pull request, but it's coming very soon. That will allow you
to basically unlock any of the Hadoop inputs. So those will just run. However, Beam actually
has its own I/O API that you can use to teach
it about new data sources and sinks. And it's a little more flexible
than the Hadoop input API. So the reason we developed
a new API is that we really wanted to give runners
the right hooks that they needed to do some
more efficient execution. So if you look at
Beam's I/O APIs, our source API, what
it allows you to do is not just declare how to
divide your input data up into x splits and
then read a split. It also lets you talk
about the progress through reading one
of those splits, and how to further subdivide. So what that lets the
service do is basically decide that because a machine
is misbehaving and slow, or because some data is
complex, if it sees a straggler shard which is always a huge
problem in these batch systems, you can go ahead and
subdivide some of that work. Steal it from that
worker and give it to somebody else who's done. But building that
additional flexibility did require some
additional API hooks. So basically where
Beam's at now is you can unlock anything
at sort of the level it is elsewhere
in the ecosystem. And also we're building
out more and more support for these more efficient APIs. All right, well wonderful. Please feel free to
come up and grab me with questions if you'd like. I've got a handful
of Beam stickers left that are there on the podium. And please come join us
in the Beam community.