[MUSIC PLAYING] KAYUN LAM: How's everyone doing? Great? Great. Yeah, welcome. Having fun so far? Learning stuff? Both at the same time? Great. Thank you very much. My name is KaYun, together with
my colleagues Vince and Julie. All three of us are customer
engineers for Google Cloud. And we're going to present to you this session, "A Modern Data Pipeline in Action." We're going to walk through a data pipeline, from collecting data to processing the data and analyzing the data, and discuss the best practices and considerations. All right. So let's just quickly
recap what data pipeline is and what it is for. As you can see,
different stages usually exist in a modern data pipeline. Usually it starts
with data ingestion: you have to collect the data, take the data in. It could be coming from servers, users, or IoT devices these days. Once we have
collected the data, we have to transform it, cleanse
it, some kind of processing. We then have to store the
data, put it somewhere-- sometimes file systems,
storage device, or maybe in a data warehouse. And then once we have the
data in, we can take a look and see what we can
derive from the data. In the old days, it could be as simple as an aggregation, a summation, a group by
day, group by week, or maybe these
days it can be very common to do some
statistical modeling or maybe use it for advanced use
cases like machine learning and AI. And then visualization. It's a very, very
important step. Usually it is used
for presenting the results of the analysis-- pie chart, bar chart,
node-link diagrams, word cloud, you name it. Different types of
visualization techniques. Usually these stages are
done in an orchestrated, in an automated, fashion. It could start with, for
example, a presence of the data file, some events coming
in, some trigger, or maybe the trigger is based
on a batch schedule. Although there are cases that
maybe it is done manually during development
phase, testing phase, or maybe it is for ad
hoc analytics project. We do hear a lot of
challenges from customers who are trying to build
data pipeline these days. And they are encountering
different types of things. Starting with ingestion,
there could be a huge volume of data coming in. They're coming in
in different formats at very, very high speed. And then the transformation. It can be tough just to cater for all of those requirements
that are required by the downstream application. You have to add in all
those transformation rules. You have to keep adding,
adding, adding until a point that I'm not sure
whether you all have the experience of talking
to your data integration team. Hey, I just want to understand
how come the incoming data file has a million rows and then
you're dropping 10% of them. They got rejected, they
got put somewhere else. What happened? It's just getting
harder and harder to understand what happened
in between the pipelines. And of the storage
itself, I mean with the huge volume
of the storage, it is really getting tough
as an exercise on its own just to architect
the storage system. Just to make it highly
available, make it durable, make sure that things are
backed up correctly, make sure it can handle the
throughput that's coming in. And then the size of
the storage will grow, and you have to
figure out how to add the storage in a way that
is not impacting the system and then it can still catch
up with the requirements of the IoT, the
throughput, and things like that. And then when we get to
the analysis of the data, well, there are many cases where so many different users have their own queries to run that you have to figure out a way to tune it for this user and then tune it for another group of users. And of course, with the data volume that keeps adding to it, the queries may run slower and slower. So it's not like a
one-off exercise. You have to keep
monitoring it, you know, all those things like
analyze it, run stats on it. All those things just to
keep it up and running and make it efficient. Well, there are some other
cases that the data is not even in one single place. I mean they're in
different types of storage, in different data marts. They're just in different places
making it hard to analyze. And finally onto visualization. There are so many
cases that I've seen in a customer situation. They have a BI dashboard. But then, the caveat
is that the data available for
visualization is just the latest
three months of the data. If you're having good
luck in your company, maybe OK, it's one year
of data, three years of data in your dashboard. It may not be all the data
set that you want to analyze. People are doing this
usually to optimize the performance of the
query so it's not overly loading the dashboard. To be honest in
many of the cases, you don't get to run
the dashboard yourself. It will be run essentially
by someone else, you know, an
administrator, a server generating the report for you. 8:00 AM in the morning, weekly
report, they send you a PDF. The PDF has a chart. You detach the PDF and
that becomes your reporting repository. So I know this session title
is called "A Modern Data Pipeline in Action." When we say action, let's
try to find some action that is fun on a Thursday afternoon. So let's play a game, shall we? So I'm going to introduce my
colleague Vince on the stage and he'll guide you through
a game as the action piece in this session. All right, Vince. Thank you. [APPLAUSE] VINCE GONZALEZ: Hi, everyone. So what we did was
we built a demo. We thought that
rather than pounding through a whole bunch
of slides, instead we'd get you involved in
actually building our data pipeline in action. So on the screen here
you'll see a bit.ly link. If you'd like to
take out your phones and hit this bit.ly
link in a browser, you'll be taken
to an application that we built for you
to actually generate the data that we'll put
into a data pipeline, and we'll visualize the results
here in the room in real time. Your participation
here is not required. But if you'd like to see
a data pipeline in action, it would be really great if
you played along with us. So please if you haven't
already, take your phone out. Hit this bit.ly
link in your browser and you'll be presented
with a simple quiz game. In the quiz game,
it's really easy, they're multiple
choice questions. You hit the button
for the correct answer and then answer
the next question. We'll give you
guys a few seconds to answer as many questions
as you can before we resume. So while you're playing, we'll
switch to the demo laptop, please. And what you'll see
here is a dashboard. As your answers flow in, we'll
see this dashboard update in real time. Nice to see those numbers moving
up as you all play the game. Cool, so what you see here
is a real time dashboard that we built with the help
of our friends at Looker. We've got a couple of
sort of individual metrics that we display on
the screen to show the number of active users,
all the people of all time who have taken our quiz. And then we've got
a little time series chart that shows
how many answers are being put in per minute. These things are updating on
slightly different refresh schedules, so that's why you
may see the answers per minute delaying a bit. But these answers are actually
moving through our data pipeline as we stand here. Can we switch back to
the presentation, please? Now what are we solving
with this data pipeline? Usually what we hear
from our customers is that there's a
handful of things that they're solving for when
it comes to building out a data pipeline. One of the things
that's often the case is that we want to get
timely access to the data. Usually people
don't want to have to wait for days or even
hours to get access to data to start analyzing it
and looking for insights. And so we don't want to have
to wait for long, drawn out ETL processes to complete
before we can actually start querying the data. We may also have people
in our organization who we want to enable to
make data driven decisions but don't have very high
degree of technical skill. They don't program in
Python or R or even SQL. And so we need to enable
everyone in an organization to make decisions
with data with tools that are easy for them to use. And finally, your data
engineers are out there building the tools for these
other constituencies to use. Your data engineers
actually also need a way to process
data consistently, and then when business rules
change or when bugs are found, they may need to
actually go back and reprocess the data that
had been already processed by the streaming engine. Let's look at how
we solve for this. This is the architecture of
the game you just played. You can see that we are
going from the user-- that's you-- through
our application. We ingest data from the
app, deliver the events down to some processing
layer which prepares the data for analysis. We store the data in
an analytics engine where we can query
over the data, build reports and dashboards. And then finally, we've got
mechanisms within these tools to actually share the data out. So let's dive a
little bit deeper. On the application,
what we did was use Firebase to
implement the front end that you were playing with. We use Firebase Hosting for
all of the static assets, the HTML, the
JavaScript, and so forth. We used Firestore to store
the questions and your answers to those questions. And then we use Cloud Functions
to react to the submission of the answers in the app. And we also use
the Cloud Function to react to the updates
of the Firestore database in real time. Those functions are
then emitting data into a Cloud Pub/Sub topic. Cloud Pub/Sub is a serverless way to deliver events from an application, an IoT sensor, what have you, store them durably, and then deliver them later for processing.
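For reference, a minimal sketch of that publish step with the Pub/Sub client library for Python; the project ID, topic name, and event fields are hypothetical, and the demo's actual Cloud Functions were presumably written in JavaScript.

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names; the real ones are not shown in the talk.
topic_path = publisher.topic_path("my-gcp-project", "quiz-answers")

def publish_answer(answer):
    """Publish one quiz answer (a dict) to the topic as a JSON message."""
    data = json.dumps(answer).encode("utf-8")
    future = publisher.publish(topic_path, data=data)
    return future.result()  # blocks until Pub/Sub has durably stored the message

publish_answer({"user": "FO8YX2", "question_id": 3, "choice": "B"})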
The other side of the coin from a Pub/Sub topic is a Pub/Sub subscription. A subscription is used to
actually deliver events down to interested consumers. You can have up to 10,000
topics and subscriptions in a GCP project. We're only using a handful
of topics and subscriptions in order to accommodate your
answers as they flow in. Our consumer is a Cloud
Dataflow pipeline. Dataflow is a fully managed
service for ingesting-- sorry, not for ingesting but for
processing events and preparing them for later analysis. The data preparation may involve
just basic transformations. Taking the data that's
ingested-- in our case, we're ingesting JSON data. Performing some
computation over it-- we might be computing,
say, a score over the data. And then storing
that into some sink-- in our case, we're using BigQuery. Now, if I were delivering
this talk a year ago, I'd probably have a screenshot
of a snippet of some Java code that was reading from
Pub/Sub, optionally doing some transformation
in the middle, and then storing the
data into BigQuery. Not everybody's
a Java developer, but it's often the
case that we need to be able to deploy these
kinds of pipelines simply. So we've recently added
to the Cloud platform a number of pre-built
Dataflow templates that allow you to easily ingest data from a source and write it to a sink, optionally transforming it as it passes through a JavaScript UDF that you can provide. With these provided templates, you can get data from Cloud Pub/Sub into things like BigQuery or Cloud Storage without writing a single line of code.
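As a hedged illustration (not the exact steps we used), a provided template such as the Pub/Sub-to-BigQuery one can also be launched programmatically through the Dataflow API; the template path and parameter names below follow the public documentation and should be checked against the current docs, and every project, topic, and table name is made up.

from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
request = dataflow.projects().templates().launch(
    projectId="my-gcp-project",
    # Google-provided template that reads a Pub/Sub topic and writes to BigQuery.
    gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
    body={
        "jobName": "quiz-pubsub-to-bigquery",
        "parameters": {
            "inputTopic": "projects/my-gcp-project/topics/quiz-answers",
            "outputTableSpec": "my-gcp-project:quiz.answers",
        },
    },
)
print(request.execute())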
In our case, what we're doing is ingesting from Pub/Sub, processing that data minimally, and then writing it into BigQuery. We're also writing an
archive out to Cloud Storage as a bunch of Avro files. We chose the Avro
format because it's extremely well supported by all
of the other GCP components. It's a first class citizen
with respect to BigQuery ingest and all of the rest of
the suite of big data tools on the platform. You already saw the
dashboard that we showed you a bit ago from Looker. Looker is a great tool, it has
excellent support for BigQuery, and is really very nice to
work with as I can attest. But it's not the only option you
have for visualizing your data. Google Data Studio
is built directly into the Cloud platform. It is free to use. It's a great way to
really democratize the ability of people
in your organization to create visualizations
and share that out to the rest of the organization. And there's a long list of
other BI and visualization tools that are supported
by our partners. I talked about the
data engineer's need to be able to reprocess
data and maybe go back and recalculate things. The Cloud Storage archive
is what enables this. So having stored our events
in an archive on Cloud Storage as a bunch of Avro files, should
a data engineer need to do so, we can write another
pipeline that might go back and reprocess all of that
data out of Cloud Storage before writing it again
to our eventual sink. This is usually run as something
like a batch job, in contrast to the streaming pipeline
that you saw earlier. What's great about this
framework, and Cloud Dataflow in particular, is
that if I fix my bug or change my business rules
in my streaming pipeline, those transforms that I used
for the streaming pipeline can usually be used pretty
much directly in a batch pipeline that runs alongside. This is how we enable the reprocessing of data, the recalculation, the restatement of results.
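For reference, a minimal sketch of such a reprocessing job with the Apache Beam Python SDK, reading the Avro archive back out of Cloud Storage and rewriting a table; the bucket, table, field names, and transform logic are hypothetical, and the demo's real pipeline may differ.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def to_table_row(event):
    # Re-apply the (possibly corrected) business logic before writing again.
    return {"user": event["user"],
            "question_id": event["question_id"],
            "correct": event["correct"]}

options = PipelineOptions()  # add runner/project/temp_location to run on Dataflow
with beam.Pipeline(options=options) as p:
    (p
     | "ReadArchive" >> beam.io.ReadFromAvro("gs://my-bucket/archive/answers-*.avro")
     | "Transform" >> beam.Map(to_table_row)
     | "RewriteTable" >> beam.io.WriteToBigQuery(
           "my-gcp-project:quiz.answers_reprocessed",
           schema="user:STRING,question_id:INTEGER,correct:BOOLEAN",
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))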
When you run a batch pipeline, it's not always the case that the batch pipeline that
you're looking to execute is the only thing
that needs to run. You usually need some
way to orchestrate this. There might be
dependencies that have to be satisfied before
executing the batch pipeline. Creating a database,
creating a table, executing a different
Dataflow job, or executing some bash script. We recently announced
the general availability of Cloud Composer, which is based on Apache Airflow. Apache Airflow is an open-source platform for orchestrating complex data pipelines, particularly of the batch style that I'm describing here. Cloud Composer is a managed service for Airflow and makes it really easy to set up and manage and run an Airflow cluster.
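For reference, a minimal sketch of what such an orchestrated batch flow might look like as an Airflow DAG running in Cloud Composer; the task commands, file paths, and schedule are all hypothetical placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(dag_id="nightly_reprocess",
         start_date=datetime(2018, 8, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    # Hypothetical upstream dependency: make sure the target dataset exists.
    create_dataset = BashOperator(
        task_id="create_dataset",
        bash_command="bq mk quiz_reprocessed || true")

    # Then kick off the batch reprocessing pipeline (a placeholder command).
    run_batch_pipeline = BashOperator(
        task_id="run_batch_pipeline",
        bash_command="python /home/airflow/gcs/dags/reprocess_pipeline.py")

    create_dataset >> run_batch_pipeline  # enforce the dependency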
Now that I've taken you through the architecture, I'd love to welcome
KaYun back to the stage to give you a little
bit of a deeper dive into some of our
architecture choices. [APPLAUSE] KAYUN LAM: Thank you, Vince. All right. Thank you Vince for guiding
us through the modern data pipeline in action. So I would like to go back
and revisit it, and then take a look at some of
the architecture decisions that we have made when
presenting this solution. Of course, just now
the example is a game. In your use case, it may
or may not be a game. It could be in a traditional
enterprise environment. It could be some other kinds
of data processing needs. We still want to
revisit or maybe go through some of those
architectural decisions to see if there's
something universal. Maybe there's something that
you can borrow and apply in your data pipeline
in your environment. First of all, just
now the data flows from Firebase, the
application that you used your phone to play on. And then eventually
it goes into BigQuery for the dashboard
for the analytics. Technically speaking, it is
actually doable, technically feasible, to have the
data written from Firebase into BigQuery directly. As a matter of fact,
BigQuery actually supports multiple ways
of ingesting data. BigQuery can ingest
data in batch. BigQuery, on the
other hand, also has a streaming
insert mechanism. That means when you run a query, it actually can merge the data, merge the results, between the streaming buffer in BigQuery and the data that is already present in the table.
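For reference, a minimal sketch of that streaming insert path with the BigQuery client library for Python; the dataset, table, and fields are hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
table = client.get_table("my-gcp-project.quiz.answers")  # hypothetical table

# Rows land in the streaming buffer and become queryable within seconds.
errors = client.insert_rows_json(
    table, [{"user": "FO8YX2", "question_id": 3, "correct": True}])
if errors:
    print("streaming insert failed:", errors)

# An ordinary query sees the streaming buffer merged with the stored table data.
for row in client.query("SELECT COUNT(*) AS n FROM quiz.answers"):
    print(row.n)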
So it is actually very, very doable to have Firebase write the data to BigQuery directly. We chose not to do that. We chose to put
something in between, and that is what
we call decoupling. In this case, we want the
application-- in this case, in this example,
Firebase-- to focus only on publishing the source
data, the source data from the application
in its original format. In this case, it is JSON data. And just focuses on publishing
the data onto the Cloud Pub/Sub topic. The application doesn't
need to be aware of, hey, what is the data
set name down the road. Hey, where is BigQuery
or some other databases? What is the data set name? What is the table name? What does the schema look like? And then in many cases, it may
not be one single downstream application. Right now it could be one, it
could be a second one later. And it can grow. And then we don't
want to have so many of those touch points
built into the source application in this case. Having Pub/Sub in this
case will make sure that Firebase, the
source application, only focuses on
publishing the data. And it will let Pub/Sub be the
mechanism to fan out the data, to have multiple subscribers
subscribing to the same topic. Each Cloud Pub/Sub
topic supports up to 10,000 subscribers. You don't need to design your own fan-out mechanism; Pub/Sub can do it for you.
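For reference, a minimal sketch of one extra consumer fanning out from the same topic through its own subscription, using the Python client; the subscription name is hypothetical.

import json
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    "my-gcp-project", "quiz-answers-audit")  # one of potentially many subscriptions

def callback(message):
    answer = json.loads(message.data.decode("utf-8"))
    print("received:", answer)
    message.ack()  # acknowledge so this subscription stops redelivering it

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=60)  # listen for about a minute, then stop
except Exception:
    streaming_pull.cancel()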
The concept of a message bus, a service bus-- this is not new, to be honest. It has been around. People have talked about a message layer, message queues, stream storage. But at the same
time, it has been a little bit
challenging to architect this kind of solution. Issues that you may
have heard around, message is stuck in a queue, the
processing cannot keep up with the incoming data volume,
or if you are lucky to have a solution that is
horizontally scalable, it has multiple
partitions, multiple shards, then you need to worry about how should I shard it, how should I partition it? The source application may need to calculate the sharding, make sure that it is evenly distributed. If it is not evenly distributed, how do you split the shard to rebalance it? It becomes the architecture
exercise on its own to deliver this kind
of message solution. In this case, in our
example, Cloud Pub/Sub is a fully managed
service running on Google Cloud Platform. It is a global service. That's one single end
point around the globe. Meaning that when you are
publishing your messages to the topic, you don't
need to worry about, hm, am I publishing my messages
to US, to Europe, to Asia? No, you're publishing
your messages to Pub/Sub and Pub/Sub will
handle the rest. You don't need to worry
about which partition or which shard you're
publishing your messages to. Cloud Pub/Sub will just
scale automatically for you and you are just paying
for that data volume that you're sending
to Cloud Pub/Sub. The processing layer,
Cloud Dataflow, it reads data from Pub/Sub. There could be varying
volume of data coming in throughout the day, throughout
the week, holiday season. There are many of those
similar processing framework can do something
like this horizontally scalable manner. But at the same time,
that horizontal factor usually is a fixed factor. Meaning that you design it
upfront whether it is four nodes, 10 nodes, 10,000 nodes. You have to have a fixed set of computing
power in order to consume the messages. On the other hand,
Cloud Dataflow can actually do something
like auto scaling. Based on the
incoming data volume coming from streaming or
maybe even from batch, Cloud Dataflow has the ability
to scale up and scale down the computing
resources underneath. So that you can adjust it and
only pay for the resources that you are truly consuming. There's no need to over
provision your streaming processing layer just to
cater for the high water mark. And then in this
case, the destination of the data BigQuery, it has
compute storage separation. Meaning that when the
data volume grows, you don't need to
necessarily tie the storage cost with
the computing costs, like the CPU and the memory. You pay for the storage and
the processing separately. This makes it simple,
elastic, and low cost. When we are sending
data to the destination under the consideration, under
the architecture decision is, how do you want that
data to look like? Traditionally speaking, I've
been in that kind of field, like ETL jobs, trying to
fit into the data schema on the data warehouse. All those things are
simple as normalization of the data, the star schema,
slowly changing dimension. Just all those ETL jobs,
they are not really doing something that is
business wise, but more to cater for the target database format. Usually the goal is
to fit into the schema so that it is efficient
for that database queries, or maybe it will
minimize the data storage so that you're minimizing the
duplication, saving the storage cost. On the other hand, in
a modern data pipeline, we can ease off on some of these
transformations in between. The reason is that in a modern
data pipeline, especially for example in this case,
what we have in BigQuery, you don't have to force it to be
like a one-to-many relationship table to split up the
data into multiple places. In a modern data pipeline, a
storage system like BigQuery, when we're storing the structured data, gives you the ability to store it in a format that is as close to the source as possible. In this case, there could be arrays, there could be structs, there could be complex data types in the originating source system. And you can persist the data as close to the source format as possible, so that you don't need to worry about, should I break down the data into this table? That table? What does the ER diagram look like? And things like that.
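For reference, a hypothetical query over data kept in that source-like shape, where each row carries a repeated array of structs instead of being normalized out into a separate table; every name here is made up.

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
sql = """
    SELECT
      player,
      (SELECT COUNTIF(a.correct) FROM UNNEST(answers) AS a) AS num_correct
    FROM `my-gcp-project.quiz.responses`
"""
for row in client.query(sql):
    print(row.player, row.num_correct)

This allows the data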
processing to be really fast. We minimize the logic,
minimize anything that may happen in
between, and now we can have low latency
access to the data. That is our goal. And the example we
had just now, we showed a streaming
example of us in the room, playing with the phone,
and then sending that data so that is streaming data. In real life situation,
that streaming may not be the only data source. Usually you will be handling
a mixture of streaming data sources and batch data sources. They could be the
same data type. They might be different. So it is very, very important
architecture decision to choose something that
you are not necessarily writing two separate
data pipelines or two separate pieces of codes. Cloud Dataflow supports
both streaming and batch. Meaning that when you're using
the underlying programming model, you write the set
of transformation logic, you write something,
some algorithm. It's very, very easy to have
that same piece of logic to work on streaming
data and also batch data. So keep this in mind when you're
designing your data pipeline to make sure that it is flexible
enough to handle both cases. One use case, as Vince
mentioned earlier, is on the reprocessing needs. It could be based on changing
business requirements or you have to go back
and fix something. But very, very often, data is
a very, very valuable asset. There are so many
cases that there might be some insights
hidden in the raw data that maybe in the future maybe
there's an upcoming analytics project, someone in
your organization just want to go back to
the data and take a look and see if there's some
additional insights that you can get out of the raw
data that could have come in months or years ago. So it is also very
important when you are designing your data
pipeline to keep reprocessing as one of the considerations. You want to make it easy for
anyone in your organization to just go back and say,
hey, I have an idea. I want to take a
look at the old data and see if there's a
certain kind of pattern, and then I can come up
with this new prediction algorithm which has to be
trained based on old data. That's why reprocessing, having
this as architecture decision, is important. As supported in my previous
slide stream and batch, reprocessing can take
advantage of stream and batch capabilities in the Cloud
Dataflow's programming model. And then you can go back and
process the existing data and see what kind of additional
insights there you can get. So here is kind of like a
flow chart on the high level transformation that might exist
in a Cloud Dataflow pipeline. In this example,
the first of the four boxes over there reads from the Cloud Pub/Sub topic and subscription. The second parses the JSON message. The third step is to turn it into the BigQuery table row format. And then the fourth step is to write to BigQuery.
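For reference, a minimal sketch of those four steps with the Apache Beam Python SDK (the demo itself may have used a Java pipeline or a provided template); every name below is hypothetical.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()  # add runner/project/temp_location to run on Dataflow
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/my-gcp-project/subscriptions/quiz-answers-bq")
     | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "ToTableRow" >> beam.Map(lambda e: {"user": e["user"],
                                           "question_id": e["question_id"],
                                           "correct": e["correct"]})
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-gcp-project:quiz.answers",
           schema="user:STRING,question_id:INTEGER,correct:BOOLEAN"))

You want to choose the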
programming model that allows you to express your pipeline. And these are kind of
high level transformations so you're not going to deal with
all those tiny, little details. The Dataflow's
programming model supports many other primitives that are specialized for streaming data processing, like a windowing mechanism-- for example, a sliding window, a fixed window, a session window.
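For reference, a tiny hypothetical fragment of how a fixed one-minute window (like the answers-per-minute chart) is expressed in that model.

import apache_beam as beam
from apache_beam.transforms import window

def answers_per_minute(events):
    """Count elements per one-minute window; 'events' is assumed to be a streaming PCollection."""
    return (events
            | "OneMinuteWindows" >> beam.WindowInto(window.FixedWindows(60))
            | "CountPerWindow" >> beam.combiners.Count.Globally().without_defaults())

So it is very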
important when you're designing your data pipeline
next time, keep this in mind. Choose something that allows
you to express your pipeline in these high-level primitives. And when you are
designing something to be expressed in
a high level manner, it allows your data
pipelines to be portable. Just now I've been talking
about Cloud Dataflow. In fact, it supports
pipelines that are written in Apache Beam SDK. But Cloud Dataflow is
not the only option. The Cloud Dataflow
does make it easier to run your Apache
Beam data pipelines. It provisions the
computing resources for you. It scales up, scales down. It rebalances the work like
what we have mentioned earlier. But if you want to choose
to run it somewhere else, you can do that. Once you have written your
data pipeline in Apache Beam, Cloud Dataflow is just one of
the several supported runners behind the scene. You can say, I want to run my
data pipeline on Apache Spark. It could be on Apache
Apex, Flink, Gearpump, Samza, or even locally when
you're doing development on your workstation. So when you're designing
your data pipeline, you want to choose something
open so that on one hand, you can run it on, for example,
Cloud Dataflow on a cloud. On the other hand, if there
are some other requirements, you can choose to run it on
your own on-premises environment or some other environments. There shouldn't be any lock-in. So with this said, I'll pass
the time to my colleague Julie and then she'll explain to
you the customer stories-- the customers who are
running these modern data pipelines on Google
Cloud Platform. Thank you, Julie. [APPLAUSE] JULIE PRICE: Hello, everybody. Thanks KaYun and thanks
Vince for walking us through the concepts of
a modern data pipeline and the architectural
decisions that go into creating a modern data pipeline. As KaYun mentioned, I
wanted to take a moment to really make it real to talk
about how some companies are creating modern data
pipelines of their own in GCP. So all of the companies
that you see listed here are doing some
really cool things with big data analytics,
big data processing, all with Google Cloud Services. And some of them, most of them,
you've already heard about. So the first customer I
wanted to talk about is Ocado. Ocado is an online
grocery retailer. They serve over 70% of
households within the UK. They don't have a single
brick and mortar store anywhere in the UK. So they're competing with all
of these really large stores that people go past
on their way home. And so they really
need to find a way to get their
customers to be loyal, to make sure that they're
always purchasing through them. So what they wanted to do was
create a big data analytics platform which would help
them convince customers of all of the benefits
of buying online as opposed to in the stores. And they also
wanted to make sure that they were using data to
drive their business insights, to do things like
inform the supply chain, predict demand, and really
just overall improve their logistics. They had a problem because
their data was siloed. They had business
data and product data. Transactional data
was all sitting in different places
across their data center, and they had no means of
communicating with each other. So what they knew
they needed to do was come up with a way to
create a platform that could pull all this data together. And they did that in GCP. So they were able to build
this advanced analytics solution that could not only
process the data through, but run advanced
analytics against it to do things like determine
what would be best to improve customer satisfaction, how they
could optimize their supply chain, how they could really
get access to insights about their business data
in closer to real time, and ultimately reduce costs. So all of those kind of
common business goals that almost every company has. So they did that by implementing
first and foremost BigQuery as their data storage platform. So BigQuery is where all of the
data ends up to be analyzed. So they do click stream
analysis, customer analysis, product and department analysis. Everything you can think of
all happening in BigQuery. And this is across over
two petabytes of data. They also are using
Cloud Dataflow for doing their transformations. So you might remember Vince was
talking about Dataflow quite a bit as well. And then they're using Pub/Sub
for all of their data delivery. So it's this kind of a common
pattern of Pub/Sub, Dataflow, to BigQuery. And now they took it even a step
further by integrating BigQuery directly with TensorFlow and
Cloud Machine Learning Engine. Ocado was one of the
very first customers to even help us test the
Cloud Machine Learning Engine. And what they did was create
these really awesome data pipelines that do
very real things. So one example of
that is the way that when somebody
submits an order, the items get picked and
packed and ready for delivery. So let's say, I'm going to make
a curry for dinner tonight. And I put into my
Ocado app that I want some vegetables, a
protein, some lentils, rice, and some nice naan bread. It's around noon time, right? I'm on my lunch, not
taking up my work time with ordering my groceries. And I go ahead and
enter that, and that's when the system starts
to work really well. Because now we have to
analyze where are these items? In what warehouse? What warehouse is closest to
where it needs to be delivered? Are there perishables? What can be picked now? What has to be picked later? And so it determines
the appropriate time to do these things and sends
signals to the warehouses. And the warehouses
have a number of robots that are used to pick all of the
items in an automated fashion. But they just don't
say, hey you, robot, you are responsible
for Julie's order. In fact, there's ML
swarm intelligence that they've created
which will help the robots work collaboratively
so that they can go through the warehouse. And many robots could be working
on picking my items, as well as others, but making
sure that they all end up in the right
package to be shipped. And making sure that it's
all done in such a way that everything ends up
on people's doors fresh as could be with no perishables. So it is really pretty
cool what they've done. Even still, they're sending
all of the telemetry data from the robots up to
the Cloud and they're analyzing that so they can
do scheduling, to make sure they have the right
number of robots on the floor in each warehouse. And to do things
like predictive wear and tear, to make sure that if
they think that something might happen to a robot, that
can pull it and replace it with a well-working
robot before something happens and interrupts
the workflow. So that's what they've
done internally using this modern data pipeline. And they also determined that--
they do a really good job with e-commerce. And so they thought,
why don't we create an e-commerce
platform that we can sell to other customers? So large brick and
mortars that also do online, and they created
Ocado Smart Platform, which leverages a lot of
these same types of pipelines to ensure now that these other
retailers across the globe can deliver the same type of
fantastic e-commerce experience that Ocado does to
their customers. Now, the next customer
that I wanted to talk about is Brightcove. So a very different company. They're actually an
online video platform. They serve content
for internet TV, news media, various
different internet media. So think all video streaming. And Brightcove runs 8,500
years worth of video streaming, video viewing each
and every month. They have 7 billion
analytic events per day. So that's a lot of data-- 85,000 years worth
of video streaming per month for all the
people that are watching. And so they were
having a problem because their legacy system was
kind of bursting at the seams. They knew that they needed to
re-architect and re-platform. And so they investigated a
number of different big data stacks, and they
landed upon GCP. And so what did they do? They implemented
a data flow that went from Pub/Sub for
event delivery, Dataflow for all of their
transformations, and BigQuery to land their data and be
able to perform analytics. And they chose this because
all of those services can scale to whatever scale that
they need without them having to worry about whether or not
the infrastructure can handle it. So this story's a little bit
shorter than the last one, but really interesting
nonetheless just because of the grand scale
with the amount of information that moves through the
pipeline every single day and the amount of
information that gets analyzed every single day. And speaking of scale, that
brings us to our last customer that I'd like to talk about. So are there any
Spotify users in here? A few. So I personally use
Spotify every day. I absolutely love the service. So I was very excited to learn
that Spotify chose to put all of their tech stack on GCP. So we're not going to talk
about all of the tech stack, we're just going to talk about
the cool stuff, of course, the data stuff today. Spotify had the largest Hadoop
environments in Europe-- a 2,500 node, 50,000 CPU
core Hadoop environment with 100 petabytes of
capacity, 100 terabytes of RAM. It makes me shudder
at the thought of the expense of that system. But it was really important
because over 20,000 jobs were running on the system per day. From 2,000 different
workflows, it was supporting 100 different
teams within Spotify. So you have to imagine
it was very complex. It was very important. They couldn't just break it down
and build it somewhere else. So they knew that they wanted
to get out of the on-prem world and out of the single
Hadoop cluster world. And so they chose
to come to GCP. And so maybe you might
recognize a pattern here. What do you think they chose for
their ad hoc analytics and data storage? They chose BigQuery. Their BigQuery
environment serves over 10 million queries
and scheduled jobs every single day-- I'm sorry, every single month--
processes 500 petabytes of data every single month for all
of the different users that need to query against it. Also, you might guess,
that for event delivery, they chose Pub/Sub. And if you remember the scale
of what Brightcove was doing, 7 billion analytic
events per day, Spotify has one
trillion requests that go against Pub/Sub. And Pub/Sub is able to
scale to handle that. Not only just to
handle it, but 99% of all requests that
come through Pub/Sub have a maximum latency
of 400 milliseconds. So I think they've got the
scale and the low latency pretty much worked out for
that part of the pipeline. Now data processing,
again you might have guessed they're using Dataflow. They run 5,000
Dataflow jobs per day. But they also realize
that they had some things that they wanted to
continue to do on Hadoop. So there was some
workloads, some ETL that they were doing in Hadoop
that they wanted to keep there. And so they also
introduced another service, which we haven't really
talked about today, which is Cloud Dataproc. And that's our managed Hadoop
environment within GCP. Well, the big difference
between a traditional Hadoop environment and Dataproc is that
we're able to take and decouple the storage from the
compute so now you can store your data in
Google Cloud Storage and just re-point
your Hadoop jobs to look at GCS instead of HDFS.
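For reference, a hypothetical PySpark job on Dataproc reading from Cloud Storage rather than HDFS; only the path scheme changes, and the bucket and path are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("listen-events").getOrCreate()

# Before: events = spark.read.json("hdfs:///data/listen_events/")
events = spark.read.json("gs://my-bucket/data/listen_events/")
print(events.count())

And in doing that,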
that enabled them to have very workflow specific
Hadoop environments that only spin up when it's
time to run the job. They spin back down
when it's finished, and they only pay for
the compute resources when they absolutely need them. So you might have
seen, I'm not sure if anybody sat in this
session yesterday, but Spotify did a session
on their full migration from app tier all the
way down to the back end. So if interested, I
definitely recommend taking a look at that if you
want to see all of the thoughts and considerations they
went through when designing their solution in GCP. But the moral of the story, the
reason why I really bring it up today, is because it was huge
for Spotify to no longer have to worry about the
scale of infrastructure, the stretching of
infrastructure, whether or not they
had what they needed for the amount of data
that was coming in, or the growth that was
going to happen as they come with new music licensing and
new users coming to the system. Now they can just
analyze the data, understand listener
behaviors, understand how music tastes
correlate, and really build a fantastic music streaming
and music recommendation experience for the consumers. So we're nearly there. I know this is the very
last session of all of Next, unless you're coming
to bootcamps tomorrow. But before we separate,
there's a couple of things we wanted to do. First of all, does anybody
want to know who might have, quote unquote, "won" the quiz? Answered the most
questions right? So you might see on the side
here in the upper right hand corner of the app, which you
may have logged out of already. If you log back in, you're going
to see what your username is. We didn't want to put
anybody's email address or name up on the screen. So if we switch back over to the
dashboard, we can have a look and see who answered the
most questions correct and also who answered
the most accurately. So let's see. I can't see that far. If we can see who the user is. KAYUN LAM: FO8YX2. JULIE PRICE: So it's
not a requirement, but if you're still
sitting in here and you know that
that's your user and you want round of applause
from 300 of your new best friends, go ahead and stand up. Anyone? Anyone? [APPLAUSE] It's a mystery. And we also have somebody
who got 100% correct. KAYUN LAM: Let me
do a quick refresh. JULIE PRICE: 80% correct. KAYUN LAM: HJZSY2. JULIE PRICE: So if you're here
and you want the recognition, please go ahead and stand up. Again, not a requirement. But thank you all
for playing the game and for watching the "Modern
Data Pipeline in Action." If we could switch back
to the slides, just a few more things before we go. First of all, you
have the opportunity to make this real
for yourselves. So everything that we
showed you here today, everything that we built in
this modern data pipeline, up until the point
of the visualization you can build on your
own with this Codelab. So you have this link here. You'll have access to
the slides as well. Additionally--
did I go too fast? I'll wait until I
see all phones down. Additionally, there
are also a number of sessions that
are relevant to what we talked about here today. So there are sessions on
Firebase, on Dataflow, on BigQuery, and Looker, and
how well they operate together. And if you go ahead into the
next website or app and just filter by data and
analytics, you'll be able to see lots of
things about the things we talked about today. Finally, we have some
resources for you to learn more to get
started with GCP. If you don't already
have an account, you can get a free trial account
so that you can build out this. And that free trial account will
have more than enough credits in it for you to
run that through. And also, if you
could please, please, please before you leave this
room open up the next app and fill out the
survey to let us know how we did so
that we can make sure that we make next year's
Next even better than this. We'd really, really
appreciate it. Thank you. [APPLAUSE] [MUSIC PLAYING]