A Modern Data Pipeline in Action (Cloud Next '18)

Captions
[MUSIC PLAYING] KAYUN LAM: How's everyone doing? Great? Great. Yeah, welcome. Having fun so far? Learning stuff? Both at the same time? Great. Thank you very much. My name is KaYun, and I'm here together with my colleagues Vince and Julie. All three of us are customer engineers for Google Cloud, and we're going to present this session, "A Modern Data Pipeline in Action." We're going to walk through a few data pipelines, from collecting data, to processing the data, to analyzing the data, and discuss the best practices and considerations along the way.

All right. So let's quickly recap what a data pipeline is and what it is for. As you can see, several stages usually exist in a modern data pipeline. Usually it starts with data ingestion: you have to collect the data, take the data in. It could be coming from servers, users, or IoT devices these days. Once we have collected the data, we have to transform it, cleanse it, do some kind of processing. We then have to store the data, put it somewhere-- sometimes file systems, storage devices, or maybe a data warehouse. And then once we have the data in, we can take a look and see what we can derive from it. In the old days, it could be as simple as an aggregation, a summation, a group by day or group by week; these days it is very common to do some statistical modeling, or maybe use it for advanced use cases like machine learning and AI. And then visualization. It's a very, very important step. Usually it is used for presenting the results of the analysis-- pie charts, bar charts, node-link diagrams, word clouds, you name it. Different types of visualization techniques. Usually these stages are run in an orchestrated, automated fashion. It could start with, for example, the presence of a data file, some event coming in, some trigger, or maybe the trigger is based on a batch schedule. Although there are cases where it is done manually-- during a development phase, a testing phase, or maybe for an ad hoc analytics project.

We do hear a lot of challenges from customers who are trying to build data pipelines these days, and they are encountering different kinds of issues. Starting with ingestion, there could be a huge volume of data coming in, in different formats, at very, very high speed. And then the transformation. It can be tough just to cater to all of the requirements of the downstream applications. You have to add in all those transformation rules. You have to keep adding, adding, adding, until a point where-- I'm not sure whether you have had the experience of talking to your data integration team: hey, I just want to understand how come the incoming data file has a million rows and then you're dropping 10% of them. They got rejected, they got put somewhere else. What happened? It just gets harder and harder to understand what happens in between the pipelines. And for the storage itself, with that huge volume of data, it really becomes an exercise on its own just to architect the storage system. Just to make it highly available, make it durable, make sure that things are backed up correctly, make sure it can handle the throughput that's coming in. And then the size of the storage will grow, and you have to figure out how to add storage in a way that doesn't impact the system and can still keep up with the throughput requirements and things like that.
And then when we get to the analysis of the data, there are many cases where so many different users have their own queries to run, and you have to figure out a way to tune it for this user and then tune it for another group of users. And of course, with the data volume growing, the queries may run slower and slower. So it's not a one-off exercise. You have to keep monitoring it-- all those things like analyzing it, running stats on it-- just to keep it up and running and make it efficient. Well, there are some other cases where the data is not even in one single place. It's in different types of storage, in different data marts. It's just in different places, making it hard to analyze. And finally, on to visualization. There are so many cases that I've seen in customer situations: they have a BI dashboard, but the caveat is that the data available for visualization is just the latest three months of data. If you're lucky in your company, maybe it's one year of data, or three years of data in your dashboard. It may not be all of the data set that you want to analyze. People usually do this to optimize the performance of the queries so they're not overloading the dashboard. And to be honest, in many of these cases, you don't get to run the dashboard yourself. It will be run by someone else-- an administrator, a server generating the report for you. 8:00 in the morning, weekly report, they send you a PDF. The PDF has a chart. You detach the PDF and that becomes your reporting repository.

So I know this session's title is "A Modern Data Pipeline in Action." When we say action, let's try to find some action that is fun on a Thursday afternoon. So let's play a game, shall we? I'm going to introduce my colleague Vince on the stage and he'll guide you through a game as the action piece of this session. All right, Vince. Thank you. [APPLAUSE]

VINCE GONZALEZ: Hi, everyone. So what we did was we built a demo. We thought that rather than pounding through a whole bunch of slides, instead we'd get you involved in actually building our data pipeline in action. So on the screen here you'll see a bit.ly link. If you'd like to take out your phones and hit this bit.ly link in a browser, you'll be taken to an application that we built for you to actually generate the data that we'll put into a data pipeline, and we'll visualize the results here in the room in real time. Your participation here is not required. But if you'd like to see a data pipeline in action, it would be really great if you played along with us. So please, if you haven't already, take your phone out, hit this bit.ly link in your browser, and you'll be presented with a simple quiz game. The quiz game is really easy-- they're multiple choice questions. You hit the button for the correct answer and then answer the next question. We'll give you a few seconds to answer as many questions as you can before we resume. So while you're playing, we'll switch to the demo laptop, please. And what you'll see here is a dashboard. As your answers flow in, we'll see this dashboard update in real time. Nice to see those numbers moving up as you all play the game. Cool, so what you see here is a real-time dashboard that we built with the help of our friends at Looker. We've got a couple of individual metrics that we display on the screen to show the number of active users and the total number of people who have taken our quiz.
And then we've got a little time series chart that shows how many answers are being put in per minute. These things are updating on slightly different refresh schedules, so that's why you may see the answers per minute lagging a bit. But these answers are actually moving through our data pipeline as we stand here. Can we switch back to the presentation, please?

Now, what are we solving for with this data pipeline? Usually what we hear from our customers is that there's a handful of things they're solving for when it comes to building out a data pipeline. One of them is that we want timely access to the data. Usually people don't want to have to wait for days or even hours to get access to data to start analyzing it and looking for insights. And so we don't want to have to wait for long, drawn-out ETL processes to complete before we can actually start querying the data. We may also have people in our organization who we want to enable to make data-driven decisions but who don't have a very high degree of technical skill. They don't program in Python or R or even SQL. And so we need to enable everyone in an organization to make decisions with data, with tools that are easy for them to use. And finally, your data engineers are out there building the tools for these other constituencies to use. Your data engineers also need a way to process data consistently, and then, when business rules change or when bugs are found, they may need to go back and reprocess data that had already been processed by the streaming engine.

Let's look at how we solve for this. This is the architecture of the game you just played. You can see that we go from the user-- that's you-- through our application. We ingest data from the app, deliver the events down to a processing layer which prepares the data for analysis. We store the data in an analytics engine where we can query over the data and build reports and dashboards. And then finally, we've got mechanisms within these tools to actually share the data out. So let's dive a little bit deeper. For the application, what we did was use Firebase to implement the front end that you were playing with. We used Firebase Hosting for all of the static assets-- the HTML, the JavaScript, and so forth. We used Firestore to store the questions and your answers to those questions. And then we used Cloud Functions to react to the submission of the answers in the app, and also to react to the updates of the Firestore database in real time. Those functions are then emitting data into a Cloud Pub/Sub topic. Cloud Pub/Sub is a serverless way to deliver events from an application, an IoT sensor, what have you, store them durably, and then deliver them later for processing. The other side of the coin from a Pub/Sub topic is a Pub/Sub subscription. A subscription is used to actually deliver events down to interested consumers. You can have up to 10,000 topics and subscriptions in a GCP project. We're only using a handful of topics and subscriptions in order to accommodate your answers as they flow in.
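To make that "a Cloud Function reacts to a Firestore write and emits to Pub/Sub" step concrete, here is a minimal sketch in Python. The demo itself was built with Firebase tooling, so the function name, topic name, environment variable, and payload handling below are assumptions rather than the code used on stage:

```python
# Hypothetical sketch: forward a Firestore document write to a Pub/Sub topic.
# Names, the topic, and the event payload handling are assumptions.
import json
import os

from google.cloud import pubsub_v1

PROJECT_ID = os.environ.get("GCP_PROJECT", "my-project")  # assumed env var
TOPIC_ID = "quiz-answers"                                  # hypothetical topic name

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def on_answer_written(data, context):
    """Background Cloud Function triggered by a Firestore document write.

    `data` carries the Firestore event payload; `context` holds event metadata.
    We simply forward the raw document as JSON, so the application never needs
    to know anything about the downstream dataset, table, or schema.
    """
    message = {
        "resource": context.resource,      # which document changed
        "event_id": context.event_id,
        "payload": data.get("value", {}),  # the written document, in raw event form
    }
    future = publisher.publish(topic_path, json.dumps(message).encode("utf-8"))
    future.result()  # block until the publish is acknowledged
```

The point of the sketch is the shape of the hand-off: the function publishes the event as-is and leaves all flattening and schema decisions to the pipeline behind the topic.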
Our consumer is a Cloud Dataflow pipeline. Dataflow is a fully managed service for-- sorry, not for ingesting, but for processing events and preparing them for later analysis. The data preparation may involve just basic transformations: taking the data that's ingested-- in our case, we're ingesting JSON data-- and performing some computation over it. We might be computing, say, a score over the data. And then storing that into some sink-- in our case, we're using BigQuery.

Now, if I were delivering this talk a year ago, I'd probably have a screenshot of a snippet of some Java code that was reading from Pub/Sub, optionally doing some transformation in the middle, and then storing the data into BigQuery. Not everybody is a Java developer, but it's often the case that we need to be able to deploy these kinds of pipelines simply. So we've recently added to the Cloud platform a number of pre-built Dataflow templates that allow you to easily ingest data from a source and write it to a sink, optionally transforming it as it passes through a JavaScript UDF that you can provide. With these provided templates, you can get data from Cloud Pub/Sub into things like BigQuery or Cloud Storage without writing a single line of code. In our case, what we're doing is ingesting from Pub/Sub, processing that data minimally, and then writing it into BigQuery. We're also writing an archive out to Cloud Storage as a set of Avro files. We chose the Avro format because it's extremely well supported by all of the other GCP components. It's a first-class citizen with respect to BigQuery ingest and the rest of the suite of big data tools on the platform.

You already saw the dashboard that we showed you a bit ago from Looker. Looker is a great tool, it has excellent support for BigQuery, and it is really very nice to work with, as I can attest. But it's not the only option you have for visualizing your data. Google Data Studio is built directly into the Cloud platform and is free to use. It's a great way to really democratize the ability of people in your organization to create visualizations and share them out to the rest of the organization. And there's a long list of other BI and visualization tools that are supported by our partners.

I talked about the data engineer's need to be able to reprocess data and maybe go back and recalculate things. The Cloud Storage archive is what enables this. Having stored our events in an archive on Cloud Storage as a set of Avro files, should a data engineer need to do so, we can write another pipeline that goes back and reprocesses all of that data out of Cloud Storage before writing it again to our eventual sink. This is usually run as something like a batch job, in contrast to the streaming pipeline that you saw earlier. What's great about this framework, and Cloud Dataflow in particular, is that if I fix my bug or change my business rules in my streaming pipeline, the transforms that I used for the streaming pipeline can usually be used pretty much directly in a batch pipeline that runs alongside it. This is how we enable the reprocessing of data, the recalculation, the restatement of results. When you run a batch pipeline, it's not always the case that the batch pipeline you're looking to execute is the only thing that needs to run. You usually need some way to orchestrate this. There might be dependencies that have to be satisfied before executing the batch pipeline: creating a database, creating a table, executing a different Dataflow job, or executing some bash script. We recently announced the general availability of Cloud Composer, which is based on Apache Airflow. Apache Airflow is a framework for orchestrating complex data pipelines, particularly of the batch style that I'm describing here. Cloud Composer is a managed service for Airflow that makes it really easy to set up, manage, and run an Airflow cluster.
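As a sketch of what that orchestration can look like, here is a minimal Cloud Composer (Apache Airflow) DAG. The task names, scripts, and schedule are hypothetical placeholders, not the pipeline from the talk; the point is simply that the table-creation dependency has to finish before the batch reprocessing job is launched:

```python
# Minimal sketch of an Airflow DAG for a batch reprocessing run.
# Scripts and names are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "start_date": datetime(2018, 7, 1),
}

with DAG(
    dag_id="reprocess_quiz_answers",
    default_args=default_args,
    schedule_interval="@daily",  # batch-style schedule, unlike the streaming job
    catchup=False,
) as dag:

    # Dependency: make sure the target dataset/table exists (placeholder script).
    create_table = BashOperator(
        task_id="create_bigquery_table",
        bash_command="bash scripts/create_answers_table.sh",  # hypothetical script
    )

    # Launch the batch job that re-reads the Avro archive from Cloud Storage
    # and rewrites the results into BigQuery (placeholder script).
    run_batch_pipeline = BashOperator(
        task_id="run_batch_reprocessing",
        bash_command="bash scripts/run_reprocessing_pipeline.sh",  # hypothetical script
    )

    create_table >> run_batch_pipeline
```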
Now that I've taken you through the architecture, I'd love to welcome KaYun back to the stage to give you a little bit of a deeper dive into some of our architecture choices. [APPLAUSE]

KAYUN LAM: Thank you, Vince. All right. Thank you, Vince, for guiding us through the modern data pipeline in action. I would like to go back, revisit it, and take a look at some of the architecture decisions that we made when designing this solution. Of course, the example just now was a game. Your use case may or may not be a game. It could be a traditional enterprise environment; it could be some other kind of data processing need. We still want to revisit some of those architectural decisions to see if there's something universal-- maybe something that you can borrow and apply to the data pipeline in your own environment.

First of all, just now the data flowed from Firebase, the application that you used your phone to play on, and eventually it went into BigQuery for the dashboard, for the analytics. Technically speaking, it is actually doable, technically feasible, to have the data written from Firebase into BigQuery directly. As a matter of fact, BigQuery supports multiple ways of ingesting data. BigQuery can ingest data in batch. BigQuery, on the other hand, also has a streaming insert mechanism. That means when you run a query, it can merge the results from the streaming buffer in BigQuery with the data that is already present in the table. So it is actually very, very doable to have Firebase write the data to BigQuery directly. We chose not to do that. We chose to put something in between, and that is what we call decoupling. In this case, we want the application-- in this example, Firebase-- to focus only on publishing the source data from the application in its original format. In this case, it is JSON data. It just focuses on publishing the data onto the Cloud Pub/Sub topic. The application doesn't need to be aware of, hey, what is the data set name down the road? Where is BigQuery, or some other database? What is the data set name? What is the table name? What does the schema look like? And in many cases, it may not be one single downstream application. Right now it could be one; there could be a second one later. And it can grow. We don't want to have so many of those touch points built into the source application. Having Pub/Sub in this case makes sure that Firebase, the source application, only focuses on publishing the data, and it lets Pub/Sub be the mechanism to fan out the data, to have multiple subscribers subscribing to the same topic. Each Cloud Pub/Sub topic supports up to 10,000 subscribers. You don't need to design your own fan-out mechanism; Pub/Sub can do it for you.
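To illustrate that fan-out, here is a hedged sketch of attaching a second, independent consumer to the same topic without touching the publishing application. The project, topic, and subscription names are hypothetical, and the call style follows the 1.x google-cloud-pubsub Python client (newer releases use a slightly different signature for creating subscriptions):

```python
# Sketch: a new consumer gets its own subscription on the existing topic.
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"
TOPIC_ID = "quiz-answers"
SUBSCRIPTION_ID = "quiz-answers-archiver"  # hypothetical second consumer

subscriber = pubsub_v1.SubscriberClient()
topic_path = "projects/{}/topics/{}".format(PROJECT_ID, TOPIC_ID)
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

# Creating the subscription is all it takes to start receiving a full copy of
# the stream; Pub/Sub handles the fan-out and the scaling.
subscriber.create_subscription(subscription_path, topic_path)


def callback(message):
    print("Received:", message.data.decode("utf-8"))
    message.ack()


# Pull messages asynchronously; block briefly in this sketch.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result(timeout=60)
except Exception:
    streaming_pull_future.cancel()
```

The design point is that the publisher never changes: adding, removing, or scaling consumers is entirely a matter of managing subscriptions.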
The concept of a message bus, a service bus-- this is not new, to be honest. It has been around. People have talked about a message layer, a message queue, stream storage. But at the same time, it has been a little bit challenging to architect this kind of solution. Issues you may have heard about: messages getting stuck in a queue, or the processing not being able to keep up with the incoming data volume. Or, if you are lucky enough to have a solution that is horizontally scalable, with multiple partitions, multiple shards, then you need to worry about how to shard it, how to partition it. The source application may need to calculate the sharding and make sure that it is evenly distributed. And if it is not evenly distributed, how do you split the shards to rebalance them? It becomes an architecture exercise on its own to deliver this kind of messaging solution. In our example, Cloud Pub/Sub is a fully managed service running on Google Cloud Platform. It is a global service with one single endpoint around the globe. Meaning that when you are publishing your messages to the topic, you don't need to worry about, hm, am I publishing my messages to the US, to Europe, to Asia? No, you're publishing your messages to Pub/Sub and Pub/Sub will handle the rest. You don't need to worry about which partition or which shard you're publishing your messages to. Cloud Pub/Sub will just scale automatically for you, and you are only paying for the data volume that you're sending to Cloud Pub/Sub.

The processing layer, Cloud Dataflow, reads data from Pub/Sub. There could be varying volumes of data coming in throughout the day, throughout the week, over the holiday season. Many similar processing frameworks can handle this in a horizontally scalable manner, but that horizontal factor is usually fixed. Meaning that you design it up front, whether it is four nodes, 10 nodes, or 10,000 nodes. You have to have a fixed set of computing power in order to consume the messages. Cloud Dataflow, on the other hand, can do autoscaling. Based on the incoming data volume, whether from streaming or even from batch, Cloud Dataflow has the ability to scale the computing resources underneath up and down, so that you only pay for the resources you are truly consuming. There's no need to over-provision your stream processing layer just to cater to the high-water mark. And then the destination of the data, BigQuery, has compute-storage separation. Meaning that when the data volume grows, you don't necessarily need to tie the storage cost to the computing cost, like the CPU and the memory. You pay for the storage and the processing separately. This makes it simple, elastic, and low cost.

When we are sending data to the destination, another consideration, another architecture decision, is: what do you want that data to look like? Traditionally speaking-- and I've been in that kind of field-- ETL jobs try to fit the data into the schema of the data warehouse. Things as simple as normalization of the data, the star schema, slowly changing dimensions. All those ETL jobs are not really doing anything business-wise; they exist more to cater to the target database format. Usually the goal is to fit into a schema that is efficient for that database's queries, or to minimize data storage, minimizing duplication and saving storage cost. In a modern data pipeline, on the other hand, we can ease off on some of these transformations in between. The reason is that in a modern data pipeline-- especially, for example, with what we have in BigQuery-- you don't have to force the data into one-to-many relationship tables and split it up into multiple places. With a storage system like BigQuery, when you're storing structured data, you have the ability to store it in a format that is as close to the source as possible.
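As a sketch of what "keep it close to the source" can look like in BigQuery, here is a hypothetical nested, repeated schema for the quiz answers, defined with the Python client library. The table and field names are assumptions, not the schema used in the demo:

```python
# Sketch: one BigQuery table with a nested, repeated field that mirrors the
# source JSON, instead of a normalized one-to-many child table.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("submitted_at", "TIMESTAMP"),
    # The answers array stays embedded in the row, mirroring the source JSON,
    # rather than being split out into a separate table with a foreign key.
    bigquery.SchemaField(
        "answers", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("question_id", "STRING"),
            bigquery.SchemaField("answer", "STRING"),
            bigquery.SchemaField("correct", "BOOLEAN"),
        ],
    ),
]

# Hypothetical project, dataset, and table IDs.
table = bigquery.Table("my-project.quiz.answers", schema=schema)
client.create_table(table)
```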
There could be arrays, structs, and other complex data types in the originating source system, and you can persist the data as close to the source format as possible, so you don't need to worry about whether you should break the data down into this table or that table, or what the ER diagram looks like. This allows the data processing to be really fast. We minimize the logic, minimize anything that may happen in between, and now we can have low-latency access to the data. That is our goal.

In the example we had just now, we showed a streaming case: all of us in the room playing on our phones and sending that data, so that is streaming data. In a real-life situation, streaming may not be the only data source. Usually you will be handling a mixture of streaming data sources and batch data sources. They could be the same data type; they might be different. So it is a very important architecture decision to choose something where you are not necessarily writing two separate data pipelines or two separate pieces of code. Cloud Dataflow supports both streaming and batch. Meaning that when you're using the underlying programming model and you write a set of transformation logic, some algorithm, it's very easy to have that same piece of logic work on streaming data and also on batch data. So keep this in mind when you're designing your data pipeline, to make sure that it is flexible enough to handle both cases.

One use case, as Vince mentioned earlier, is reprocessing. It could be based on changing business requirements, or you may have to go back and fix something. But very often, data is a very valuable asset. There are so many cases where there might be insights hidden in the raw data, and in the future, maybe for an upcoming analytics project, someone in your organization just wants to go back to the data and see if there are additional insights to get out of raw data that came in months or years ago. So it is also very important, when you are designing your data pipeline, to keep reprocessing as one of the considerations. You want to make it easy for anyone in your organization to say, hey, I have an idea. I want to take a look at the old data and see if there's a certain kind of pattern, and then I can come up with this new prediction algorithm, which has to be trained on old data. That's why having reprocessing as an architecture decision is important. As covered on the previous slide about stream and batch, reprocessing can take advantage of the stream and batch capabilities in Cloud Dataflow's programming model, so you can go back and process the existing data and see what kind of additional insights you can get.

So here is a flow chart of the high-level transformations that might exist in a Cloud Dataflow pipeline. In this example there are four boxes. The first box reads from the Cloud Pub/Sub topic and subscription. The second parses the JSON message. The third step turns it into the BigQuery table row format. And the fourth step writes to BigQuery (a sketch of these four steps in code follows below). You want to choose a programming model that allows you to express your pipeline in these kinds of high-level transformations, so you're not dealing with all the tiny little details. The Dataflow programming model also supports many other primitives that are specialized for streaming data processing.
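Here is a hedged sketch of those four boxes expressed with the Apache Beam SDK in Python. The subscription, table, and field names are made up, and the demo itself used one of the provided Dataflow templates rather than hand-written pipeline code:

```python
# Minimal Apache Beam sketch of the four steps: read from Pub/Sub, parse JSON,
# shape into rows, write to BigQuery. Names are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True
# Switching options.view_as(StandardOptions).runner between "DirectRunner",
# "DataflowRunner", or another supported runner is how the same pipeline stays
# portable across execution engines.


def to_row(event):
    """Shape a parsed answer event into a flat BigQuery row (assumed fields)."""
    return {
        "user_id": event.get("user_id"),
        "question_id": event.get("question_id"),
        "answer": event.get("answer"),
        "submitted_at": event.get("submitted_at"),
    }


with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/quiz-answers-dataflow")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # Streaming-specific primitives such as fixed or sliding windows could be
        # applied here with beam.WindowInto for per-minute aggregations.
        | "ToTableRow" >> beam.Map(to_row)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:quiz.answers_flat",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```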
Those streaming-specific primitives include windowing mechanisms, for example a sliding window, a fixed window, or a session window. So it is very important, the next time you're designing your data pipeline, to keep this in mind: choose something that allows you to express your pipeline in these high-level primitives. And when you design something to be expressed in a high-level manner, it allows your data pipelines to be portable. Just now I've been talking about Cloud Dataflow. In fact, it runs pipelines that are written with the Apache Beam SDK. But Cloud Dataflow is not the only option. Cloud Dataflow does make it easier to run your Apache Beam data pipelines: it provisions the computing resources for you, it scales up and scales down, it rebalances the work, like we mentioned earlier. But if you want to run it somewhere else, you can do that. Once you have written your data pipeline in Apache Beam, Cloud Dataflow is just one of several supported runners behind the scenes. You can say, I want to run my data pipeline on Apache Spark. It could be on Apache Apex, Flink, Gearpump, Samza, or even locally when you're doing development on your workstation. So when you're designing your data pipeline, you want to choose something open, so that on one hand you can run it on, for example, Cloud Dataflow in the cloud, and on the other hand, if there are other requirements, you can choose to run it in your own on-premises environment or somewhere else. There shouldn't be any lock-in. With that said, I'll pass the time to my colleague Julie, and she'll walk you through the customer stories-- customers who are running these modern data pipelines on Google Cloud Platform. Thank you, Julie. [APPLAUSE]

JULIE PRICE: Hello, everybody. Thanks, KaYun, and thanks, Vince, for walking us through the concepts of a modern data pipeline and the architectural decisions that go into creating one. As KaYun mentioned, I want to take a moment to really make it real, to talk about how some companies are creating modern data pipelines of their own on GCP. All of the companies that you see listed here are doing some really cool things with big data analytics and big data processing, all with Google Cloud services. And most of them you've already heard about.

So the first customer I want to talk about is Ocado. Ocado is an online grocery retailer. They serve over 70% of households within the UK, and they don't have a single brick-and-mortar store anywhere in the UK. So they're competing with all of these really large stores that people go past on their way home, and they really need to find a way to get their customers to be loyal, to make sure that they're always purchasing through them. So what they wanted to do was create a big data analytics platform which would help them convince customers of all of the benefits of buying online as opposed to in stores. They also wanted to make sure that they were using data to drive their business insights, to do things like inform the supply chain, predict demand, and really just overall improve their logistics. They had a problem because their data was siloed. Business data, product data, transactional data-- it was all sitting in different places across their data center, with no means of communicating with each other. So they knew they needed to come up with a way to create a platform that could pull all this data together. And they did that in GCP.
So they were able to build an advanced analytics solution that could not only process the data through, but run advanced analytics against it, to do things like determine what would best improve customer satisfaction, how they could optimize their supply chain, how they could get access to insights about their business data in closer to real time, and ultimately reduce costs-- all of those common business goals that almost every company has. They did that by implementing, first and foremost, BigQuery as their data storage platform. BigQuery is where all of the data ends up to be analyzed. They do clickstream analysis, customer analysis, product and department analysis-- everything you can think of, all happening in BigQuery, and this is across over two petabytes of data. They also use Cloud Dataflow for their transformations-- you might remember Vince was talking about Dataflow quite a bit as well-- and they use Pub/Sub for all of their data delivery. So it's that common pattern of Pub/Sub, to Dataflow, to BigQuery. And they took it even a step further by integrating BigQuery directly with TensorFlow and Cloud Machine Learning Engine. Ocado was one of the very first customers to help us test the Cloud Machine Learning Engine.

What they did was create these really awesome data pipelines that do very real things. One example of that is the way that, when somebody submits an order, the items get picked, packed, and made ready for delivery. So let's say I'm going to make a curry for dinner tonight, and I put into my Ocado app that I want some vegetables, a protein, some lentils, rice, and some nice naan bread. It's around noon time, right? I'm on my lunch, not taking up my work time with ordering my groceries. I go ahead and enter that, and that's when the system really starts to work. Because now we have to analyze: where are these items? In what warehouse? What warehouse is closest to where it needs to be delivered? Are there perishables? What can be picked now? What has to be picked later? And so it determines the appropriate time to do these things and sends signals to the warehouses. And the warehouses have a number of robots that are used to pick all of the items in an automated fashion. But they don't just say, hey you, robot, you are responsible for Julie's order. In fact, there's ML swarm intelligence that they've created which helps the robots work collaboratively as they go through the warehouse. Many robots could be working on picking my items, as well as others', while making sure that they all end up in the right package to be shipped, and making sure that it's all done in such a way that everything ends up on people's doorsteps as fresh as can be, with no perishables spoiling. So it is really pretty cool what they've done. Even still, they're sending all of the telemetry data from the robots up to the Cloud and analyzing it, so they can do scheduling to make sure they have the right number of robots on the floor in each warehouse, and do things like predictive wear-and-tear analysis, so that if they think something might happen to a robot, they can pull it and replace it with a well-working robot before something happens and interrupts the workflow. So that's what they've done internally using this modern data pipeline. And they also determined that they do a really good job with e-commerce.
And so they thought, why don't we create an e-commerce platform that we can sell to other retailers? So for large brick-and-mortar retailers that also sell online, they created the Ocado Smart Platform, which leverages a lot of these same types of pipelines so that these other retailers across the globe can deliver the same kind of fantastic e-commerce experience to their customers that Ocado does.

Now, the next customer that I want to talk about is Brightcove-- a very different company. They're an online video platform. They serve content for internet TV, news media, various kinds of internet media-- think all video streaming. Brightcove serves 8,500 years' worth of video viewing each and every month, and they have 7 billion analytics events per day. So that's a lot of data-- that much video streaming per month, for all the people that are watching. And they were having a problem because their legacy system was bursting at the seams. They knew that they needed to re-architect and re-platform, so they investigated a number of different big data stacks, and they landed on GCP. And what did they do? They implemented a pipeline that went from Pub/Sub for event delivery, to Dataflow for all of their transformations, to BigQuery to land their data and be able to perform analytics. They chose this because all of those services can scale to whatever scale they need, without them having to worry about whether or not the infrastructure can handle it. So this story's a little bit shorter than the last one, but it's really interesting nonetheless just because of the sheer scale-- the amount of information that moves through the pipeline and gets analyzed every single day.

And speaking of scale, that brings us to our last customer that I'd like to talk about. So, are there any Spotify users in here? A few. I personally use Spotify every day. I absolutely love the service, so I was very excited to learn that Spotify chose to put all of their tech stack on GCP. We're not going to talk about the whole tech stack-- we're just going to talk about the cool stuff today, the data stuff, of course. Spotify had one of the largest Hadoop environments in Europe: a 2,500-node, 50,000-CPU-core Hadoop environment with 100 petabytes of capacity and 100 terabytes of RAM. It makes me shudder at the thought of the expense of that system. But it was really important, because over 20,000 jobs were running on the system per day, from 2,000 different workflows, supporting 100 different teams within Spotify. So you have to imagine it was very complex and very important. They couldn't just break it down and rebuild it somewhere else. But they knew that they wanted to get out of the on-prem world and out of the single-Hadoop-cluster world, and so they chose to come to GCP. And maybe you might recognize a pattern here. What do you think they chose for their ad hoc analytics and data storage? They chose BigQuery. Their BigQuery environment serves over 10 million queries and scheduled jobs every single month, and processes 500 petabytes of data every single month, for all of the different users that need to query against it. Also, you might guess that for event delivery, they chose Pub/Sub. And if you remember the scale of what Brightcove was doing-- 7 billion analytics events per day-- Spotify sends one trillion requests through Pub/Sub. And Pub/Sub is able to scale to handle that.
Not only can it handle it, but 99% of all requests that come through Pub/Sub have a maximum latency of 400 milliseconds. So I think they've got the scale and the low latency pretty much worked out for that part of the pipeline. Now, for data processing, again you might have guessed: they're using Dataflow. They run 5,000 Dataflow jobs per day. But they also realized that there were some things they wanted to continue to do on Hadoop-- some workloads, some ETL that they were doing in Hadoop that they wanted to keep there. And so they also introduced another service, which we haven't really talked about today, which is Cloud Dataproc, our managed Hadoop environment within GCP. The big difference between a traditional Hadoop environment and Dataproc is that we're able to decouple the storage from the compute, so now you can store your data in Google Cloud Storage and just re-point your Hadoop jobs to look at GCS instead of HDFS. Doing that enabled them to have very workflow-specific Hadoop environments that only spin up when it's time to run the job. They spin back down when it's finished, and they only pay for the compute resources when they absolutely need them. You might have seen-- I'm not sure if anybody sat in the session yesterday-- that Spotify did a session on their full migration, from the app tier all the way down to the back end. If you're interested, I definitely recommend taking a look at that if you want to see all of the thoughts and considerations they went through when designing their solution in GCP. But the moral of the story, the reason why I really bring it up today, is that it was huge for Spotify to no longer have to worry about the scale of infrastructure, the stretching of infrastructure, whether or not they had what they needed for the amount of data that was coming in, or the growth that was going to happen as new music licensing deals and new users come to the system. Now they can just analyze the data, understand listener behaviors, understand how music tastes correlate, and really build a fantastic music streaming and music recommendation experience for their consumers.

So we're nearly there. I know this is the very last session of all of Next, unless you're coming to the bootcamps tomorrow. But before we separate, there are a couple of things we wanted to do. First of all, does anybody want to know who might have, quote unquote, "won" the quiz-- answered the most questions right? You might see, in the upper right-hand corner of the app, which you may have logged out of already-- if you log back in, you're going to see what your username is. We didn't want to put anybody's email address or name up on the screen. So if we switch back over to the dashboard, we can have a look and see who answered the most questions correctly and also who answered the most accurately. So let's see. I can't see that far. Let's see who the user is.

KAYUN LAM: FO8YX2.

JULIE PRICE: So it's not a requirement, but if you're still sitting in here and you know that that's your username and you want a round of applause from 300 of your new best friends, go ahead and stand up. Anyone? Anyone? [APPLAUSE] It's a mystery. And we also have somebody who got 100% correct.

KAYUN LAM: Let me do a quick refresh.

JULIE PRICE: 80% correct.

KAYUN LAM: HJZSY2.

JULIE PRICE: So if you're here and you want the recognition, please go ahead and stand up. Again, not a requirement.
But thank you all for playing the game and for watching the "Modern Data Pipeline in Action." If we could switch back to the slides, there are just a few more things before we go. First of all, you have the opportunity to make this real for yourselves. Everything that we showed you here today, everything that we built in this modern data pipeline, up to the point of the visualization, you can build on your own with this Codelab. You have the link here, and you'll have access to the slides as well. Additionally-- did I go too fast? I'll wait until I see all phones down. Additionally, there are also a number of sessions that are relevant to what we talked about here today. There are sessions on Firebase, on Dataflow, on BigQuery, and on Looker, and how well they operate together. If you go into the Next website or app and filter by data and analytics, you'll be able to see lots of sessions about the things we covered today. Finally, we have some resources for you to learn more and get started with GCP. If you don't already have an account, you can get a free trial account so that you can build this out, and that free trial account will have more than enough credits in it for you to run through it. And also, if you could please, please, please, before you leave this room, open up the Next app and fill out the survey to let us know how we did, so that we can make next year's Next even better than this one. We'd really, really appreciate it. Thank you. [APPLAUSE] [MUSIC PLAYING]
Info
Channel: Google Cloud Tech
Views: 3,136
Rating: 4.86 out of 5
Keywords: type: Conference Talk (Full production); pr_pr: Google Cloud Next; purpose: Educate
Id: EN_RJ428i1g
Length: 46min 12sec (2772 seconds)
Published: Thu Jul 26 2018