Building stream processing pipelines with Dataflow

Captions
Hi everyone, I'm Ella, a customer engineer at Google Cloud. Welcome back to the technical series for startups, where we're creating a series of videos for technical enablement to help startups start, build, and grow their businesses successfully and sustainably with Google Cloud. At this stage we've taken you through some key storage and analytics services that will enable you to accelerate your business with Google Cloud. In our last video we introduced processing as a key component of your data architectures by explaining how you would manage Spark and Hadoop clusters on Google Cloud. However, for digital startups today there is so much more that you might need to achieve when it comes to processing, so in this video we'll introduce you to Dataflow and, in particular, how you can facilitate your stream processing pipelines.

In this session we'll answer the following questions: what can Dataflow bring to your business, and how do you start using it? We'll take you through a demonstration that addresses how you design, scale, and optimize your Dataflow pipelines, and then we'll look at how other businesses are implementing this service.

So what can Dataflow bring to your business? To set some context, classical analytics typically seeks to answer the question "what has happened?" when it comes to using data to make informed business decisions. While this is still relevant in many ways, it often leads to building dedicated infrastructure that hunts for data across many business systems and acts on data that's sometimes old. Increasingly, we are seeing businesses change that question to "what is happening?" by producing unified systems that integrate data in real time and produce live insights that can be acted on now. Making this shift isn't easy, and crossing the digital transformation chasm, so to speak, requires a bridge built with an end-to-end software stack to meet these real-time requirements, plus design patterns that enable this unified system.

So how do we go about building this bridge? Fortunately, we've built most of it for you with our managed service, Dataflow. It allows you to build a unified system for your extract, transform, and load processes that captures both real-time and batch data in a fast, scalable, and cost-effective pipeline.

The key advantages of Dataflow are that it is fully managed, so it removes your operational overhead when it comes to scaling, optimizing, and provisioning your processing pipeline. It is portable, because all pipelines are built with the open-source Apache Beam library. It is a unified processing pipeline, which means you don't need to reinvent the wheel when it comes to your batch versus streaming processes; you can easily use the same pipeline for both patterns. It offers comprehensive built-in support, which means your pipelines will be fault tolerant, consistent, and correct regardless of their complexity. It also surfaces throughput and lag statistics, as well as logging and system checks, which means that regardless of that complexity your pipelines are simple and seamless to manage. And lastly, it integrates easily with other Google Cloud products, giving you access to the best of cloud through native integrations with services like BigQuery, Vertex AI, Cloud Storage, and other Google Cloud services, so you can build rapidly and in a trusted environment.

This efficient and reliable streaming capability enables your business to move from reactive to proactive processing.
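To make the unified batch-and-streaming point concrete, here is a minimal Apache Beam sketch in Python. It is not taken from the video: the transform logic is defined once, and only the source changes between a bounded Cloud Storage read and an unbounded Pub/Sub read. The bucket path and topic name are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def shared_transforms(lines):
    """The same business logic, regardless of where the data comes from."""
    return (
        lines
        | "ParseCsv" >> beam.Map(lambda line: line.split(","))
        | "KeepValidRows" >> beam.Filter(lambda fields: len(fields) > 1)
    )


def build_pipeline(streaming: bool) -> beam.Pipeline:
    opts = PipelineOptions(streaming=streaming)
    p = beam.Pipeline(options=opts)

    if streaming:
        # Unbounded source: read live messages from a Pub/Sub topic (placeholder name).
        source = (
            p
            | "ReadPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
            | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        )
    else:
        # Bounded source: read files from a Cloud Storage bucket (placeholder path).
        source = p | "ReadGcs" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")

    # A real pipeline would end in a sink such as BigQuery; print keeps the sketch short.
    shared_transforms(source) | "Print" >> beam.Map(print)
    return p
```

Calling `build_pipeline(streaming=False).run()` processes the files as a batch job, while `build_pipeline(streaming=True).run()` keeps the very same transforms running continuously over the topic.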
In fact, one of the world's leading music streaming services describes how Dataflow is enabling this shift from reactive to proactive: before Dataflow, no one really wanted to write streaming jobs because they were too difficult or frustrating, and no one really did it at all; now, more than 40% to 50% of all new processing jobs are streaming.

So how do you start using this technology? When it comes to implementation, there are two pathways you can take. The first is for when you have simple requirements for your pipeline, with fairly trivial transformations, and you want to get it up and running quickly and easily. The second path is for when your ETL processing requires more complicated transformations and more intricate management of pipelines. If path one sounds like the right fit, use our pre-built Dataflow templates, right from the console or with the Google Cloud API. If path two is more suited to your development needs, you can build your pipeline with the Apache Beam software development kit and run it on Dataflow.

To dive further into these options: Dataflow templates allow you to create, manage, and adapt the most common use cases within Dataflow, all without really writing a line of code. We also have user-defined functions, which allow you to build on these pre-built templates to suit your business needs. Alternatively, Apache Beam is an open-source set of SDKs that facilitates interaction with the overall data processing model, and it is the language through which you specify your Dataflow processing pipelines. Within the Beam SDK you specify your pipeline as a data source and a data sink; between these you perform a series of PTransforms, which are operations or functions of sorts that are applied to objects called PCollections. Each PCollection is immutable, and PTransforms can be executed in parallel. Here's a short example of what that might look like for a small series of transformations performed using a Cloud Storage source and a BigQuery sink; the on-screen example is specified in Java, but Beam also supports Python and Go.
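The Java example shown on screen isn't reproduced in the captions, so here is a roughly equivalent sketch in Python: a Cloud Storage source, a few PTransforms, and a BigQuery sink. The bucket path, table name, field names, and schema are placeholders, not details from the video.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder locations; substitute your own bucket and table.
SOURCE = "gs://my-bucket/events/*.json"
TABLE = "my-project:my_dataset.events"

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        # Source: newline-delimited JSON files in Cloud Storage (a bounded PCollection).
        | "ReadFromGCS" >> beam.io.ReadFromText(SOURCE)
        # PTransforms: each step produces a new, immutable PCollection.
        | "ParseJson" >> beam.Map(json.loads)
        | "KeepCompleted" >> beam.Filter(lambda row: row.get("status") == "completed")
        | "SelectFields" >> beam.Map(lambda row: {"id": row["id"], "amount": row["amount"]})
        # Sink: append the results to a BigQuery table.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            TABLE,
            schema="id:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```

Run it locally with the default DirectRunner, or pass `--runner=DataflowRunner` along with a project, region, and temp location to execute the same code as a Dataflow job.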
Another key concept to familiarize yourself with when you start using Dataflow is windowing. In a data stream, new elements are always being produced, and we need some sort of strategy to ensure all elements are captured and processed exactly once. If you want to perform live aggregations on your data pipeline, you need to divide that data into event-time-based chunks called windows. By default, Dataflow uses a single global window, but this isn't always useful for real-time aggregation. Different windowing strategies you can use include fixed windows, where data is divided into discrete time chunks, aggregation occurs after each chunk, and each element can only belong to one window. There are also sliding windows, which are essentially overlapping fixed windows, allowing aggregation to be performed more frequently. Another common pattern is session windows, where the window size is defined by a gap in user activity, so that windows represent bursts of individual user activity; this is really common in e-commerce and gaming use cases and is supported in Dataflow.

Given these two pathways, let's dive into a simple way you can get up and running with either Dataflow templates or Apache Beam. We're going to work with a public dataset called New York City taxi rides, where real-time data on taxi rides across New York City is published to a public Pub/Sub topic. Let's say we want to store all of this data in our own data warehouse on BigQuery in order to perform some kind of real-time analytics on it. The architecture for that might look something like what we've got here: Pub/Sub, Dataflow, BigQuery. If we just want the live raw data without too many complex transformations, that sounds like a great use case for Dataflow templates.

You can see here that we've set up a BigQuery dataset and table with a schema that aligns to the JSON payload we expect to receive from the taxi rides public Pub/Sub topic. You can also see, by navigating to Preview, that there are no observations in the table yet; it's completely empty. This means we have our Pub/Sub source ready to go, as well as our BigQuery sink all set up. From here we navigate to the Dataflow console and start by creating a new job using a template. We'll use the Pub/Sub Topic to BigQuery template, and we'll specify things like the job name, the region to host the Dataflow instance in, the taxi rides public topic we want to listen to, a temporary storage bucket for our Dataflow pipeline, and the BigQuery table where we want our data to end up. From here it's as easy as clicking Run job. We can see that our job is now up and running, so let's head back to the BigQuery table to see if data is being ingested; and you can see that there is already live data streaming into the table we created, ready for analysis. Pretty cool, right?

However, what if I don't really want to store all of that raw data coming in from the Pub/Sub topic? Let's say I work in the finance department of a New York City taxi company and I want a live aggregate of the meter readings, so I can understand how much money is coming in for a given time window. That sounds like I'd need some more complex transformations and some windowing, which isn't necessarily covered by Dataflow templates; in this case we need to use something like Apache Beam to specify our own pipeline. So let's have a look at how you can launch a Beam pipeline on Dataflow.

Remember that the Apache Beam SDK is supported in Java, Python, and Go. Today we'll use the Python SDK so that we can take advantage of Google's managed Jupyter notebooks, which we can launch straight from the Dataflow console. I'm going to spin up an instance by navigating to Workbench in the Dataflow console, then user-managed notebooks, and then launch a new notebook, which gives me the option to select an Apache Beam-specific environment. From here I'll launch JupyterLab on this instance; note that you can of course use your local environment if you prefer. Now that I've launched into this instance, you can see my Apache Beam pipeline specified in a notebook file. Notice that I've specified the import for a Dataflow runner, the Pub/Sub topic I'd like to read from, and all of the options required for this streaming pipeline, which you can read more about on the Apache Beam website. Because we're reading from an unbounded source, I've created a sliding window at 10-second intervals with one second for each slide. I've also chosen to pull out the meter increment variable so that we can aggregate it and send the results to our BigQuery sink. Finally, we use the Beam I/O module to write to our BigQuery sink and then tell the pipeline to run. Once we run these cells in our notebook, we should see that a Dataflow job has been spun up. Back on the Dataflow console we can see that this job has been successfully launched, and when we navigate to BigQuery we can see that the live aggregate data is being streamed directly into our new table.
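The notebook itself isn't reproduced in the captions, so here is a hedged Python sketch of a pipeline along the same lines: read the public taxi rides topic, apply a 10-second sliding window that advances every second, sum the meter increments per window, and write the totals to BigQuery. The output table name and the exact field name are assumptions rather than details confirmed in the video.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

# The public NYC taxi rides topic mentioned in the demo; the output table is a placeholder.
TAXI_TOPIC = "projects/pubsub-public-data/topics/taxirides-realtime"
OUTPUT_TABLE = "my-project:taxi.meter_totals_by_window"

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # unbounded source, so run in streaming mode

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadTaxiRides" >> beam.io.ReadFromPubSub(topic=TAXI_TOPIC)
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # Assumed field name: each ride event carries a meter_increment value.
        | "ExtractMeterIncrement" >> beam.Map(lambda ride: ride["meter_increment"])
        # 10-second windows that slide every second, so each element lands in ten windows.
        | "SlidingWindow" >> beam.WindowInto(window.SlidingWindows(size=10, period=1))
        | "SumPerWindow" >> beam.CombineGlobally(sum).without_defaults()
        | "ToRow" >> beam.Map(lambda total: {"total_meter_increment": total})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            OUTPUT_TABLE,
            schema="total_meter_increment:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```

To launch it on Dataflow rather than locally, the options would also carry a runner, project, region, and temporary storage location, exactly as configured in the notebook demo.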
So now that you have these pipelines up and running, you can optimize them by observing and managing their execution graphs and job metrics. These are tools targeted at helping you set, measure, and meet your service level objectives, or SLOs. They give you a visual way to inspect data as it moves in and out of your system so that you can identify bottlenecks, and they also provide an automatic way to identify and tune performance and availability problems. Finally, when it comes to scaling your data architectures, Dataflow has been built to scale with you. If we pop the hood on the service, we can see that its secret sauce is in everything it does in the background; importantly, Dataflow chooses the appropriate number of worker instances required to run your job and will automatically scale to fewer or more workers at run time, based on the specific characteristics of your pipeline.

So how are other companies using Dataflow in their data architectures? MSES is a great example of how Dataflow can bring real-time capabilities to a data architecture. MSES is one of the world's leading marketing platform companies, enabling personalized one-to-one interactions between customers and marketers across all channels. MSES uses Pub/Sub, Dataflow, Bigtable, and BigQuery to build a scalable digital marketing platform that integrates real-time analytics and builds real-time AI models. By performing real-time event processing at scale, MSES was able to reduce the cost of their AI platform by 70% as well as significantly improve their user experience, which is pretty cool, right?

So if you've gotten to the end of this video and you still want to hear more about Dataflow, here are some steps you can take. Firstly, have a read of the Dataflow documentation, as well as some common Dataflow use cases. Watch some of our more in-depth demonstrations of the Dataflow service online. Get hands-on with our online courses. And finally, reach out and get connected with someone from Google Cloud about how Dataflow could work for your business. You'll find links for all of these items in the description below.

Next up in this series we'll move on to data visualization with Data Studio, where we'll cover what the data life cycle is, what Data Studio is and how it works, how you can start creating reports with Data Studio, and how other customers are using it with their visualization workloads. Don't forget to like and subscribe to our YouTube channel, and click the bell to get notified each time a new video is posted. Thank you for listening, and stay tuned for more on how you can start building your startup with Google Cloud.

[Music]
Info
Channel: Google Cloud Tech
Views: 5,188
Keywords: Google Cloud, Startup, Startups, GCP, VC, Series
Id: dXhF3JJg3mE
Length: 15min 16sec (916 seconds)
Published: Wed Aug 31 2022