AWS re:Invent 2019: [REPEAT 1] Building a streaming data platform with Amazon Kinesis (ANT326-R1)

Video Statistics and Information

Captions
hi everybody, good afternoon, and thank you for joining us today. I really appreciate all of you who decided to line up and wait to learn more about streaming data. My name is Adi and I manage the Kinesis Data Streams and Video Streams services at AWS. Today we're going to be talking about the evolution of streaming data on AWS through the various Kinesis services, but at the same time we're going to be spending a whole lot more time with the real star of the show: I'll be joined by Timothy Inge, who's the Vice President of Engineering at GoDaddy. GoDaddy and Tim's team have been on this journey over the last several months of moving from a more legacy-oriented, store-and-forward, batch-oriented world into a continuous, streaming, real-time architecture for a whole bunch of different use cases, and he'll be going into a lot of detail on the pragmatic lessons learned on the journey and the results so far.

For my part, I'll be providing an overview of real-time use cases. We've been on this journey of streaming data now for well north of six years; it was actually here in November 2013 that we launched the first Kinesis service, and I was the product manager back then for the service. Since then we've helped tens of thousands of customers across the world with hundreds of thousands of different kinds of use cases, and hopefully what you learn today is some of the distillation of those experiences.

Having said that, let's do a show of hands: how many of you today have a Kinesis-based, or even a non-Kinesis, streaming data workload in production? Okay, I would say about 40%. And how many of you are at the more beginning stages of the journey, where you're dabbling with it, have a POC or two, or are just curious about what streaming data means? Show of hands, please. Got it, I would say about the same. As you can appreciate, one of the challenges with these sessions is that you find customers on all ends of the spectrum, so we'll do our best: I will provide a more high-level overview of the services and drill down into a few distinct, diverse use cases, and then Tim will come in to go much, much deeper.

So, just some really quick context setting. Data in and of itself is not about streaming; data exists, and it is up to us, up to customers, to drive it into a more streaming form to deliver those use cases for which time criticality of the decision is very important. Again, data in and of itself doesn't wish to be batched or wish to be streamed; it is what we decide to do with it. But over time what we have learned is that there is conceptually a kind of time value to the data: when businesses respond and react to certain kinds of data in real time, they tend to either drive more efficiencies, save costs, drive more revenue, engage customers, or prevent attrition, you get the idea. This diagram thematically captures that, and it's by Mike Gualtieri, one of the VPs of research at Forrester.

In more pragmatic terms, though, we define the streaming data domain in the milliseconds-to-minutes window, because when we talk about real time, you go, what is real time? And what's real time to you may be different than what's real time for somebody else. Most use cases in the millisecond domain are typically those that we would call messaging-style use cases, for communication between a variety of microservices.
The seconds domain is where the majority of the core log-ingestion-style use cases fall: customer experience logs, user experience logs, IT infrastructure logs, application logs, security logs, you name it, and that is the vast majority of streaming data that we see today across any set of customers. The other big set of use cases, not quite the same volume but arguably of the same importance, is change data capture use cases. We've seen a lot more of those happen, primarily because there's been an explosion in the kinds of databases, and the reliability and cost of converting that data flow into a streaming model has made it far easier and more accessible to do change data capture use cases than ever before. And the minutes time frame is when you have all of this ingested data that is now being transformed, pre-processed, format-converted, metadata-decorated, and so on, as it makes its way into a data lake of sorts, or into any number of other analytical stores and destinations from where other downstream processing is kicked off.

There is a canonical streaming data flow that comprises the following layers, and if you look across any sort of use case, I'd bet top dollar that you will find components conceptually falling into these buckets. There is the source of data: the thing, the application, the piece of IT infrastructure that's going to generate the actual raw data that you care about. There is an ingestion mechanism: a software library, a toolkit, an SDK, an agent that is going to be responsible for reliably publishing that data into a streaming storage system. Now, when we talk about streaming, people always go, is it storage, is it a buffer, is it a queue, and we get into all sorts of conversations; for the purposes of simplicity, it is a kind of storage system. It reliably captures, durably stores, cares a lot about ordering, and allows multiple different applications to simultaneously and independently consume that stream and always play it back in the exact same order. That is really the hallmark of a streaming storage system. The data has to be processed, so there is a stream-processing counterpart, and there are a whole number of different frameworks and managed services that allow you to do this real-time or continuous stream processing. And invariably the streaming data is not discarded after it has been transformed or processed in some shape or form; it will find its way into a data lake. On AWS, S3 is by far the most common place where we see streaming data end up, but we also see data warehouses, NoSQL stores, things like Elasticsearch, and a variety of other places where customers decide, for their use cases, to land streaming data on AWS.

Over the last six years we've built out a number of services that work together to enable a whole bunch of different use cases. As I was mentioning earlier, six years ago on this very platform we announced Amazon Kinesis Data Streams: a developer-oriented streaming storage platform for developers to build and operate their own stream-processing applications using any framework of their choice, targeting second or sub-second use cases. Kinesis Data Firehose aims to make it bonehead simple to load streaming data directly into specific AWS destinations like S3, Redshift, and Elasticsearch, and they work together: data published into a Kinesis data stream can be Firehosed into S3, a pattern we see happen all the time. And the stream-processing counterpart for the streaming services is Kinesis Data Analytics, which is available for customers to express their business logic in both SQL and Java; I'll give you a little overview of that a few minutes from now.
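To make the ingestion layer described above concrete, here is a minimal sketch, with a hypothetical stream name and event shape, of publishing a record into a Kinesis data stream using the AWS SDK for Python (boto3); the partition key determines which shard a record lands on and therefore the ordering scope.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Hypothetical clickstream event and stream name, purely for illustration.
event = {"user_id": "u-123", "action": "add_to_cart", "ts": "2019-12-03T10:15:00Z"}

response = kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    # Records with the same partition key go to the same shard, preserving order.
    PartitionKey=event["user_id"],
)
print(response["ShardId"], response["SequenceNumber"])
```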
The sources of data at hand are, as I mentioned, very diverse. So when you think about what use cases make sense for you: mobile apps, website clickstream, security logs, metering and billing data, IoT sensor data, all of these are real-life examples of data that are streamed today at addressable scale. In terms of how you actually get that data published into a stream endpoint, you've got options. Largely, there is a set of AWS libraries and toolkits, starting from the SDK, the Kinesis Producer Library, agents, and so on and so forth, and there is also a variety of open-source or third-party software that customers can build, install, and configure for their sources of data to emit it in a streaming form. A number of AWS services also publish their relevant data into Kinesis streams, such as IoT Core, CloudWatch, CloudWatch Logs, and CloudWatch Events, and through the Database Migration Service, change logs can also be shipped into a Kinesis stream as one of the targets.

As I mentioned before, Kinesis Data Streams is the storage layer where customers provision and set their capacity and ingest data at very low cost and at high scale. The service has been around for six years, and at its core it is a highly durable, replicated log file system that is fully multi-tenant, runs across three Availability Zones, and is what we would call a foundational service at AWS. Lots of other AWS services, for example CloudWatch and CloudWatch Logs, are built on top of Kinesis Data Streams, so that's how we think about the service in terms of how essential it is to the functioning of AWS, really, and I don't think that's an exaggeration. The purpose here, of course, is for developers to be able to build their own stream-processing applications consuming from Kinesis streams: you can use Kinesis Data Analytics, you can use Lambda, you can use Spark Streaming or Flink running on EMR or on self-managed infrastructure on AWS. There are two consumption models, differing primarily in terms of the dedicated throughput provided to the consumers and the read-after-write, or put-to-get, latency available. Standard consumers see on average about 200 milliseconds of latency from when the data is put, replicated, and durably stored to when it is made available to the consuming application, and with the enhanced fan-out consumer type that shrinks down to about 70 milliseconds on average.

Firehose is a different service with its own APIs, and its purpose is really to provide customers with a zero-admin, zero-operator-overhead way of loading streaming data into destinations like S3, Redshift, and Elasticsearch. This is near real time. Why? It's working backwards from the destination, like S3 or Elasticsearch, so it will buffer, compress, and encrypt the data and make sure it does world-class, reliable, durable writing of the data into those specific destinations, and that's configurable to a certain degree by customers themselves. It also, along the way, does some amount of data format conversion to shape the data such that an analytical process can run more efficiently. The key difference between the two, as I mentioned earlier: when developers want to build their own stream-processing applications with sub-second latency, start with Streams; if your use case is really just about getting streaming data loaded into those specific destinations with a small degree of transformation, start with Firehose.
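As a rough illustration of the standard (polling) consumption model mentioned above, the sketch below reads one shard from the beginning of the retained stream with boto3; the stream name is assumed, and an enhanced fan-out consumer would instead register itself and subscribe to the shard for the lower-latency push model.

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "clickstream-events"  # assumed stream name

# Read a single shard from the oldest retained record (TRIM_HORIZON); every
# consumer gets its own iterator, so applications replay independently and in order.
shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(json.loads(record["Data"]))
    iterator = batch["NextShardIterator"]
    time.sleep(1)  # stay well under the per-shard GetRecords limits
```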
When in doubt, you can always start with Streams. Why? Because you can attach a Firehose to it and get the data directly into the destination of choice without any extra legwork on your part.

In the stream-processing domain, just like with publishing, there are lots of options. The ones that we talk about here are Kinesis Data Analytics for SQL and for Java, but you can use Lambda, with which the integration has existed since Lambda launched way back when, and of course all the key open-source Apache stream-processing frameworks that run on EMR, or in any other form you choose to run them, also consume and can process the streaming data. The SQL-oriented use cases are really about not having developers learn a brand-new programming model or framework, and being able to express simple and complex application logic within the SQL construct. There is a whole bunch of different pre-built functions, which is one of the great advantages, so logic can be expressed succinctly, with reasonably fast performance, while lowering the bar to entry for someone who's trying to get started with streaming data, and over the years we've also built pre-compiled machine learning functions that can be invoked as part of the Kinesis Data Analytics for SQL offering. Kinesis Data Analytics for Java was announced and launched this time last year, at last re:Invent, and this is effectively fully managed Apache Flink; for those of you in the know, Flink is one of the more popular, robust, and sophisticated stream-processing frameworks that exists today. Customers can build their own JAR and effectively upload it into the KDA for Java service, where the service is responsible for taking the JAR, parsing it, compiling it, mapping it down to the operator graph, running it across a distributed set of compute instances, and managing the reliable, scalable processing behind that, and of course all the goodness of Flink is naturally inherited, because that's the programming model that is being expressed here.

Now, to tie this all together: the source, the ingestion mechanisms, the stream store, the processing layer, and the destination. You will note that as you start on your journey, or if you're deep in your journey onboarding new use cases, you will have these components to varying degrees. Let me spend a minute talking to you about one use case. This use case is from a leading provider of 3D design and engineering software, delivered as a large-scale service to millions of users worldwide. This customer runs on AWS; the majority of this software-as-a-service runs as numerous microservices on EC2 and container services, as well as Lambda. All customer experience interactions are captured inside a Kinesis data stream, on the order of several terabytes a day, and as the data flows in, in real time, on the top part you see how Firehose is a consumer of the Kinesis stream and emits that data into Elasticsearch; now the customer has an operational view into what their customer usage looks like via Elasticsearch. All the data is compressed, batched, and then sent into S3, from where more business-oriented users can use something like Athena to query against that data, so they're still in the minutes, maybe an hour, time frame, which is pretty reasonable for a core business user, without building very complex ETL logic and a batch-oriented workflow.
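The "attach a Firehose to the stream" pattern described here can be set up with a single API call; a rough sketch with boto3 follows, where the stream ARN, role ARNs, and bucket name are all placeholder values.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# All ARNs below are placeholders; the roles must let Firehose read the
# stream and write to the bucket.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream-events",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-stream",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3",
        "BucketARN": "arn:aws:s3:::example-data-lake-raw",
        "Prefix": "clickstream/",
        # Firehose buffers, compresses, and then writes batches to S3.
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 64},
        "CompressionFormat": "GZIP",
    },
)
```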
Kinesis Data Analytics, on the bottom, is also a consumer of the same Kinesis stream, generating a variety of customer experience metrics, sums, percentiles, derived measures, that are emitted into CloudWatch, from where they have their monitoring hooked up. This scales at a very low cost, and they have near-real-time or real-time access to their customer experience.

Here's another very different example. This is one of India's leading supply chain services companies, and they have successfully fulfilled close to half a billion orders across more than 50 million households in India. This is about the moment a customer clicks on their virtual shopping cart: having that event be generated and tracked from when the physical item is in the fulfillment center, to when it gets pushed onto a delivery truck, gets moved to a sortation center from where a last-mile provider picks it up, and then delivers the package to the customer's doorstep. Each of those events is emitted through their set of services, which are fully containerized, into a Kinesis stream, from where they use a series of Lambda functions not just to fan out those different business events to a whole number of other downstream systems, but also to do a little bit of pre-processing or post-processing, in this case, and put the data into Elasticsearch for broader operational use cases. That's a lot of discrete event data being captured and processed in real time in Kinesis streams, a very different use case from your typical customer-experience log ingestion.

And then a third and final use case; this one's just fun, for me at least. This is about a company who wants to show how data offers players, coaches, sporting team management, and fans a very different way to engage with sports. The company has a different take on imbuing existing sporting equipment, in this case cricket, golf, and rugby balls, fitting them with sensors that can track motion, velocity, direction, force, and spin. That data is emitted in real time into a Kinesis stream, from where a few different actions are taken. Again, like before, you will see this pattern emerging: invariably a large segment of the data is stored in S3, in this case via Firehose, and in this case Lambda is used to generate real-time metrics on that data that are emitted into a DynamoDB table, from where those metrics are served up to the various constituents, teams, players, and so forth. So it's a fascinating use case, but this time in the IoT domain.
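For the sports-sensor example, a Lambda function subscribed to the stream might look roughly like the sketch below; the event field names and the DynamoDB table are invented for illustration.

```python
import base64
import json
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("ball-metrics")  # hypothetical table

def handler(event, context):
    """Invoked by a Kinesis event source mapping with a batch of records
    from one shard, delivered in order."""
    for record in event["Records"]:
        payload = json.loads(
            base64.b64decode(record["kinesis"]["data"]), parse_float=Decimal
        )
        # Hypothetical reading: {"ball_id": "...", "speed_mps": ..., "spin_rpm": ...}
        table.update_item(
            Key={"ball_id": payload["ball_id"]},
            UpdateExpression="SET last_speed = :s, last_spin = :r",
            ExpressionAttributeValues={
                ":s": payload["speed_mps"],
                ":r": payload["spin_rpm"],
            },
        )
```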
So hopefully that was a whirlwind tour, but hopefully you now have a sense of what streaming data is, how the different layers are organized, some sense of the Kinesis services and the role they play, and how some customer use cases have been solved. Now I'm going to hand over to Timothy Inge; as I mentioned, Tim is the VP of Engineering at GoDaddy, and he's going to go deep into what a streaming data journey really looks like from a practitioner's approach.

All right, well, it's good to be here this afternoon; thank you so much for your time. Before we get started, I wanted to talk a little bit about what GoDaddy is doing, because that sets up the context for why we invested in a streaming data platform. Most of you may know GoDaddy as a domains and hosting company, but over the last few years we've been evolving our mission to empower entrepreneurs everywhere. As a result of this mission we've expanded our product portfolio, unifying it under a single experience which is designed around helping our small business customers be successful at what they do. Behind the scenes, all of our products, services, and experiences generate a lot of data, and we broadly classify our data into four categories. The first is logs: your typical application or service logs that are generated just by running your service or your products. The second is e-commerce data, which is the lifeblood of our company: data that describes what a customer has purchased. The third is interaction data: data that describes how our customers are interacting with their website, or with our merchandising experience, or with our products. And the final category is product data: data such as DNS settings or website content that our customers have.

So how do we process all of this data? The reality is that we've been on a data platform evolution. We started probably where most of you started, which is we just issued queries against SQL databases, and sometimes we would load the data into Excel for visualization. That evolved into what we call big data: we deployed an on-premise Hadoop infrastructure, copied the data from SQL onto Hadoop, and did our processing there using tools like Hive or Pig or Spark and so forth. We generally powered things like analytics with these, but over time we started to power other types of workload, like marketing workloads. And so the question for us was: what's the next step in the data platform evolution?

To consider that, we looked at the challenges that we faced as a business, and there were four areas we thought about as we were trying to figure out how to evolve the data platform moving forward. The first is service re-architecture: some of our services were undergoing a change, from SOAP-based services built on top of highly schematized SQL tables into very decoupled microservices, maybe built on key-value stores like Dynamo, and this presented a major change for the data platform. The second is slow and unreliable data processing: it often took days to process data in Hadoop, and oftentimes we didn't apply engineering rigor to our data processing, so sometimes the data was unreliable and not trustworthy. Third, the data platform was originally built for humans, so we had a lot of ad hoc data processing, which is perfectly fine when an analyst is looking at the data, but it starts to show its cracks when you're trying to build product or customer experiences on top of ad hoc data processing. And the final challenge is that we just ran into private data center limitations.

As a result of these challenges, and probably a few more, we formulated a few goals for the next stage of the evolution of our data platform. The first was that we wanted to migrate to AWS: we wanted to take advantage of all of the services and capabilities that Adi mentioned and be able to use those things for ourselves. The second was that we wanted to think about data as a service: that is, how do we think about data built not just for humans, and what does it mean to build product and customer experiences on top of a data platform? And the third goal was to invest in low-latency data processing.

So why did we invest in streaming, or low-latency data processing? Why did GoDaddy invest, and why would you be interested in it? The reality for us is that there were three different motivations that caused us to invest in low-latency data processing. The first is our business: as our business evolves and grows, there is demand for richer and more timely data in our analytic functions. We want to make better decisions with fresh and reliable data, and beyond our analytical functions, we want to power better customer touch points and customer experiences through the use of data. The second motivation is a technical one: as I mentioned, the change to microservices and to decoupled key-value stores presents a major challenge, so what we do is stream this data from these services into the data platform, where we can combine all of it, and this replaces the SQL-based aggregations that were typically done with stored procedures on your service databases. For us, this separation of reads and writes is really critical for transactional service performance. And the third motivation is really around machine learning: batch and bulk data processing is very useful for training, but inference often demands low-latency data, because inference is often used in scenarios that affect customers, and you need low-latency data to make low-latency inferences.

So what did we build? We organized our thoughts around three workloads. The first is that we wanted to build a set of APIs for our products and tools to integrate with; these APIs would give those services access to the data that was processed by the data platform, and we wanted to make sure the APIs were designed with what the services and tools need in mind. The business logic was actually implemented in the data platform, which meant we didn't have to push that business logic up to the transactional services or stored procedures in their databases, and that was a huge win for us. The second workload was around data notifications: services can also subscribe and get notified when data changes. So you have two options: you can ask for data using APIs, or you can be notified about data changes with notifications. And our last workload was a low-latency Tableau dashboard for analytics purposes.

So how did we build this? When we thought about the migration to AWS, we decided that we were going to build our streaming data platform on the Kinesis family of products. Why did we choose Kinesis? There were four reasons. The first is that a lot of customers use Kinesis, so there's a lot of information and knowledge sharing out in the world, which was really important for us because we didn't know a lot and we relied on a lot of learnings that were available. The second is that Kinesis is fully managed: we had a really hard time running our own Kafka infrastructure on-premise, and we didn't want to do that again as we moved to AWS. The third is elastic scale: again, we had a really hard time trying to implement elastic scale in our data center, and that's just something we didn't want to have to deal with. And finally, as Adi mentioned, Kinesis has integrations with lots of different AWS services, and that allows us to focus on the things that matter for the business and not on tinkering with infrastructure, glue, and connectivity.

Our team was new to both stream processing and AWS when we started this journey, so we decided that we were going to build the streaming data platform on AWS for very simple new scenarios, and we were going to support the existing business on our data platform as is. This allowed us to focus on something very simple and iterate very quickly to add features to the streaming platform, without having to worry about the complexities of our existing business. To build the new streaming platform, we thought about three phases.

Phase one was about learning. We had a lot to learn, and we organized our learning into two areas: we had to learn about AWS, and we had to learn about stream processing. For stream processing, we evaluated a lot of different technologies and selected Apache Beam as our stream-processing library. We liked Apache Beam because it can run on Spark, which is what we happened to have in our private data center, and it can run on Flink, which we knew at the time Amazon was going to support via Kinesis Data Analytics for Java applications. We also liked Beam because the programming model made sense to us, so we invested in learning more about it: we experimented with things like bounded streams, unbounded streams, and all the different types of windows and triggers, and we had to get deep to understand the Beam programming model. We were able to do this running on Spark, and in that way we could decouple our learning. For AWS, one of the things we knew we wanted to do was to operationalize the data platform. What that meant was that, starting from a new AWS account, we wanted to be able to provision and deploy a data platform from scratch in a fully automated way, so we invested in and learned about technologies like CloudFormation and Helm charts, to the point where we can fully provision a data platform. We also invested in things like CI/CD, so that changes to business logic in the data platform are treated the same way as changes to business logic in a service, complete with blue/green deployments. We brought these things together by deploying Beam on Flink on EKS on Amazon in a fully automated way.

That set the stage for phase two, and this is where we got into the actual streaming workload. As I mentioned, we kept it simple and identified a very simple end-to-end scenario that would allow us to test how to build a low-latency streaming data platform. We focused on the three workloads: we wanted an API integration, a notification integration, and a reporting Tableau integration, and we wanted to do it with the engineering excellence you would expect from a service that is customer-facing. So how did we do it? I'm going to go over a high-level architecture first, and then walk through a specific streaming example.

The first thing we focused on was what we call data ingress, which is how you bring data from the various services into Kinesis Data Streams. As Adi mentioned, there are lots of different ways, but suffice to say our goal here was to make it very easy for the rest of the business to bring their data into Kinesis, and to do it in a way where we had metrics, monitoring, and schema validation, so that we could detect errors as close to the source as possible. During our learning phase, as I mentioned, we deployed Flink on EKS, and we knew that at some point we would migrate that over to Kinesis Data Analytics.
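As a rough sketch of what "Beam on Flink" means in practice (the talk doesn't show GoDaddy's actual pipeline code, and the team may well have used the Java SDK), Beam's Flink runner can be selected purely through pipeline options; the Flink master address below is a placeholder for a JobManager service running inside an EKS cluster, and the exact flag names can vary between Beam versions.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder endpoint for a Flink JobManager exposed inside the EKS cluster.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=flink-jobmanager:8081",
    "--streaming",
])

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["hello", "streaming", "world"])
        | "Print" >> beam.Map(print)
    )
```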
The next thing we invested in was building a set of libraries, an SDK, around Beam. The idea was that we knew, in order to scale, we had to empower our developers to build business logic without having to worry about a lot of the complexities that come with distributed systems, streaming, metrics, monitoring, all those things, and so we invested in libraries, SDKs, and tools that allow our developers to build business logic in a very simple way. We started with a simple integration with Aurora databases: we had an integration to an Aurora database with an API stood up that allowed you to query events, so our end-to-end data flow took events from Kinesis, ran them through Beam, and then wrote those events into Aurora. That was kind of our first start.

One of the next things we did was to support replay and disaster recovery; I'll talk about that a little bit more in a later slide, but what it meant was that we set up Firehose, as Adi mentioned: we Firehose all of the Kinesis events that come in into S3. One of the consequences of this is that we actually started building our data lake with this, because we registered those raw events into a central Glue catalog. On the other end of the data platform we invested in additional integrations, so beyond Aurora we also integrate with DynamoDB, with SNS, and with S3 as well, and this formed another part of the data lake: the processed data is also entered into the data lake and the Glue catalog. So that's the high-level picture of what our data platform looks like.

Now, when you look at this you might ask, why do we need all of this machinery if all we're going to do is take an event that comes into Kinesis and write that event out to Aurora? The reality is that the complexity of streaming comes when you want to do operations or business logic that require memory; that is, you cannot produce a result simply by looking at the event that comes in, you need something else. Maybe you need to know about an event that happened before, or maybe you need to wait for a future event to arrive before you can process this particular event. Services like Kinesis Data Analytics, or Flink, make this possible, and this is what we call stateful stream processing.

I'm going to go through a simple example of stateful stream processing, and a good example of stateful stream processing is aggregation. In our example we're going to get orders from e-commerce and we want to aggregate, that is, count how many orders came through, and emit the count as a result. In this case, the thing that we need to remember, the state, is how many orders we have seen so far. So how does this work? We have a piece of technology that we call the data importer. The data importer has a very simple existence in life: its job is to bring data from a service, like the e-commerce service, into Kinesis, and like I said earlier, we want to do it in a way where we have metrics on it, where we can count and compare how many orders we're ingesting, and where we do schema validation and whatnot, so that we can detect errors as close to the source as possible. We also Firehose all of these events into S3, as I mentioned. So at this point we have the data in Kinesis and we have the order data in S3.

This enters our primary business logic, the data platform built on Flink, via a piece of common code that we call the fuser, which is part of the SDK that we built. The fuser's goal in life is to abstract, away from our business logic, the fact that we have archived events in S3 and real-time or live events coming into Kinesis; what it does is mash these two things together to produce a unified event stream. Our business logic then processes this unified event stream, and like I mentioned, in this case the thing that we need to remember is how many orders we have seen before; that's the piece of memory that we need to keep, and in a stateful data platform we store this in Flink state. Now, if you're using Kinesis Data Analytics, Amazon manages this state for you and you don't have to worry about it, but if you're deploying Flink yourself, you have to configure a state database; in our case we use RocksDB, which is kind of like the state database for Flink. This means that you have to tune RocksDB appropriately for your workload, which is a little bit difficult, but either way, whether you use Kinesis Data Analytics or Flink with RocksDB, you have state which is stored inside the stateful platform. As an order comes in through the fuser, the stateful transform executes, and in this case the stateful transform executes something that we call a combine operation. What a combine operation does is take the state and, since we're doing counting, add one to it, and then send that out to the endpoints. As I mentioned earlier, we have several integrations: we write the count to Aurora, where there's an API that you can query and ask for the count; we write the count to DynamoDB, where you can do a key-value lookup for the count; and we write the count to SNS, where you can get a notification when the order count changes.
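The order-count pipeline described here could be expressed in Beam roughly as follows, shown with the Python SDK on a small bounded test input; in the real platform the input is the unbounded fused Kinesis/S3 stream and the updated count is written to Aurora, DynamoDB, and SNS rather than printed.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

# Stand-in order events; in the real platform these arrive unbounded via the fuser.
orders = [{"order_id": "o-1"}, {"order_id": "o-2"}, {"order_id": "o-3"}]

with beam.Pipeline() as p:
    counts = (
        p
        | "ReadOrders" >> beam.Create(orders)
        # Global window with a per-element trigger: emit an updated running
        # count as each order arrives, accumulating state across firings.
        | "GlobalWindow" >> beam.WindowInto(
            window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterCount(1)),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        )
        | "CountOrders" >> beam.CombineGlobally(
            beam.combiners.CountCombineFn()
        ).without_defaults()
    )
    counts | "Emit" >> beam.Map(print)  # stand-in for the Aurora/DynamoDB/SNS sinks
```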
So at the end, this is what we achieved: we were able to build a single streaming data platform for a very, very simple scenario. The end-to-end latency, from when an event happened to when it was delivered to one of the destination endpoints, we measured at about 15 seconds, although I've heard that recently we made some improvements to bring that down to under about 500 milliseconds. And we've done all of this with engineering excellence: we have unit testing, functional testing, integration testing, CI/CD, blue/green deployments, disaster recovery, the ability to rebuild and replay, and monitoring, and all of these things are really important.

I'm going to touch a little bit on disaster recovery and rebuild now. As you saw in the previous slide, we wrote everything to S3, and one of the questions we asked is, why do we archive this stuff? Why don't we just have a streaming data platform that just reads from Kinesis as data comes in? The reason is that we want the ability to replay what's in the data platform with events from day zero. Why would you want to do that? There are a couple of reasons we invested in this. The first is if business logic changes: we want to be able to replay and re-execute the business logic on all of the data from day zero. Another reason is disaster recovery: we might lose our Aurora database, and one of the ways we can recover it is by simply re-running the data platform with all of the events from day zero. A third reason is that you may want to add a new integration: for example, we might want to add a Redshift integration, and when we do that, we might want to flow all of history into Redshift so that it has all of the latest data as well.

So about six to eight weeks ago we put this data platform, as I've described it, into production, and it started taking production traffic for the first time. The next thing we tackled was: okay, this is great for services that support streaming, but how do we support our existing workloads, services that are built on SQL? We couldn't go to the business and say, well, you can't get streaming unless you re-architect all of your services to support streaming. We wanted to support streaming in such a way that we could leverage the data platform we had just built and deployed, and we had a goal to do it without changing the existing SQL-based services, with no changes to them at all. The way we did this was with a technology called change data capture, which Adi mentioned. Change data capture is a SQL feature which is effectively a table of changes that you can read from if you run the CDC agent, usually on a replicated database. Now, one of the gotchas with change data capture is that it operates at the SQL table level, so you get a change data capture stream for each table that you have. In our legacy system, our existing SQL-based workloads, our services were highly normalized, so a logical entity, like an order in this case, was typically represented by many SQL tables, and as you stream from CDC you have to join them, which is one of the reasons we invested in streaming joins in the data platform.

So what does this look like? All we had to do was make a very simple change: we added a capability to the data importer to read from a CDC table. What that meant was that we were able to take the CDC entries coming out of SQL, in a very low-latency way, and bring them into Kinesis. At that point, for the data platform, it all just becomes data, it's just schema, and so we modified our business logic to account for not just the schema of the original new e-commerce service that we built, but also the schema coming in from CDC, and now suddenly we have streaming, low-latency data processing on existing workloads built on SQL Server.

I want to highlight some metrics for you. We were able to do all of this with the same kind of latency that we saw with the integration with the new services that supported streaming fully; in other words, there was no latency hit to using SQL Server CDC. We have about one billion records or so in our existing SQL databases, and we tested some throughput: we were able to process all of those records in about three hours using eight small-sized EC2 instances for our EKS cluster. We've experimented with paying more money for larger instances, and we could get that down to about 30 minutes, or we could pay less money and take much more time, but the point is that we have elastic scale for data processing. We also were able to achieve five million writes per minute into Aurora, which is one of the endpoints that we had. For us, the most exciting thing was the ability to stream data from existing workloads built on SQL Server onto Amazon and have low-latency data processing over all of that data.
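The data importer itself isn't shown in the talk (we only learn later in the Q&A that it's written in Python, runs on EKS, and reads the CDC tables directly), but a heavily simplified sketch of the idea might look like this; the DSN, the change-table name, and the payload columns follow the standard SQL Server CDC shape, while the stream name and checkpoint handling are invented for illustration.

```python
import json
import time

import boto3
import pyodbc  # assumed ODBC driver for the SQL Server read replica

kinesis = boto3.client("kinesis")
conn = pyodbc.connect("DSN=orders-replica")  # placeholder connection string

last_lsn = bytes(10)  # in a real importer this checkpoint would be persisted

def poll_cdc_once():
    """Read new rows from a SQL Server CDC change table and publish them to
    Kinesis. 'cdc.dbo_orders_CT' and the payload columns are illustrative."""
    global last_lsn
    rows = conn.execute(
        "SELECT __$start_lsn, __$operation, order_id, total "
        "FROM cdc.dbo_orders_CT WHERE __$start_lsn > ? ORDER BY __$start_lsn",
        last_lsn,
    ).fetchall()
    records = [
        {
            "Data": json.dumps(
                {"op": row[1], "order_id": row[2], "total": str(row[3])}
            ).encode("utf-8"),
            "PartitionKey": str(row[2]),  # keep changes to one order in order
        }
        for row in rows
    ]
    for i in range(0, len(records), 500):  # PutRecords accepts at most 500 records
        kinesis.put_records(StreamName="orders-cdc", Records=records[i : i + 500])
    if rows:
        last_lsn = rows[-1][0]

while True:
    poll_cdc_once()
    time.sleep(1)
```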
So what's next for us? We have a lot of things that we need to do and that we want to do, but one of the things that we've been exploring along with the Amazon team is migrating our Flink cluster onto Kinesis Data Analytics. As I mentioned, it's hard to manage your own Flink cluster; you have to deal with tuning and RocksDB management and all of these types of things that we don't really want to be in the business of, so we're really excited about working with the Amazon team on this, and this is what our future, or desired, architecture will look like.

So overall, this is our journey so far. Our journey isn't over, there's a lot more to be done, but so far what we have is a stateful stream-processing data platform, and we're able to use it to power all of the integrations that we desired: we can power our product experiences through an API, we can power notification-type experiences, and we can power low-latency analytics experiences.

There are a few things that we've learned through this journey, and I want to share a couple of these learnings with you. The first is that stateful streaming is a relatively new area, and our approach, which we thought was a good approach, was to learn stateful streaming in a very incremental, small way. One of the things that we learned is that there were some cost surprises for us. With batch data processing, the kinds of things that you typically log are things like which stage of the batch job has finished, and did the batch job complete successfully end to end. With stateful stream processing, we often had to add logs for every event that came in, and so suddenly the volume of our logging increased dramatically, which was something we did not expect when we started on our journey. The second thing I would describe is that distributed scaling is a very hard task: scaling Flink and scaling EKS is challenging, there are a lot of knobs, a lot of things that you have to do, and it's been a challenge to get the data platform stable, to be honest. That's why we're looking forward to something like Kinesis Data Analytics, because we don't have to do that work; we can rely on Amazon to do it for us. The thing is, a lot of things kind of just work when you're doing five to ten transactions per second and your Flink cluster is really running on just one node, but when you're pushing tens of thousands of transactions per second, suddenly you have to worry about a lot of different things as you increase the number of nodes you have to manage in EKS: you have to worry about logging and diagnostics and alerts, and how you know what went wrong. And really, one of the last things is that debugging a distributed streaming data platform is really hard, and we organized the way we thought about debugging along two questions. The first is: how do you know something went wrong? In a batch process that's very simple: your Airflow job either succeeded or it didn't. But with stream processing, maybe one event didn't get processed; how do you know that an event didn't get processed? We had to invest in a lot of monitoring and tools to be able to answer that question. The second question we had to answer is: how do you recover from that situation? Let's suppose that something went wrong; how do you recover from it? Again, in a batch job you just rerun the batch job, but in a streaming world where you're actually affecting customer experience, that's not really an option. One of the things that we found is that debugging and running the service is pretty hard, partly because our data platform is now used in customer-facing scenarios, and as a result, all of the things that we worry about when we're building services, like mean time to resolution, apply to the data platform as well.

The final thing I want to leave you with is that we learned a lot from last year when we were here at re:Invent; we talked to a lot of different customers, some of you may be in this room, and we spent a lot of time just learning from what you were doing, and that was a really important learning for us. To continue that tradition, I have with me today Jake Swenson, who's our architect, and Jason McKay, who is our director of engineering, and they are more than happy to answer any questions that you may have about what we did, what we've done, or why we did some of the things that we did. All right, cool. Thank you so much. I'm going to bring Adi back up to wrap up.

Thanks a lot, Tim. That was pragmatic, super detailed, and what I thought was a really compelling story about the realities of moving to streaming data on AWS. We have about 10 minutes and change, and we are very happy to answer any questions that you may have. There are a couple of mics in the aisles over there, so in case you care to venture forth, please do and ask questions; if not, just shout them out, we'll repeat them, and we'll take it from there. The lights are a little bit bright. Any questions that we can help you with?

Q: First of all, congrats on the presentation. May I ask, how many Kinesis streams do you use? Is it one per service, or one per event?
A: In our deployment that we put into production, we have one per service, so in the case of e-commerce there are about six or seven services, and there are six or seven Kinesis streams, one for each of the services.
Q: And for the legacy side, did you use Debezium or anything like that for CDC, or DMS?
A: For legacy we use CDC, and there's one stream per table.
Q: Which platform did you use for that?
A: We just read the CDC tables. We have that data importer piece of code, which reads the CDC tables and then writes to Kinesis using the Kinesis SDK.
Q: Okay, thank you.

Q: How big is the team, and how long did it take to complete phases one and two?
A: That's a hard question, because we started the journey with only a handful of people, in terms of the exploratory work, which we probably started about 16 to 18 months ago, and as we journeyed through that we started adding more people to the team to work on the data platform, because, like I mentioned, we wanted to make sure we supported the existing business as is. On the data team we probably have about 30 or 35 people, and we split it: we probably have about eight people working on the core platform infrastructure now, and the rest of the team is working on business logic. We deployed into production about eight months ago, and we set up our Amazon accounts for the first time, after we signed the contract, around September of last year, so end to end it was probably about one year.

Q: Are you happy with the selection of Beam?
A: We selected Beam because it provided a set of high-level abstractions for us; that was one of the reasons. Another reason was that it allowed us to deploy a Beam pipeline on both Spark and Flink. So far I think we've been pretty happy with Beam: it provides what we think to be the right set of models and mechanics to handle stateful streaming, so we have the ability to manage windows, whether we use global windows or session windows or whatnot, and we can control the triggers, or how we send events downstream. So I think we've been pretty happy with our choice; you can talk to Jake and Jason in a little more detail afterwards.

Q: On-premise you used Kafka, and now Amazon is offering MSK, which is a managed Kafka service. Will you switch to Kafka?
A: I don't think we have any plans to right now. We've already invested in Kinesis, and it's been good for us. I will say, though, that the way we designed the data platform, and because we are running in our private data center today on top of Spark, is such that we could switch to Kafka as the streaming storage solution if we needed to, but we don't have any plans to right now.

Q: A quick question about your streams: you mentioned you have a stream per table; are you sharding those streams, and if you are, how are you dealing with any concurrency or ordering issues on those sharded streams?
A: So the question is whether we are sharding our streams in Kinesis, and how we deal with ordering issues. Yes, we are sharding our streams on Kinesis, and in terms of dealing with ordering issues, I'm going to ask you to talk to Jake and Jason, who have dealt with those problems down to a very low level of detail.

Q: In one of your slides you mentioned about five million Aurora writes per minute; are you using multi-master Aurora, or do you have a single Aurora node? How did you achieve that?
A: We have one single Aurora; it was a big Aurora instance.

Q: This may be a question for you, Aditya: now that there is a managed Kafka service, how does AWS think about it if you're just starting out? When would you use Kinesis versus MSK?
A: Good question, and something that we get all the time. If you've already used Kafka someplace, on-prem or even on AWS, there's lots of Kafka that runs on AWS, and you've built tooling and functionality around the Kafka ecosystem, then we would want you to do away with all the operating burden of managing Kafka clusters, brokers, and ZooKeeper and use MSK, absolutely, because at that point your cost of learning and of training up skills is pretty low, since you've already done that. If you're looking for the easiest way to get started and scale, at the lowest cost, with the most AWS integration, and I'm talking Lambda, I'm talking CloudWatch Logs emitting stuff into a stream, then I would say you start with Kinesis. At the end of the day it's our job to provide the right set of options for customers, and we'll keep doing that. Thank you.

Q: I have a question about joining those streams. You mentioned that you pull data from different tables using CDC and that you have to join them; where is this joining happening?
A: In our case that join happens inside Beam, and it depends on the business logic that we want to do. There are roughly two types. There's what we would call an inner join, where, let's say you're joining two streams (in our case we're joining quite a bit more, but let's just say two): as one event comes in, you store that event in state and wait for the second one to come in, and when both are present you emit the result out of Beam and out of Flink. You can also do an outer join, where, as one event comes in, you just write it to Aurora, for example, and then when the second event comes in you do an upsert into Aurora to fill out the nulls, if that's the right terminology. It really depends on the business logic you need to build and whether it allows you to do an outer join: you'll basically have lower latency with an outer join, but you may have incomplete data, whereas with an inner-join type of business logic you have to wait for all of the events to come in.
Q: Do you have some kind of a time limit, a window, on how long you wait, like 10 minutes?
A: That's a great question. Today, and I'm going to say things that they may correct me on, at least in the pipeline that we've deployed to production, we use a global window, so we will basically wait indefinitely. That's worked for us so far because of the way we bring data in: we guarantee data delivery, and the way we do that is with a pulling mechanism, so we manage where we have read up to in the source system and ensure that we read all records from the source; that's what we do with CDC as well. You can imagine that if you can't guarantee data delivery, say you're pushing data into Kinesis and sometimes you have an error and data gets dropped, then you may not want to select business logic which requires you to receive all of the data before you write it. So it depends on your end-to-end flow, what your data quality or data SLA is, as well as what your business logic demands.
Q: Okay, thank you.

Okay, oh my god, ten seconds, so one more question.
Q: You talked about recovery: you store everything in S3 and then you have to replay. What best practices did you follow? Do you store the data in S3 in the order in which it was received and replay it in the same order?
A: That's a great question. In terms of storing data in S3, I think we use, event time? Arrival time, actually. I think you should talk to these guys for the more detailed answer; they're going to be available afterwards, and we have nothing to do after this session, so we'll be outside if you have more questions and we're willing to chat for a while.
Q: One more question, just very briefly: what is your CDC importer written in, and what service does it run on?
A: It's written in Python and it runs on EKS.
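For reference, the inner-join pattern described a moment ago, wait until both sides of a key are present and then emit, maps naturally onto Beam's CoGroupByKey. This is a simplified bounded sketch with made-up keys and fields; the production pipeline described in the talk does this statefully over unbounded streams in a global window.

```python
import apache_beam as beam

def join_order(element):
    """Emit a merged record only when both sides are present (inner-join semantics)."""
    order_id, grouped = element
    for order in grouped["orders"]:
        for item in grouped["items"]:
            yield {"order_id": order_id, **order, **item}

with beam.Pipeline() as p:
    orders = p | "Orders" >> beam.Create([("o-1", {"total": 42})])
    items = p | "Items" >> beam.Create([("o-1", {"sku": "example-domain-com"})])
    joined = (
        {"orders": orders, "items": items}
        | "GroupByOrderId" >> beam.CoGroupByKey()
        | "InnerJoin" >> beam.FlatMap(join_order)
    )
    joined | "Print" >> beam.Map(print)
```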
Okay, awesome, thank you very much. Cool, thank you so much. [Applause]
Info
Channel: AWS Events
Views: 12,251
Rating: 4.9055119 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, ANT326-R1, Analytics, GoDaddy, Amazon Kinesis
Id: TAkcRD6OxPw
Length: 61min 3sec (3663 seconds)
Published: Wed Dec 04 2019