AWS re:Invent 2017: Big Data Architectural Patterns and Best Practices on AWS (ABD201)

Captions
All right, good morning and welcome. My name is Siva Raghupathy, and I lead the Big Data Solutions Architecture team for the Americas, which covers database, analytics, and AI services. Welcome to my session; there's also a repeat session a little later today, in a bigger room downstairs, if some of your friends weren't able to get in. A little background about myself: I've been with Amazon Web Services for the last eight and a half years. For the first two and a half years I helped build a couple of services — Amazon DynamoDB, which is a NoSQL database service, and later Amazon RDS, which is a relational database service. For the last six years I've been working with customers around the world, including amazon.com, helping them architect big data solutions on AWS. This presentation is a culmination of those experiences, with some interesting mind-map pictures along the way; you may love them or hate them, but they're very colorful.

In terms of what to expect: I'm going to go through some of the big data challenges customers are facing, and then six architectural principles that have always governed my decisions and helped me guide customers. That's probably the most crucial part of the presentation — if you take away those six points, chances are you're building a great big data application. Then I'm going to simplify big data processing by breaking it into stages — collect, store, process, and consume — and in each stage we'll go through the options, outline the services that are there, and give you some parameters for picking the right service or tool. As someone who has built services, I have some unique exposure to how we build them at AWS, so I'll also address some of the "why" questions: as a decision maker, whether you're an architect or a business person, understanding why gives you power; the how naturally follows from the why. Toward the end I'll give you three design patterns and a reference architecture — a pretty colorful one, as you'll see.

One thing I'm not going to have in this presentation is demos. I'll most likely use the entire hour, and then I'll hang out here — I can even have lunch with you if you have time — so feel free to bring me any number of questions this afternoon. I'm also not going to show any code: I feel you should get your architecture right before you write code, so this is an architecture session. Ben is doing a talk right after this with a cool demo of serverless big data, so I'll focus on architecture here. My hope is to give you two takeaways: a compass for how to navigate this space, and a few maps, which are the design patterns
and the reference architecture — those are the two things I hope come out of this presentation.

The volume, velocity, and variety of big data is ever-increasing. About ten years ago, an application that ingests around 50 terabytes of data per day, or handles 150,000 to 200,000 requests per second, was pretty rare; nowadays customers routinely build these applications, often with one or two developers over a span of a few days or weeks — if you use the right managed services you can easily accomplish that. In addition to file data and text, video and audio are becoming first-class interfaces: we've all started talking to our phones and Echo devices, so your big data applications will not only be data- or text-driven but potentially voice- and conversation-driven as well, and how you leverage the right services to build that into your pipeline is something we'll address.

In terms of evolution, big data systems are moving from batch processing to stream processing. Ten or fifteen years ago you typically ran a Hive batch job or a data warehouse batch job and produced a report every week or every month. Now, if you're doing clickstreams, rather than writing the entire clickstream to a file and copying it to HDFS or S3 for processing, you may want to put that data in Kinesis or Kafka and do real-time stream processing on top of it, so you can get answers really fast. In addition to those two, we're adding intelligence to these applications — it feels like putting a brain on top of the little Lego robot my son built for me; I took a picture of it a few years ago and I use it fondly in this talk. It almost seems like we need to take our big data pipelines and prediction-enable them: you need to be able to do batch predictions and real-time predictions on top of your fast-moving or slow-moving data.

For the last eight and a half years I've also watched AWS database services evolve from virtual machines to managed services to serverless. In the virtual machine world you created an EC2 instance, installed your database or big data software, stitched EBS volumes onto it or used the local disks, and managed all of that yourself. Instead of that, we created managed services such as RDS, where you simply say: I want a database with this much memory and this much disk, go create it for me. A control plane under the covers provisions the instances and EBS volumes, stripes the data across multiple volumes if need be, and assembles it all for you. Then we moved on to serverless. In my mind DynamoDB qualifies as serverless in the sense that, even though there are servers behind it that are highly optimized for NoSQL scenarios, you simply specify what you want: here is a table, here is the primary key for the table, and by the way, I'm going to do a million reads per second and two million writes per second against it.
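As a rough illustration of that "specify what you want" model, here is a minimal sketch using boto3. The table name, keys, and capacity numbers are hypothetical and purely illustrative; provisioning capacity this large would require account limit increases.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Declare the table shape and the throughput you want; DynamoDB
# provisions and partitions the underlying storage under the covers.
dynamodb.create_table(
    TableName="GameEvents",  # hypothetical table
    AttributeDefinitions=[
        {"AttributeName": "player_id", "AttributeType": "S"},
        {"AttributeName": "event_time", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "player_id", "KeyType": "HASH"},    # partition key
        {"AttributeName": "event_time", "KeyType": "RANGE"},  # sort key
    ],
    ProvisionedThroughput={
        "ReadCapacityUnits": 1_000_000,   # illustrative numbers only
        "WriteCapacityUnits": 2_000_000,
    },
)
```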
Once you specify that through a single CreateTable API call, under the covers we provision and partition the table across thousands of nodes and make it available for you. That gives you a lot of leverage: instead of worrying about servers and software, you worry only about your programming abstractions — you need a table, you have these keys — and you stitch that into your big data pipeline. Whenever I highlight a service in each stage, I'll call out whether it's serverless, a managed service, and so on; it's pretty important to be able to use all of them, and we'll see how.

This next slide is what I used to call the zoo of technologies, and it's both an opportunity and a challenge. There's a plethora of tools for these functions, and one of the challenges is picking which one to use. On the left side is the open source Hadoop ecosystem, which churns out an amazing array of technologies; on the right side are AWS services — any time I come to re:Invent we announce a whole bunch of new services, and I probably won't be able to fit them all on this slide next year. I also splattered the slide with AI technologies such as MXNet, TensorFlow, Keras, and Theano. The question I had while writing it was: how do I fit all of that into my big data architecture? I'll try to answer that over the course of the presentation. With that, the big data challenges look like this: is there a reference architecture? What tools should I use, how, and why? And, more importantly these days, how do I AI-enable my big data application? I went to a big data conference maybe three years ago and Spark was the answer no matter what the question was; these days AI is the answer no matter what the question is. We're all trying to figure out how to enable AI in our applications — that's what I see across our customer base — and if you're just getting started, join the party; you can get going really fast.

These are the architectural principles, and I'm going to dive a little deeper into them. The more big data applications I build, the more they look like a pipeline, and a couple of things repeat themselves. There's a flow where data gets stored and processed, and oftentimes people try to lump those together. The best practice is to decouple storage from compute. What that allows you to do is swap in new processing technology as it arrives: we didn't have Spark ten years ago, and when Spark came along and took over as the processing technology, we basically replaced Hive with Spark. If you nicely decouple your processing stage from your storage, you can simply replace the old technology with the new one and keep going. So it's pretty darn important to decouple your pipeline; ideally it should look like store, process, store, process, repeating itself. Using the right tool for the job is also paramount. At AWS, when we build a service — when I was building DynamoDB or RDS — we tend to build these tools to perform a few functions extremely well. Anytime you don't
match your application's characteristics to what the service was built for, either the application doesn't work very well or your cost goes out of whack. I'll illustrate the cost piece with an example a little later in the presentation. So what are the defining characteristics to look for? If you're choosing a stream storage subsystem, what should you look for; if you're choosing an analytics service, what should you look for — that's what I've tried to characterize here. The key point is to use the right tool for the job: one size does not fit all, at least at AWS, from what I have seen.

Leveraging managed or serverless services is key. Instead of getting an EC2 instance, installing software, and managing it yourself — which feels pretty cool for the first six months or so and then gets pretty darn boring once you have an application and versions to ship — simply using a managed service takes that muck off your engineers and lets us deal with it, so you can quickly build applications, ship them, and deliver value to your customers. Along with it come availability, scalability, and so on.

Using log-centric design patterns is one of the most important and interesting points here. If you're familiar with databases, transaction logs have been around for a long time: even if your database gets corrupted, you can always rebuild its state from the transaction log — take an empty database, start applying the log, and you get the database back. The key point is that if you keep all the data that comes into your company — think of it as a transaction log — and persist it as-is, without chopping off any legs and hands, then you can reconstruct the state of your data at any given point in time. This becomes even more important with AI and machine learning, because in many cases we don't know what questions the business folks will ask going forward, and to predict some future behavior you need enough data to build and materialize a model. If you don't have the data, you have to start collecting it after you already have the problem. So as an AI architect or a data architect, it's pretty important to keep the data — and since storage like S3 is so cheap, two or three cents per gigabyte per month, you can compress the data nicely and keep all of it without much cost; when you go to build a model, you can simply use it. To reinforce that: build immutable logs, keep everything — "don't delete anything" is the simple way of saying it, if at all possible. Then materialize views: think of your various services — Elasticsearch, your data warehouse — as simply views on top of the immutable data you put in your data lake. That's the essential idea. Things like Bitcoin work off a general ledger, which is a fancy word for a log — a big log recording all the changes going forward. General ledgers have been around in accounting for a
long time; transaction logs have been in databases for a long time; this is the same idea repeating itself.

Being cost conscious will often lead you to the right decision. At AWS, contrary to the general thinking that more expensive means better, the cheapest option is often the best service to use: if you have two options, chances are the cheaper one is the right fit, because we priced it to do that function extremely cheaply and extremely well. Last but not least — a bullet I added this year — think about AI enablement from the start. You will AI-enable your system at some point, so you might as well start thinking now: if I were to AI-enable this big data pipeline, what should I collect, and how should the pipeline look? This is probably the most important slide; if you practice these concepts, chances are you're going to get your design right, so I just want to re-emphasize it.

Now, with that in place, let's simplify big data processing. The more I think about it — I was a mechanical engineer in an earlier life, so big pipes are pretty cool to me — data goes in on one side and answers come out on the other. That's the essence of any big data processing; it's not much more complicated than that. In between there are potentially multiple stages: collect, store, process/analyze, and consume. Typically the store and process steps repeat themselves, shaping the data into a form that the downstream application can consume rapidly. What governs that is what I call the pipeline latency — how long you have to answer. If you have milliseconds to answer, what goes in between is dictated by that; if you have a week to answer, you can use a few other components. Then there's throughput — the amount of data, megabytes or gigabytes or terabytes or petabytes, processed per second — and cost. Those are very simple principles, and if you strip away all the fancy stuff, that's the essence of what you're doing: you're asking, Mr.
Customer, how fast do you want the data to be materialized? What is the view — are you going to look at this as a pretty picture or a report? Let me shape the data into the form that gets you that report rapidly. And by the way, what is the bill for this application going to be — how much are you willing to pay? Tracking your design across those parameters has always helped me guide customers to the right solution.

Along the way, here is a concept map where I introduce the idea of the temperature of your data. Any time I'm building a system, I think about whether I'm dealing with hot data, warm data, or cold data. Hot data is typically small pieces of data — think of data in cache or RAM. Item sizes tend to be small and access is very frequent; the IOPS per gigabyte, if you want to think about it technically, is pretty intense. Toward the colder side, the data tends to be large — big files — and in some cases you don't care whether it comes back in milliseconds or seconds; if the data shows up in a few hours, that's fine. For example, if you put a data set in Glacier, the coldest tier, it comes back in three and a half or four hours, and that's fine, because you're just pulling out an archived data set and you can wait that long. Having a sense of whether you're dealing with hot, warm, or cold data often lets you pick the right tools; I just want to introduce the concept here.

Now let's go into the collection stage. In the collection stage you'll have applications using ODBC/JDBC interfaces writing to databases, or a data center connected over Direct Connect. I'm going to use a lot of small icons; the slides will be available on SlideShare as well as YouTube, so don't squint — I walked around the room yesterday and it was pretty small for the folks in the last row — the key concepts are in bold letters, so you should be able to follow. Data structures and database records are what I'd call transactional data. Then there's file data: say you have a clickstream analytics system — in the application tier you're writing files and then using Apache Flume or something else to move them onto HDFS or S3; in some cases this could be video clips or audio recordings as well. And there's streaming data, typically generated by devices and sensors: it's essentially time series data — an instrument is rapidly measuring temperature or pressure or something else and sending that stream to an IoT platform, which eventually sends it to AWS, or, if the devices are connected, they send it directly. Those are the three types of data you'll be dealing with. Now, what storage type should you use for these three types of data?
Typically, transactional data goes into in-memory, SQL, or NoSQL databases; file data goes into a file or object store such as HDFS or S3; and events go into some kind of stream storage. Let me zoom in on stream storage. The stream storage subsystems our customers most commonly use are Apache Kafka — a high-throughput, distributed streaming platform for managing streaming data that gives you all kinds of control; we'll compare various attributes of Kafka with the others — and Amazon Kinesis Streams, a fully managed service for storing streaming data on AWS. We've also introduced Amazon Kinesis Firehose, which is built on top of Kinesis and not only collects the data but also transports it to a destination data store: write streaming data to Kinesis Firehose and it will deliver it to S3, Redshift, or Elasticsearch. You specify the destination and how frequently — or how large a buffer must grow — before it pushes the data, and it does the rest automatically.

Why stream storage? It decouples producers from consumers, it provides a persistent buffer, and it lets you collect multiple streams of data: multiple producers — the red, green, and violet producers on the left side — all write to one stream, which could be a Kinesis stream or a Kafka topic. I've also highlighted DynamoDB Streams, which you could use in some cases, but that's a minority case — happy to discuss it offline. Essentially, when the producers write to the stream, it nicely decouples them from the consuming applications, so producers and consumers can each go at their own speed. It also preserves client ordering, which is a defining characteristic of a stream. If the red producer says "I am red" — key equals red — and writes packets 1, 2, and 3, the Kinesis or Kafka partitioning ensures all of those red packets land on shard 1 (or partition 1). A downstream application — consumer 1 and consumer 2 consuming in parallel — can therefore assume the red packets are always in partition 1 of stream 1 and do computations like "give me the max" or "give me the average." There's one downside: if your put rate for a single key exceeds what one shard can handle, you'll run into trouble, and the only option is not to do that, and instead run a downstream application to do the sorting. That's the essential characteristic. The fancy name for this is streaming MapReduce: the map and shuffle are already done by the underlying framework the moment you specify key equals red or key equals blue (see the sketch after this paragraph).
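A minimal sketch of that partition-key idea with boto3; the stream name and payloads are hypothetical. All records sharing a PartitionKey land on the same shard and stay in arrival order.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Two logical producers, "red" and "blue". Records sharing a partition
# key are routed to the same shard and kept in order, so a consumer of
# that shard can compute per-key aggregates directly.
for seq in range(1, 4):
    for color in ("red", "blue"):
        kinesis.put_record(
            StreamName="clickstream",   # hypothetical stream
            PartitionKey=color,         # "key equals red / key equals blue"
            Data=json.dumps({"producer": color, "seq": seq}).encode(),
        )
```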
Oftentimes customers ask me: what about Amazon SQS — when should I use that? It also decouples, it's a persistent buffer, and it also lets you collect multiple streams, but at least in the standard queue there's no client ordering and no streaming MapReduce — there's no notion of separating packets into partitions belonging to a single producer — and there's no parallel consumption: once one client consumes a payload, a mechanism called the visibility timeout makes that payload unavailable to other consumers. If you don't need those properties, SQS is fine; if you need multiple consumers and ordering, it's not the right fit. We also recently introduced FIFO queues, which do preserve client ordering — I'll compare and contrast those in a moment, and there's a small FIFO sketch below — and there's a workaround some customers use where you write to an SNS topic and have a Lambda function or queues subscribe to it. Those are edge cases, but I wanted to put them on the slide in case you're designing a system and want to consider them. Chances are Kinesis or Kafka is going to be ideal, for reasons I'll point out in a moment; I included these just as a reference.

Here are the various attributes — a very busy slide that's probably impossible to read from where you sit, but I'll step through it. In the left column I've listed attributes such as: is it a managed service? Is there guaranteed ordering? What about de-duplication? What delivery semantics does it offer — at least once, at most once, or exactly once: if I write a packet, how many times will a consumer see it? What is the retention period — if you write to SQS or Kinesis, how long does the system keep those packets? What are the availability characteristics, the scale and throughput? For example, Amazon Kinesis is a fully managed service, it guarantees ordering, and it offers at-least-once semantics: in edge cases Kinesis will deliver two records for the same put, so if you're building a billing system and don't want to double-bill your customers, you should de-dupe those records using something like DynamoDB. At-most-once or exactly-once processing isn't automatically built in; in Kafka it apparently is, if you use the right client and transactions. Customers usually tell me, "don't give me a menu, pick one for me, I'm getting a headache." If you want me to pick one, Amazon Kinesis is the best place to start, because it's fully managed and probably the cheapest once you include administration in the price. But Kafka gives you nearly infinite control: at-least-once, exactly-once, or at-most-once processing, and a configurable retention period — whereas Kinesis retains data for 7 days; if you don't take the data out in 7 days, it goes away. Customers go through this comparison often, so I wanted to document it on a slide you can keep as a reference.
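For completeness, a minimal sketch of sending to an SQS FIFO queue with boto3; the queue URL and IDs are hypothetical. MessageGroupId scopes the ordering guarantee and MessageDeduplicationId handles de-duplication.

```python
import json
import boto3

sqs = boto3.client("sqs")

# FIFO queues preserve order per MessageGroupId and de-duplicate on
# MessageDeduplicationId; standard queues offer neither guarantee.
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo",  # hypothetical
    MessageBody=json.dumps({"order_id": "o-1001", "amount": 42}),
    MessageGroupId="customer-42",      # ordering scope
    MessageDeduplicationId="o-1001",   # dedupe key
)
```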
If you compare this with SQS FIFO queues: the last time I looked, you could only do about 300 TPS into a FIFO queue — I'm not sure whether the team has updated those numbers; we tend to increase these limits over time — so if you're doing more than 300 TPS, FIFO queues may not be the solution for you. Instead you should be thinking of Kinesis, or standard SQS if you don't need guaranteed ordering. At the bottom of the slide I've called out the hot-versus-cold bar: if you're dealing with colder data sets, SQS often makes a lot of sense; streaming data is generally hot, and all of these services handle hot data sets nicely.

Next, the file and object store: which should you use? S3, for those who don't know it, is a managed object storage service — we call it storage for the internet. You can put any amount of data and any number of files into S3, and it is perhaps the best place to put your file data. The biggest recommendation here: if you have file data, put it in S3 before you consider anything else; there needs to be a good reason not to. Why? It's natively supported by pretty much every big data framework — I go to big data conferences where the presenter of some framework simply starts with "assuming your data is in S3," without even explaining what S3 is; that's how popular and prevalent it is. It also nicely decouples storage from compute: if you put your data in HDFS, you need those machines humming along for your file system to be available, whereas with S3 you don't. You read the data from S3, process it, write the results back to S3, and simply shut down the cluster. You can use Spot Instances, and you can match the instance type to the workload — EC2 has compute-optimized, memory-optimized, and storage-optimized instances — rather than worrying about the instances also providing storage, in which case you'd have to pick ones with bigger local disks. S3 is also designed for eleven nines of durability, and you don't pay extra for data replication: S3 keeps multiple copies of the data in multiple data centers within the same region, and you don't even have to think about it. With HDFS, by contrast, you pay for replication — if you want three-way replication, you need the disks for it — whereas the S3 price already includes keeping multiple copies within the region (you do pay to replicate across regions, but not within one). You can do server-side encryption with keys managed by AWS through KMS, you can use SSL in transit, and it's fairly low cost (a small encryption sketch follows below).
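A minimal sketch of a server-side-encrypted put with boto3; the bucket, object key, and KMS key alias are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Server-side encryption with a KMS-managed key: S3 encrypts the object
# at rest and decrypts it transparently for authorized reads.
with open("part-0001.gz", "rb") as body:
    s3.put_object(
        Bucket="my-data-lake",                           # hypothetical bucket
        Key="raw/clickstream/2017/11/28/part-0001.gz",
        Body=body,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/datalake-key",                # hypothetical key alias
    )
```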
Oftentimes the question is asked: what about HDFS — when do I use that? More and more, HDFS is used as an intermediate store, a cache if you will, for your hottest data sets: you keep the persistent data in S3, read a copy of it, do your processing, and write intermediate results to HDFS, where in some cases the data is read multiple times. Use it as a cache, not necessarily as the persistent store. S3 also has tiers — S3 Standard, S3 Standard-Infrequent Access, and Amazon Glacier — plus S3 Analytics, a service that analyzes your bucket and tells you which storage class to use for which data sets. That's how you should think about tiering: as data ages, move it from S3 to Glacier, and if you use it only infrequently, put it in S3 Infrequent Access (see the lifecycle sketch below).

Now, what about caches and databases? Amazon ElastiCache is a managed service that supports the Memcached and Redis engines under the covers, and we've also introduced DynamoDB Accelerator (DAX), an in-memory cache for DynamoDB. If you write a DynamoDB application and point it at a DAX endpoint, writes pass through the cache to DynamoDB and reads are served from the cache — you don't even need to change the application, you simply switch the endpoint. Why am I talking about databases in a big data talk? Because if you have high-velocity data — millions of writes per second — you pretty much need to put it in DynamoDB; it's probably the best managed service on AWS for that. In some cases a cache in front matters a lot: I was working with a customer in the advertising space who serves ads, and for them DAX was very important, because for ad lookups the user-profile table — a big table — gets accessed millions of times per second. It's a fairly read-intensive workload, and having a cache in front of it saves a lot of money.

RDS, the relational database service, has multiple engines: Amazon Aurora, MySQL, PostgreSQL, and so on, including Oracle and SQL Server. But using a relational database as the high-velocity ingest store for a big data application is an anti-pattern. The reason is that at millions of writes per second you pretty much have to shard your data across multiple instances, and sharding can be a pain. Also, with DynamoDB you can provision a million writes per second and only ten reads per second and pay for exactly that, whereas with a relational database, if you supersize it for writes, you automatically supersize it for reads too, so you don't save any money. Financially as well as technically, it's typically a bad choice. Instead, your database tier should look something like this: in-memory, SQL, NoSQL, search, and graph, working together.
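Returning to the S3 tiering point above, a minimal sketch of a lifecycle rule with boto3; the bucket name, prefix, and day counts are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Age data down the storage tiers automatically: Standard-IA after
# 30 days, Glacier after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-raw-clickstream",
                "Filter": {"Prefix": "raw/clickstream/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```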
When I put up a slide like this, the immediate next question from customers is: how do you keep all of these stores in sync? Here's an idea. In this example an application is writing NoSQL data to DynamoDB. DynamoDB has update streams, so I simply hook up a Lambda function that watches for those changes and applies them to Elasticsearch as well as ElastiCache (a sketch of that handler follows below). If there's a leaderboard — say there's gameplay involved in your application — you write all the gameplay into the NoSQL store, a Lambda picks it up from the update stream, and the leaderboard data structure is populated automatically. Even though it's slightly eventually consistent, this is a very scalable way of doing it, and for many applications it's totally fine; there are very few use cases it doesn't fit. It's also a great example of a materialized view on top of an immutable log: another Lambda function simply takes all the data and persists it to S3 — that's the very bottom of the diagram, your immutable log. If you ever want to go back and find out what happened in this gaming scenario, you don't need to have persisted it all in the database: everything that came into your application is durably stored on S3, which lets you do things later like rebuilding state or debugging what happened in your infrastructure. So that's the cached view, the search view, and the immutable log — the three things we just went through.
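A minimal sketch of that stream-to-materialized-view pattern, assuming a hypothetical leaderboard table and ElastiCache Redis endpoint; error handling and the Elasticsearch/S3 fan-out are omitted, and the redis-py package is assumed to be bundled with the Lambda deployment package. The event shape is the standard DynamoDB Streams event delivered to Lambda.

```python
import os
import redis  # redis-py, bundled with the deployment package

# Hypothetical ElastiCache Redis endpoint, passed via environment variable.
cache = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), port=6379)

def handler(event, context):
    """Apply DynamoDB Streams changes to a Redis leaderboard (materialized view)."""
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"]["NewImage"]
        player = new_image["player_id"]["S"]
        score = float(new_image["score"]["N"])
        # A sorted set keeps the leaderboard ordered by score.
        cache.zadd("leaderboard", {player: score})
    # A sibling Lambda (or Firehose) would persist the raw records to S3
    # to form the immutable log described above.
    return {"records_processed": len(event["Records"])}
```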
So which data store should you use? There are typically four considerations. What governs the decision is your data structure — are you dealing with a fixed schema, JSON, or key-value data? — and your access pattern. One of the things I've learned over the years, both building databases and building big data applications, is that you eventually store data in the form you access it. That is the essence of all query processing, index building, and so on. It sounds simple, but if you follow that paradigm it's really helpful, especially at very high scale: if you don't store the data in the form you access it, you don't have time to build the data structures when the question arrives. Customers get into all kinds of trouble materializing a view at the moment the query comes in, when they should have already stored the answer — stored the data in the form it will most commonly be accessed. This plays out everywhere, whether you're picking hash keys and range keys in DynamoDB or sort keys and distribution keys in your data warehouse; the essence of big data processing is storing data in the form you will access it. Beyond that, knowing whether you're dealing with hot, warm, or cold data, plus the cost, always helps you pick the right choice, at least on AWS.

Here's a simple example based on data structure: with a fixed schema, SQL or NoSQL works fairly well; with JSON, NoSQL or search works well; for key-value data, in-memory or NoSQL makes a lot of sense. Graph databases are getting very popular because the cost of traversing deep networks keeps going down, and more and more customers want to work with graphs — something I've learned this year — so materializing a graph view of your data is becoming more important. Similarly for access patterns: for simple gets, NoSQL or in-memory stores work, depending on how fast you need the data; simple one-to-many or many-to-many relationships are handled nicely by NoSQL, and at large scale you can use NoSQL instead of classic SQL. Where you still get into trouble is multi-table joins and transactions: SQL databases do those fairly well, and that's one case where you still need good old SQL. It's a problematic use case at very high scale, because if you need that functionality you have to shard your data across multiple instances — there's no other way. If you're dealing with faceting and search — on the Amazon website there's a pane on the left saying "show me only products that ship with Amazon Prime"; that's faceting, the attribute you want to filter or drill down on — then putting the data in a search store such as CloudSearch or Elasticsearch is the right answer (a small faceting sketch follows below). And obviously, for multi-level graphs, graph databases work well.

Here's another concept map. Somebody who design-reviewed my slides told me it's the ugliest slide they had ever seen, and last time I said that in a talk somebody in the audience asked why I show it, then. The reason is that this is how I think: whenever you're picking a service, you're dealing with a spectrum of services in your head, and in some cases there's a nice overlap between several of them. On the x-axis I have request rate, cost per gigabyte, latency, and data volume; on the y-axis I have the structure of the data, low to high. It's a mental map — by no means accurate — but it gives you a continuum of services. Why did we invent so many? It turns out we need all that functionality across the spectrum, and as a designer you're trying to figure out what's right for your use case; where more than one fits, you have to break the tie, and I'll show you how with an example in a moment. Which data store should I use? Again, a very busy slide: I've taken my experience from hundreds of design reviews and characterized these data stores across various dimensions — average latency, typical data volume, typical item size, and so on.
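To make the faceting point above concrete, a rough sketch of a terms aggregation using the elasticsearch Python client; the endpoint, index, and field names are hypothetical, authentication against an Amazon Elasticsearch Service domain is omitted, and the exact client signature varies by version.

```python
from elasticsearch import Elasticsearch

# Hypothetical Amazon Elasticsearch Service domain endpoint.
es = Elasticsearch(["https://search-mydomain.us-east-1.es.amazonaws.com"])

# "Products that ship with Prime", faceted by brand: a filtered query
# plus a terms aggregation that returns per-brand counts.
response = es.search(
    index="products",
    body={
        "query": {"term": {"prime_eligible": True}},
        "aggs": {"by_brand": {"terms": {"field": "brand", "size": 10}}},
        "size": 0,  # we only want the facet counts, not the documents
    },
)
for bucket in response["aggregations"]["by_brand"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```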
To pick one cell from that slide: a DynamoDB get typically returns in about two to four milliseconds on average — single-digit milliseconds — whereas if you need microseconds, you should use DAX or ElastiCache. SQL databases typically return answers in a few milliseconds, one or two depending on the complexity of the query, and in some cases it can take seconds. At the very bottom I've listed availability characteristics: DynamoDB keeps three copies of the data in three different data centers, and when you write to it, it synchronously writes at least two copies in two data centers before acknowledging, in a few milliseconds. If that kind of durability matters, DynamoDB is a good pick; RDS Aurora is also highly durable, writing data across three AZs. So this is another map to consult when picking your data store; I won't go through every cell.

Here's an example — an actual email somebody sent me, a developer at amazon.com: "Should I use S3 or Amazon DynamoDB? I'm currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 terabytes." When I poked around, it came down to this: a request rate of 300 writes per second, an object size of about 2 KB, roughly 1.5 TB stored per month, and 777 million objects per month. Now, a quiz — I'll stop talking: how many people think we should use S3? Raise your hands. About ten hands, and somebody's not sure — that's okay, it's a fun exercise. How many think it should be DynamoDB? A few more hands. Are you sure? You can still change your mind.

There's a tool called the AWS Simple Monthly Calculator — how many of you have seen it? It's nicely buried somewhere — okay, a lot of hands, awesome. If you fire that tool up and plug in these parameters — there's a tab for DynamoDB where you enter the total data set size, item size, and writes per second, and another tab for S3 where you enter the same numbers; the fonts on this slide are tiny, but assume you're plugging them in, and I'll show a bigger font on the next slide — it gives a result like this. For the folks who said S3 wins: you were right on one dimension, the storage cost is only $34. But it turns out the PUT cost is $3,888. Is that a surprise? When we designed S3, we designed and priced it for bigger files. S3 will happily take small payloads — object sizes can be anywhere from one byte on up — but the PUT cost may overwhelm you if you're writing very tiny objects. The best practice is to aggregate the data before writing it to S3; we priced the product that way (the arithmetic behind these numbers is sketched below).
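A back-of-the-envelope version of that calculation, using the then-current (2017) us-east-1 list prices as assumptions: roughly $0.005 per 1,000 S3 PUT requests and about $0.023 per GB-month for S3 Standard, versus DynamoDB write-capacity pricing of about $0.00065 per WCU-hour and $0.25 per GB-month of SSD storage.

```python
# S3 side: 300 puts/sec of ~2 KB objects over a 30-day month.
puts_per_month = 300 * 86_400 * 30             # ≈ 777.6 million PUTs
s3_put_cost = puts_per_month / 1_000 * 0.005   # ≈ $3,888
s3_storage_cost = 1_536 * 0.023                # 1.5 TB ≈ 1,536 GB → ≈ $35

# DynamoDB side: each write unit covers a 1 KB item, so a 2 KB item
# consumes 2 write units.
write_units = 300 * 2                          # 600 provisioned WCUs
ddb_write_cost = write_units * 0.00065 * 730   # ≈ $285 per month
ddb_storage_cost = 1_536 * 0.25                # ≈ $384 per month

print(f"S3:       puts ≈ ${s3_put_cost:,.0f}, storage ≈ ${s3_storage_cost:,.0f}")
print(f"DynamoDB: writes ≈ ${ddb_write_cost:,.0f}, storage ≈ ${ddb_storage_cost:,.0f}")
```

For this 2 KB workload the DynamoDB total comes out far below the S3 PUT bill, which is the point of the example; flip the item size to 32 KB and the comparison flips with it.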
When I was building this slide, I actually made a mistake: I originally used S3 Infrequent Access, and its request pricing makes it a lot more expensive than $3,888 — close to $7,000, if I read it correctly. So actually do the pricing exercise. Oftentimes, twenty minutes into a design review, I pause and tell the customer: let's just compute the cost of the solution we're putting together. Any time you do that, one of two things happens: either the customer gives two thumbs up and we move on, or we realize we picked the wrong service — something isn't adding up, or the requirements are off — and we change the service or the design. Now, any time I present this in front of hundreds of people... I showed it to the S3 product manager, and the person was really upset, so I added a compensating scenario: I increased the payload to 32 KB and looked at what happens. In that case S3 obviously wins, and DynamoDB loses on both throughput and storage cost. DynamoDB uses SSD under the covers, and its storage costs 25 cents per gigabyte per month; and when you provision write capacity, each write unit covers a 1 KB item, so a 32 KB payload consumes 32 units per write, and the price becomes much higher. In this case, too, pricing the solution clearly points you to the right choice — a great example of why you should always be cost conscious when you design systems.

Now let's move to the analytics stage; we've covered collection and storage. In terms of predictive analytics, Amazon has been making deep investments in AI for the last 20 years. When you go to the website and see "people who bought this also bought that," or we suggest other things you may want to look at, we're using machine learning. When we move boxes around our fulfillment centers, the routing is done by machine learning. Amazon Go is a shopping experience where you simply walk into the store, pick whatever you want, and walk out, and we automatically bill you — we're doing image processing, mapping that to your Amazon account, and billing you. We've been at this for about twenty years, and we're exposing it to AWS customers in three forms. First, API-driven services targeted at developers: if you want to enable speech, you can use Amazon Lex, Amazon Polly for text-to-speech, and Amazon Rekognition if you have images or video clips and want to recognize faces and objects — you can build all of that into your big data application. Second, machine learning platforms such as Amazon Machine Learning, a fully managed service for linear regression and logistic regression — classification, if you will. Spark ML is also very popular; you can run Spark ML on top of EMR, and a lot of customers do
logistic regression, linear regression, multinomial regression, and so on with Spark ML. Third, we've shipped a Deep Learning AMI — AMI stands for Amazon Machine Image — which comes bundled with frameworks such as MXNet, TensorFlow, Caffe, and Keras, a Python framework for high-level programming on top of TensorFlow and others. If you have data scientists, they can launch the Deep Learning AMI and run their notebooks there, or you can run an Auto Scaling group of these AMIs on GPU instances to serve predictions at scale. The API-driven services are targeted at developers, while the ML platforms and the Deep Learning AMI are aimed at data scientists and deep learning experts.

For interactive and batch analytics: Amazon Elasticsearch Service is a fully managed service for running Elasticsearch. Amazon Redshift is a managed data warehouse, and the recently introduced Redshift Spectrum lets you access data in S3 without loading it into Redshift: Redshift uses local disks, while Spectrum runs queries on a fleet dedicated to those queries, so you put your data in S3, create a schema in the Glue catalog, and a Redshift cluster can join a table in S3 with data in Redshift and materialize the result — interactive analytics. Athena is a serverless query service: you don't worry about servers at all — put your data in S3, create your Glue catalog schema, send your queries to Athena, and it materializes the results for you. With Amazon EMR you get many managed frameworks — Pig, Flink, Presto, Tez, and so on, about 14 supported out of the box — and you can use bootstrap actions to install a large number of other big data frameworks; EMR manages things like node replacement and also comes with EMRFS, a consistent file system on top of S3, so you get a lot of advantages from running it.

What about streaming and real-time analytics? For streaming, Spark Streaming on top of EMR makes a lot of sense. Kinesis Analytics is a fully managed service for running SQL on streaming data, so you can do things like windowed functions over fast-moving data, compute a one-minute or ten-minute metric, and push it into another Kinesis stream for downstream processing. The KCL, the Kinesis Client Library, is a library for stream processing: it manages checkpoints and lets you build applications that run in an Auto Scaling group across multiple AZs. And Lambda is a fully managed service for running functions in the cloud — serverless. Lambda can hook up to events created when you write data to S3, or to Kinesis streams or DynamoDB streams: when you write to the stream, the Lambda function is invoked automatically and handed a batch of records, and you can do stream processing without running any servers. Obviously there are a lot of servers under the covers, but you simply write your function, hook it up to the stream, and process the data (a minimal handler sketch follows below).
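A minimal sketch of such a Lambda handler for Kinesis events; the per-key counting is a placeholder for real processing. Kinesis delivers record payloads base64-encoded in the standard Lambda event shape.

```python
import base64
import json
from collections import Counter

def handler(event, context):
    """Count events per partition key from a batch of Kinesis records."""
    counts = Counter()
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        counts[record["kinesis"]["partitionKey"]] += 1
        # ... real processing would aggregate, enrich, or forward `payload`
    # In practice you would write these partial aggregates to DynamoDB,
    # another Kinesis stream, or CloudWatch metrics.
    print(dict(counts))
    return {"batch_size": len(event["Records"])}
```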
So which analytics should you use? For batch analytics, EMR — Elastic MapReduce — makes a lot of sense, with Hive, Pig, or Spark. For interactive analytics, which typically takes seconds, Amazon Redshift, Athena, or EMR with Presto or Spark makes sense. For stream processing we have Spark Streaming, Kinesis Analytics, the KCL, and Lambda. For predictive analytics — real-time or batch predictions — you can use Amazon Machine Learning, or the API-driven services: if you're dealing with images, you can use Rekognition to recognize objects in your pipeline, and I'll show where that fits a little later.

Here again is a comparison of the various stream processing technologies, and one thing to note: how many of you know that Spark Streaming on EMR is effectively a single-AZ service? When you run an EMR or Hadoop cluster, it typically runs in a single Availability Zone, so if you're running a highly available application you need a mechanism to launch another cluster if that zone goes down. Whenever we build services or solutions at AWS, we tell customers to build for high availability, which means planning for multi-AZ in case a node or an AZ fails. With the KCL, if you use an Auto Scaling group and run the application across multiple AZs, it's multi-AZ; Kinesis Analytics and Lambda are inherently multi-AZ — even if an Availability Zone goes down, we automatically compensate and the application keeps running as if nothing happened. Again, I'll leave this slide for your future reference.

Which analytics tools should you use? If you have a classic data warehouse workload, the best tool is Amazon Redshift. If you want to combine data warehousing with a data lake — you're putting files in S3 and doing joins across them and the warehouse — Redshift Spectrum makes a lot of sense. If you're simply putting files in S3, want a Glue catalog on top of them, and don't want to provision servers, Amazon Athena makes a lot of sense: you create a catalog and query the data directly (a small query sketch follows below). With EMR you also have Presto, Spark, and Hive as options. Athena is serverless, whereas with the others you have to think about servers and provisioning the right instance types.
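A minimal sketch of sending a query to Athena with boto3; the database, table, and output bucket are hypothetical, and in practice you poll for the result with the returned execution ID.

```python
import boto3

athena = boto3.client("athena")

# Athena reads the table definition from the Glue catalog and scans the
# files in S3 directly; results land in the specified S3 location.
response = athena.start_query_execution(
    QueryString="""
        SELECT page, COUNT(*) AS views
        FROM clickstream_raw          -- hypothetical Glue table over s3://my-data-lake/raw/
        WHERE dt = '2017-11-28'
        GROUP BY page
        ORDER BY views DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "datalake"},  # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
# Poll get_query_execution / get_query_results with this ID for the answer.
print(response["QueryExecutionId"])
```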
Moving on to ETL: AWS Glue is a fully managed ETL service. You point it at your various data sources, it automatically crawls them, infers what the data is, and creates a catalog for it in the Glue Data Catalog; that catalog is accessed by Spectrum, Athena, EMR, and other services, so you can join against the data it has cataloged. Glue also generates boilerplate ETL code, which you can customize, and when you're done it runs the ETL jobs on clusters that it creates and manages — a fantastic developer-oriented platform for these workloads. There's also a plethora of other tools, such as Informatica or Attunity and others, that do fairly amazing things, and you can use those platforms as well.

Now, putting this somewhat together — we're very close. On the consume side I'm focusing a little on AI: if you're building AI applications, you can use the Amazon AI services to build your models, train them using the Deep Learning AMI and your Jupyter notebooks, and deploy them on an ECS cluster. For classic visualization and analytics, you can use QuickSight, Tableau, Looker, MicroStrategy, and others — I didn't have space on the slide for all the amazing partner services, but this gives you an idea of what else you can use. The data science notebooks are more data-science- and DevOps-specific, while business users typically use tools like Tableau and QuickSight to point at their data and slice and dice it. And that is the grand slide that combines all of it — it will show up on SlideShare as well.

Now let's go quickly through three design patterns. Here's another concept map that puts the pieces together: your data can land in three kinds of data stores — a hot store such as Kinesis, a warm store such as DynamoDB, and a colder store such as S3 — and you can apply various processing technologies on top. If you're running Spark Streaming on top of Kinesis, that's a real-time application. If you're putting data in S3, copying it to Redshift, and running queries that come back in seconds, that's interactive analytics. And you can do classic batch processing using Hive or Tez. Combining data stores of different temperatures with processing technologies that run fast or slow gives you three different types of analytics.
To analyze the data in Kinesis, you are going to use either a Kinesis Client Library application, AWS Lambda, or Spark; we've compared and contrasted those, so based on your needs you pick one. Then, if you are doing real-time predictions, say you have fast-moving data, your temperature gauges are sending measurements, and you want to consult your machine learning model on the fly to figure out whether this instrument or this turbine is going to have a problem so you can dispatch people. You do real-time analytics by looking up your model with the readings you are getting and asking whether it sees any worrying patterns, and if it does, you send an alert to SNS, which notifies the appropriate person or people. You can also write a simple Lambda function whose only job is to take the data and write it to S3, so you have a persistent copy of your data for further processing; there is a sketch of that pattern below. This next piece is an important layer as well: decoupling. If you have downstream applications painting pretty pictures, they usually like to look at a database or Elasticsearch to draw those pictures; Kibana works nicely with Elasticsearch, for example. So you introduce a decoupling layer, the blue box with ElastiCache, Amazon DynamoDB, and RDS, which nicely decouples the stream-processing layer and makes the data ready for the downstream applications to use. And again, in some cases you want to fan out: say you have twenty consuming teams, and typically with a Kinesis stream each shard supports roughly one producer and two consumers, so you need to scale the shards; in some cases you take the data and put it into three or four Kinesis streams for other downstream consumers. That is how you scale out Kinesis, in addition to scaling out the shards. In some cases you want to give special-purpose data to a specific team: a team doing A/B testing on clickstream data only wants a subset of the data, so you pipe that off to them downstream. Pretty much every real-time analytics application is based on a framework like this, so if you know this pattern, you can solve pretty much any real-time analytics problem. Similarly for interactive and batch analytics: you can put your data into Kinesis Firehose and transport it to Elasticsearch or Redshift, you can put your data into files in S3 and have Amazon EMR or Athena process and consume it, or you can run classic Hive or Pig on top of the S3 data. That is interactive and batch analytics. Putting this together, you can also introduce predictions into your pipeline, either batch predictions using Amazon Machine Learning or real-time predictions, and those can be incorporated into the answers you give to your customers. Now, what is a data lake? A data lake to me is a combination of interactive, batch, and real-time analytics, together with materialized views if you will; if you take everything we did before and put it in place, that is in essence your data lake.
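Here is a hedged sketch of the alerting-plus-persistence piece of that real-time pattern, written as a Python Lambda consumer on the Kinesis stream; the topic ARN, bucket name, threshold, and record fields are all made-up examples, and the talk itself does not prescribe this exact code.

```python
import base64
import json
import boto3

sns = boto3.client("sns")
s3 = boto3.client("s3")

# Hypothetical resource names and threshold, purely for illustration
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:turbine-alerts"
ARCHIVE_BUCKET = "example-datalake-raw"
TEMPERATURE_THRESHOLD = 90.0

def handler(event, context):
    for record in event["Records"]:
        reading = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Real-time check: notify the on-call team when a reading looks anomalous
        if reading.get("temperature", 0.0) > TEMPERATURE_THRESHOLD:
            sns.publish(TopicArn=ALERT_TOPIC_ARN,
                        Subject="Turbine temperature alert",
                        Message=json.dumps(reading))

        # Keep a persistent copy of every record in S3 for later batch processing
        key = (f"sensor/{reading.get('device_id', 'unknown')}/"
               f"{record['kinesis']['sequenceNumber']}.json")
        s3.put_object(Bucket=ARCHIVE_BUCKET, Key=key,
                      Body=json.dumps(reading).encode("utf-8"))
```

In practice the per-record S3 writes would usually be batched through Kinesis Firehose instead, which is exactly the decoupling point made above.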
So to me, a data lake is a design pattern that lets you bring together your interactive, batch, and real-time or stream analytics, along with the materialized views that put the data into the form your downstream applications need, and we walked through the end-to-end stages of how to do that. A couple of things are important here. Metadata: keeping track of the metadata in your applications matters, and storing that metadata in a Glue Data Catalog or a classic Hive catalog makes a lot of sense. And last but not least, security and governance: encrypting your data with the appropriate keys, using the appropriate services for authentication and access control, such as directory services and Apache Ranger, and using CloudWatch Logs end to end are all pretty important, along with the compliance tools. Putting it all together, this is the grand picture of your data lake reference architecture. With that, let me summarize: when you build big data systems, build decoupled data pipelines, use the right tool for the job (we went through how), use log-centric design patterns and materialized views for scaling your applications, always be cost conscious so you make the right decisions, and use AI and ML to enable your applications. Thank you so much for coming out this early, and I hope that was helpful. [Applause]
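As a closing illustration of the metadata and encryption points in that summary, here is a minimal sketch that looks a table up in the Glue Data Catalog and runs a serverless Athena query over the underlying S3 data with KMS-encrypted results; the database, table, result bucket, and key ARN are hypothetical placeholders, not values from the talk.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Hypothetical catalog entries, result location, and KMS key
DATABASE = "datalake_db"
TABLE = "curated_events"
RESULT_LOCATION = "s3://example-athena-results/"
KMS_KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000"

# Metadata: confirm the table is registered in the Glue Data Catalog
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)
columns = [c["Name"] for c in table["Table"]["StorageDescriptor"]["Columns"]]
print("catalogued columns:", columns)

# Governance: run the query with results encrypted under a KMS key
athena.start_query_execution(
    QueryString=f"SELECT count(*) AS events FROM {TABLE}",
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={
        "OutputLocation": RESULT_LOCATION,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_KMS",
                                    "KmsKey": KMS_KEY_ARN},
    },
)
```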
Info
Channel: Amazon Web Services
Views: 52,944
Rating: 4.8934755 out of 5
Keywords: AWS re:Invent 2017, Amazon, Analytics & Big Data, ABD201, Big Data
Id: a3713oGB6Zk
Length: 59min 55sec (3595 seconds)
Published: Tue Nov 28 2017