AWS re:Invent 2017: Migrating Your Traditional Data Warehouse to a Modern Data Lake (ABD327)

Captions
All right, good morning everybody, I think we're going to get started. First off, what an amazing turnout; thanks for coming here bright and early, bright and early for Vegas of course. I'm delighted to be here to talk to you about Redshift and what we've been up to recently, with recent launches as well as some upcoming launches. I should introduce myself: I'm Vidya Sreenivasan, I'm the general manager for Redshift, and I'm co-presenting this session with 21st Century Fox. We have Balaji from Fox, and he is going to talk to us about their data warehousing journey from an on-premises solution to Redshift on the cloud. He's also going to touch on some of the lessons learned along the way, and some future thoughts on how they expect their architecture to evolve. It's a pretty full session, so let me get started.

Before I jump into Redshift itself, I just want to orient ourselves, because we have a wide set of people with different backgrounds, in terms of where Redshift really fits into our broader big data portfolio. If you look at any big data application, it typically has three phases: there is a collection phase where data gets acquired and collected, there is a storage tier where you have to store the data, and then finally you analyze it using one of our many solutions. I'm just going to call out a couple of these that are relevant for Redshift. For example, Kinesis Firehose is a service that allows you to directly stream data, hot data that's coming in from IoT devices and such, directly into your Redshift cluster. When it comes to storage, the key data store for big data applications has become S3, because it's scalable, it's durable, it's fairly cheap, and it has pretty much become the default landing zone for big data applications. When it comes to analytics, we have a bunch of different options: Redshift of course is the data warehousing service, but we also have other offerings such as Amazon EMR, which is essentially a managed solution to run Hadoop, Spark, Hive, and so on; Amazon Athena, which is essentially a serverless query service for directly querying data on S3; and QuickSight, which is our BI solution. To round this out, we have a couple of solutions at the bottom here. One is DMS, the Database Migration Service; as the name suggests, it's used for migrating data from various sources, be it OLTP or other data warehousing sources, into Redshift. And there's AWS Glue, a new service that we announced during the last re:Invent, which serves two purposes: one, it acts as a catalog for the data that you store in S3, so you know what you have and where, you can store schemas, and so on; and two, it provides ETL capabilities, so essentially you can set up jobs to do orchestration and data transformations. So that's just the data landscape.

Let me also give you a sense for momentum in the market. We launched Redshift about five years ago at re:Invent, and as of the second quarter of 2017, the Forrester Wave called out Redshift as a leader in the data warehousing segment. Specifically, what they call out is that Redshift has some of the largest data warehouse deployments, as well as the most deployments of any cloud solution out there. Now this is really thanks to all of you, the customers who use the service, who have been supporting it over the years, and all the feedback that you've given us to make it better over time. So, very quickly, what is Redshift? Redshift is a fully managed, massively parallel, columnar, petabyte-scale data warehouse. The couple of things we set out to do when we started this service were to build something that could deliver really high-performance SQL analytics, and to do that in a way that was cost effective, as well as in a way that did not require a whole lot of management of the database itself.
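(For illustration: since S3 is the default landing zone and the classic Redshift model analyzes data after it has been ingested, the usual pattern is a COPY from S3 into a cluster table. This is a minimal sketch with a hypothetical table, bucket, and IAM role, not a command shown in the talk.)

    -- Load delimited, gzip-compressed files that landed in S3 into a local Redshift table.
    COPY analytics.page_views                        -- hypothetical target table
    FROM 's3://my-data-lake/raw/page_views/'         -- hypothetical landing prefix
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    DELIMITER '|'
    GZIP
    REGION 'us-east-1';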
Now these are still the guiding tenets for us as we continue to evolve the service. In the years that we've been in operation, we have seen some changes in the overall big data landscape, and we've seen some new use cases emerge for analytics that are slightly different. Redshift is a clustered solution, where you have a leader node and some number of compute nodes. You can pick how many nodes you want, and you can even pick the instance type, whether you want an HDD instance or an SSD instance, depending on the cost/performance point that you wish to hit. And it requires that the data be ingested into this cluster and then analyzed. Now in the big data space, in the meantime, what we see is that people have started to generate, collect, and then store really vast volumes of data, and this data is arriving at a really high rate. So now there's all this data pooled in S3, and we see this with our S3 customers: all this data sitting there, and people want to drive value out of it. At its core, Redshift is a high-performance SQL processor, and we also want to support use cases that leverage the data in this data lake, what's being called a data lake. So to do that, we announced an extension to Redshift, a new feature called Redshift Spectrum, earlier this year. Spectrum is essentially an extension to the Redshift query engine that allows you to directly query your data in S3 without requiring any ingestion or transformation, meaning you can directly query data in open formats. This is quite powerful. Since Spectrum has been in use for some time now, we see a couple of popular use cases. People like to use Spectrum when they have really hot data and they want to do analytics on it immediately, as the data gets streamed into S3; there is really no time to do the ingest into the cluster, and people run a query across data on S3 as well as data on the cluster, as the data arrives. The other popular use case with Spectrum is for data you don't access very often: the less frequently accessed data that still has to be kept around for compliance purposes, or maybe you want to do some analysis once a year to see how trends have gone over the last 10 years, but it's not really worth storing all of that in your warehouse and replicating the data. So the data can live in S3 and be available for historical analysis, except it doesn't have to be in your active data warehouse, and Spectrum allows you to do that. Now, as you can imagine, how you pay for this is also different. For Redshift, you pay for the number of nodes that you use and the time that you use those nodes, which covers both the storage and the compute. With Spectrum, the payment model changes: we move to a per-query model, where you pay for the terabytes scanned by the particular Spectrum query that got run, and we charge $5 per compressed terabyte scanned. So again, there's a lot of flexibility: you can store your data either in the cluster or in S3, you can choose to move it between these two places, and you can still query across the entire data set as if it were one unit.
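(For illustration: a minimal sketch of the hot-data use case, a single query that spans a local dimension table and a Spectrum external table over S3. The schema and table names are hypothetical, and the registration of the external schema is sketched a little further down. Because Spectrum bills per terabyte scanned, keeping the S3 data partitioned and in a compressed columnar format directly reduces what such a query costs.)

    -- Events still landing in S3 (external) joined with curated data already in the cluster.
    SELECT d.region,
           COUNT(*) AS events
    FROM   spectrum.events_raw e          -- external table over S3 (hypothetical)
    JOIN   public.dim_customers d         -- local Redshift table (hypothetical)
           ON e.customer_id = d.customer_id
    WHERE  e.event_date = '2017-11-27'    -- partition column, prunes the S3 scan
    GROUP  BY d.region;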
Let me spend a minute on the architecture of Spectrum. To make Spectrum possible, we essentially had to do two things. The first thing is that we created this elastic pool of compute nodes, the purple layer that you see here, that can take tasks from Redshift and execute them on the data in S3; they read the data and execute tasks against it. This pool can be arbitrarily large, and it's an auto-scaling, multi-tenant fleet that gets used by every single Redshift cluster as and when needed. The second thing we had to do was build quite a bit of intelligence into the Redshift optimizer itself. We had to make it smarter, so it understands how to leverage the Spectrum fleet and execute queries against the data in S3 while still delivering really good performance. To do that well, we had to make sure that we don't scan more data than we absolutely need to, and we also had to make sure that the data moved across these layers, between S3 and Spectrum and between Spectrum and the cluster, is minimized. A lot of these optimizer changes went into making Spectrum both very fast and very scalable. One of the interesting use cases that evolved after we had Spectrum in the field was that customers started to use it pretty much without any local data: they would just stand up transient Redshift clusters with no local data, and you can have any number of them, and operate directly against data sets in S3. The only requirement is that the S3 data sets have to be catalogued; they have to be registered either with a Hive metastore or, now, with the AWS Glue catalog. I'm not going to go into all the details here, it's a pretty busy chart, but the only point I wanted to make is that Spectrum affords us another order of magnitude increase in parallelism. To make that point, let me just show it to you in this diagram. If you see here, every Redshift node, in fact every slice in a Redshift node, can access up to ten Spectrum nodes. So if you look at a DS2 8XL, it has 16 slices, and each of those slices can fan tasks out to another 10 nodes, which makes for some ridiculous levels of parallelism. This allows us to scale across really large data sets and still have very good query performance. We actually did a test where we ran a complex query, one with a couple of joins and some aggregates, across an exabyte-sized dataset, and the results for that query returned in under three minutes. There were multiple techniques in place to make this happen, but essentially the architecture and the optimizer changes were what made it possible. And here is a customer example. NUVIAD is one of the largest ad tech providers in Israel, and they do both real-time bidding and analytics on the ads that get published. They had a particular requirement: they had a large data set, petabytes of data, and they were also creating data at a very high rate, processing 700,000 transactions per second from over 50 channels. They needed to provide analytic responses to their customers as soon as the data landed; ingestion just wasn't an option for them, because it would slow things down. So they leveraged Spectrum to get to the data as it was arriving, in the format it was arriving in, and they did this with multiple clusters as well. Once you have multiple clusters, you can pretty much increase the number of users and the number of concurrent queries to arbitrary amounts. This essentially gave them a way to scale the storage in S3 pretty much independently of their compute needs, because they could spin up Spectrum and Redshift clusters as needed, depending on the particular workload and query volume at that point in time.
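(For illustration: as mentioned, the only requirement is that the S3 data sets are catalogued, either in a Hive metastore or in the AWS Glue Data Catalog. A hedged sketch of that registration from the Redshift side, with hypothetical database, bucket, and role names.)

    -- External schema backed by the Glue Data Catalog.
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'eventsdb'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- External table over Parquet files in S3, partitioned by date so queries scan less data.
    CREATE EXTERNAL TABLE spectrum.events_raw (
        customer_id BIGINT,
        page        VARCHAR(256),
        link        VARCHAR(256)
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/raw/events/';

    -- Each new partition landed in S3 is added to the catalog.
    ALTER TABLE spectrum.events_raw
    ADD IF NOT EXISTS PARTITION (event_date = '2017-11-27')
    LOCATION 's3://my-data-lake/raw/events/event_date=2017-11-27/';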
Switching gears a little bit: Redshift is considered one of the core services for AWS, and what that means is that when AWS launches a new region, Redshift is there as part of the initial launch. So we are in 14 regions and adding three more; it's pretty widely deployed across lots of places. And we are used by a pretty wide variety of customers, I just want to point that out. We have customers with a couple of hundred gigabytes in the service as well as several petabytes. In fact, amazon.com stores a very large volume of data and runs a lot of their analytics on Redshift; Black Friday is an interesting day for the service. We also have customers across different verticals, whether it's healthcare, finance, and so on. Spending a lot of time on compliance, making sure we had HIPAA compliance, FedRAMP, SOC 1, 2, and 3, all of these things, has helped quite a bit in giving these customers the level of security they need so that they can use Redshift as their choice for data warehousing. A word on partners: working with partners has always been part of the service right from the beginning, and we think it's super critical that we invest a lot of energy in that, because our goal is for Redshift to slide into your existing infrastructure to the degree possible, and that's possible only if we work with the tools you already use. Be it ETL tools or BI tools, we just want to work with the existing infrastructure that you have in place. Having a SQL interface and JDBC/ODBC helps us quite a bit, because these are all standardized, but even beyond that, we spend quite a bit of time working with our partners to make sure that as we evolve and they evolve, we still stay in sync. In fact, with this Fox use case, we worked pretty closely with Informatica to make sure the implementation went through without a hitch. All right, so moving on to some exciting stuff, some more recent launches. We recently launched a new node type for our SSD family called DC2; the prior version was called DC1, no surprises there. If I could do a Black Friday deal for Redshift, this would be it, because this instance offers on average 2x better performance for the same price as DC1. So if you're running on DC1, I strongly encourage you to try it out. These sorts of big numbers come about due to both hardware changes, the platform is simply better, these are just better instances, and also software changes that we've made. In terms of the hardware itself, DC2 uses NVMe drives, and these offer better bandwidth, but more importantly they offer much higher IOPS, 1.6 million IOPS for DC2, and that makes a significant impact on the workloads that you run on top of it. Additionally, the bandwidth from DC2 to S3 is double what was available on DC1, and that makes a big difference to how we run copies and backups and all the operations that go back to S3. These are also based on a customized Broadwell chipset, versus the Ivy Bridge that DC1 had. And here is a customer quote on the 9x improvement in response times that they saw; we've had pretty immediate adoption of this platform, just based on performance results.
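(For illustration: one practical input to the DC1-or-DS2 versus DC2 decision mentioned here and again in the closing thoughts is how much storage the cluster actually uses, since DC2 is SSD-backed with less raw capacity per node. A hedged way to gauge the current footprint is the SVV_TABLE_INFO system view, which reports table sizes in 1 MB blocks.)

    -- Approximate used storage across user tables, in gigabytes.
    SELECT SUM(size) / 1024.0 AS used_gb
    FROM   svv_table_info;

    -- The largest tables, to see where the space goes.
    SELECT "table", size / 1024.0 AS size_gb, pct_used, unsorted
    FROM   svv_table_info
    ORDER  BY size DESC
    LIMIT  10;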
The next thing I want to talk about, and this is a pre-announcement, it's not out yet, is result set caching. As the name suggests, this is a feature where we figure out which results are worth caching, because they're being used again and again, and we cache them. The cache itself resides in memory on the leader node, and if you have a workload where you run the same queries over and over again, which is pretty typical of dashboarding workloads, this can be a very significant performance improvement over what you see otherwise. Essentially, it allows us to skip the WLM queues, skip query processing, skip any kind of optimization, and just get directly to the results. Pretty much the best part of this feature is that it just works by default; it's transparent to you as a user. Once it gets rolled out, which is going to happen in the next couple of weeks, the clusters that are currently running will just automatically start using it. The other point is that, apart from the queries that can be cached, it also helps the running queries that cannot be cached, like the copies and the vacuums and everything else, because they have more resources available to work on them. Everything around what gets cached, and when it gets invalidated, is taken care of internally by the service; there are really no knobs to tune here. So here is an example result. This is a workload that we have internally, one that mimics customer workloads: it has a bunch of copies and vacuums and inserts all happening at the same time, and it has a combination of queries that can be cached and queries that cannot. We saw a pretty significant increase in throughput when we turned on result set caching. The red bar here is the new throughput with the cache on, and of course it affects queries that benefit from it, and not so much the queries that don't. Here is another result, something we got from a customer, and it speaks to the latency of queries rather than the throughput. The part below shows the latency: these are all dashboard queries, each one on the y-axis is a different query, and this shows the latency they had prior to turning on result set caching. The bar above is what they saw after turning on caching, and they actually let us know this is not a mistake: it's just that it was so fast that they couldn't actually plot it. Now granted, this is a workload that benefited from caching; not every workload is going to be so dramatic, but it can still be pretty meaningful, because typically we see workloads have a combination of queries that get repeated and those that don't.
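(For illustration: result set caching has since shipped; this is a hedged sketch of how you can observe it with the session setting and system view the released feature exposes today, which were not covered in this talk. Table names are hypothetical.)

    -- Run the same dashboard query twice; the second run can be served from the leader-node cache.
    SELECT event_date, COUNT(*) AS views
    FROM   public.page_views          -- hypothetical table
    GROUP  BY event_date;

    -- source_query is populated with the id of the original query when the cache was used.
    SELECT query, elapsed, source_query
    FROM   svl_qlog
    ORDER  BY query DESC
    LIMIT  5;

    -- The cache can be switched off per session, for example when benchmarking.
    SET enable_result_cache_for_session TO off;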
OK, so the next feature I'd like to talk about is short query acceleration. This is another performance feature, so again let's spend a minute on the workload itself. In a typical workload you have queries that are short running, meaning they take a couple of seconds to execute. For short-running queries, it's usually a user sitting on the other end writing the query, and they expect an answer within a short amount of time, and they expect consistent performance from those queries, because they're probably just getting up in the morning and sending a bunch of queries to see what happened with sales the prior day. And then there are long-running queries, which could be copies and inserts, but also long dashboarding queries and so on. With this feature, our goal is to build a system that automatically identifies the short-running queries and essentially creates an express lane for them. And it comes at a cost: the cost you pay for creating the express lane is that some of the long-running queries, the things that take 45 minutes, will take a little bit longer, maybe 47 minutes. But in our experience, and in talking to customers, that hasn't actually mattered, because if something takes two extra minutes on top of 45 minutes, it's not really meaningful from a customer experience point of view. So that was the goal of the feature. Now, it is not always easy for customers to figure out which of the incoming queries are short running, so we take over the burden of figuring that out and automatically moving them to the express lane. The way we do it is, first, we have an optimizer that gives us a cost estimate for the query, but the optimizer isn't really aware of the current runtime state of the cluster: how loaded the cluster is, how many users are on it, and things like that. So we actually embedded a machine learning algorithm, a classifier, that can learn from your running cluster how to take the set of queries you have and classify them into short-running and long-running queries; it starts with an initial classifier and re-learns every 50 queries. That way, similar to result set caching, this is a feature that's transparent and just works: once you have the patch on your cluster, it's going to be active and speed up the short-running queries. Here is a graph that shows some results. The orange line is the latency for queries when this feature was not turned on, and as you can see, the latency for queries that don't take a very long time is pretty high without this feature, and it goes down significantly once we turn it on. The latency of the long-running queries does increase a little bit, but that's the trade-off we've made. All right, the next feature I'd like to announce is support for nested data. Today you can use nested data with Redshift, but to do so you have to ingest the data as a string into a string field, and then use functions to query parts of the nested data. We just wanted to make this super simple. With this feature, we are allowing you to query nested data using a very intuitive dot notation, using Spectrum, so you don't even have to ingest the data; you can directly query the data as is, using an extension of SQL to access the nested fields within the data type. So here is an example. This is a JSON file that's pretty typical of clickstream data; it has a nested field for clicks. And here is a SQL query that computes clicks per page: for every link on the home page, how many clicks have happened. To access the clicks field, we just had to use dot notation to get to the clicks and then compute the aggregate, so it makes it very easy to query data in this format, and we support open formats like Parquet, ORC, JSON, and so on. One of the side benefits of this feature that we've heard from customers is: it's great that we can query the data, but sometimes I just want to ingest this nested data, or portions of it, into my cluster, which I do today. This feature actually allows you to write CTAS jobs that are far simpler than what you can do today. So if you want to transform and load nested data into the cluster, into flat, normalized tables, this can still help: you just run a CTAS command and move the data in using SQL that is far simpler than running a separate job to extract those fields for use within your data warehouse. So that's sort of a side benefit of this feature.
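(For illustration: a hedged sketch of both ideas, the dot/unnest notation for querying a nested clickstream structure in place and a CTAS that flattens it into a local table. All names are hypothetical, and it assumes an external table whose clicks column is an array of structs.)

    -- Query the nested data in place: the clicks array is unnested by referencing it in FROM,
    -- and struct fields are reached with dot notation.
    SELECT c.link,
           COUNT(*) AS clicks
    FROM   spectrum.clickstream ev, ev.clicks c    -- ev.clicks: array<struct<page:varchar, link:varchar>>
    WHERE  c.page = 'home'
    GROUP  BY c.link;

    -- Side benefit: flatten and load the nested structure into a normal local table with one CTAS.
    CREATE TABLE public.clicks_flat AS
    SELECT ev.user_id, c.page, c.link
    FROM   spectrum.clickstream ev, ev.clicks c;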
Here is an example, an orders and customers table. On the top is the normalized table, the way you would put it in a data warehouse in a flat format, and below is how it would look in nested format. The only point I want to make is that with nested data you do get better query performance, because the fact that it's nested is essentially equivalent to having pre-joined those tables. So if your query pattern is such that you're going to join the data anyway, it's worth thinking about whether you want to leave it in the nested format and query it directly, because it will actually perform better. The last thing I want to announce today is enhanced monitoring. We have a console for Redshift, the general administrative console, where you get different parameters about CPU usage, I/O rates, and so on. We are putting a lot of energy into improving that entire console experience, so you can manage the cluster and have better visibility into what is actually happening. The goal here is to make it easier to understand what is happening with your cluster, and also to troubleshoot in case you end up having issues. For simple problems, like if you feel the cluster is slow, we want to give you enough data on the console to understand what might have caused that, and first of all to even know, in absolute terms, whether the cluster is indeed slow or whether something else is going on. A couple of metrics that we will have immediately are query throughput and query latency. These are five-minute metrics, aggregated over five-minute periods, but you can display them for whatever time frame you pick. And this is really just a start, just two things we are starting with, but you will see a lot more on the monitoring front. Some closing thoughts before I hand off. I highly recommend that folks using DC1 look at DC2; especially the 8XL platform is significantly better in performance. In fact, the same is true even for DS2 customers: if you are using a DS2 cluster but its storage utilization is not very high, it's worthwhile to take a look at DC2. As we talked about today, Spectrum extends the capabilities of Redshift and provides a great deal of flexibility in how you can organize your data and query it, and there are more features coming for it, such as the nested data support, with more to follow. Again, if you haven't tried it, it's worth checking out. And there are significant performance improvements: we talked about result set caching and short query acceleration today, and we announced another feature called query monitoring rules earlier this year, which allows you to set up rules for your cluster, for the queues in the cluster, so that you can automatically kill or hop runaway queries. So if a query is spilling a lot to disk, or is going to hog all the resources in the cluster, you can just automatically kill it. When these features come together and are used in tandem, we see customers able to drive pretty significant throughput improvements on their clusters. So again, something to think about and look at. And as always, if you have questions, suggestions, or feedback for us, please use this alias to let us know. Thank you. [Applause]
Thank you. Good morning, everyone. I am Balaji Muta Ramalingam, Executive Director of Data and Analytics at 21st Century Fox. For the next 15 minutes I am going to share our experiences from our most recent analytics modernization project, which we did for the studio side of the business, Fox Film Entertainment. It's part of 21st Century Fox, and we are one of the largest movie-making studios in the world. We produce and distribute great content, reaching a global audience in theatres and at home, through customer platforms and broadcasters, in 193 countries and 63 languages. We have many analytical processes and solutions that support our business decision making; let me give you the key highlights of those processes. We process about a hundred terabytes of data on a daily basis. About 25,000 user requests run on our system. More than 100 data providers send data to us. And there are about 35,000 data pipelines that process and crunch the data on any given day. We have many challenges that go beyond our technology. In general there is a huge digital transformation in our media industry: our consumer is changing, and they have many options now in how they consume our content. That called for our system to be much faster in terms of ingesting the data and providing quicker insights, and it also called for us to process vast amounts and a wide variety of data. On the other hand, we were on a traditional data warehousing platform at its end of life, with multiple unplanned outages, struggling to scale. We had many options, but we knew they were not going to help us grow in terms of meeting our business needs, so we decided to modernize our analytical platform with a few key principles. Here are our key principles. First, democratize the data. It's a wide and broad subject to discuss, but we are very clear: we want to bring all our data to a common platform, break down our data silos, and provide cross-functional visibility to our business so that they can make the best decisions from our own data. Next, cloud: leverage what the cloud offers in terms of scale and elasticity, so that we can design our solutions much faster and stay nimble. We also wanted a cost-efficient solution so that we can scale. Considering all our complexities, enterprise scale in nature, we were looking for a solution, or a partner, with a wide suite of analytical tools, and it was our natural choice to partner with Amazon. We partnered together, finalized our approach and architecture by conducting multiple pilots, and fleshed out an aggressive project plan to make this happen in six months. This is our conceptual architecture that supports our modernization. As you see here, from left to right, we collect the information, store the data, and then analyze and do further analytics on it. At the collect layer we have scheduled ingest and data transfer. The key difference we are making from our modernization design standpoint is that we are bringing in a data lake, with vast object storage and a SQL MPP platform on top of it. It clearly answered all our challenges and also met our principles; now we can scale our data very cost-efficiently. This is the software and technology stack that supports our architecture. Traditionally we had multiple scripts to collect and ingest the information; as part of this project, we standardized that on a Python framework and used Lambda for serverless scale. We were also using Informatica as our data transformation and data ingestion tool: it collects data from the source systems and puts it into our S3 bucket in its raw format.
We also use Glue ETL, EMR, and Redshift for our data load use cases, plus Informatica pushdown for our Redshift data loads. We use the Glue catalog for tagging and creating metadata over our data lake, so it provides a data dictionary for us, and we use MicroStrategy as our visualization tool to provide data products for our end users; it reads directly from S3 and Redshift. As you might expect, we experienced many challenges and learned multiple lessons. Here are our key learnings from this entire project. Coming from a traditional background, we used to be limited by our size and scale, and we started with one large cluster. Down the line we found that splitting the cluster between read and write was the most appropriate thing for our use cases. Now we have smaller clusters with multiple use cases for read and write, and it also gives us great flexibility: if we need to run heavy-hitting reports during month-end or quarter close, we can run a bigger read cluster and then bring it down after that period. The next thing is vacuuming tables. Initially we were seeing huge growth in our cluster sizing. As I mentioned, we process about a hundred terabytes of data, and we were seeing a lot of deleted-row blocks accumulating in our clusters. We changed our vacuum strategy so that we run it on a weekly and even daily basis instead of monthly, and now we are managing our cluster size properly. Similarly with analyze: at a midpoint we were seeing a degradation in query performance. Coming from that same traditional background, we ran analyze at the end of each week, or during a downtime when the business was not running reports. We changed our design so that the analyze runs at the end of the data pipeline, on the most appropriate objects; the performance difference is day and night, and we got great performance. The commit queue is another interesting one. Redshift is quite sensitive to commits; it channels all commits through the leader node. We had been doing a lot of commits (as I said, if you are using Informatica, don't do that), and later we realized that it had completely clogged the commit queue on the leader node. We worked with our partners and changed the design so that we load in bigger batches and issue fewer commits. For example, a data pipeline for one of our products used to run for 43 minutes; after this change it ran in four minutes, which is very impressive. Schema design was also an interesting learning. We were often hitting disk space issues when heavy reports or heavy data loads were processing in our system. Eventually we found that a lot of broadcast redistribution was happening between the nodes. We worked with AWS and our partners, redesigned our distribution keys so that data is spread across all nodes, and changed our sort keys. We get great execution plans now, and we resolved those issues. Another interesting one is workload management. We have different use cases: continuously running reports, some broadcast reports, as well as heavy-hitting data science SQL hitting our systems. What we have done is divide those with appropriate rules, setting up the queues, and we also change them dynamically as our use cases change.
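(For illustration: a hedged sketch that pulls several of these lessons together, explicit distribution and sort keys to avoid broadcast redistribution, loads batched into a single transaction so the leader node sees one commit, and vacuum/analyze run at the end of the pipeline on just the tables it touched. All table, bucket, and role names are hypothetical.)

    -- Distribution and sort keys chosen so common joins are co-located rather than broadcast.
    CREATE TABLE fact_sales (
        sale_id      BIGINT,
        customer_id  BIGINT,
        sale_date    DATE,
        amount       DECIMAL(12,2)
    )
    DISTKEY (customer_id)      -- co-locate with the customer dimension on the join key
    SORTKEY (sale_date);

    -- Batch a pipeline run into one transaction: one commit on the leader node instead of many.
    BEGIN;
    COPY fact_sales
    FROM 's3://fox-datalake/raw/sales/2017-11-27/'                -- hypothetical bucket
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'       -- hypothetical role
    DELIMITER '|' GZIP;
    DELETE FROM fact_sales WHERE amount IS NULL;
    COMMIT;

    -- Maintenance at the end of the pipeline, on just the tables the run touched.
    VACUUM DELETE ONLY fact_sales;
    ANALYZE fact_sales;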
We also engaged closely with AWS; their experience is really important, and it's priceless when you are running a project of this scale against time. That support matters a lot. So here are our benefits. We absolutely gained business agility from this project. We are completely empowered now: we are meeting our business needs, we can take on any data at scale, and we can scale whenever we need to. More than that, we are seeing very tangible benefits in addition to this agility. We reduced our overall overhead cost by about 10 to 15 percent, decommissioned hundreds of our servers, and released data center footprint; now we can proudly say that our studio analytics run on the cloud. We also achieved 30 to 35 percent process improvement and efficiency across the board, across all applications. I want to mention one more thing: it's not just the platform alone that helped us get these benefits, it's the partnership, how we worked together with Amazon. Whenever we asked for an enhancement or a bug fix, they always listened to us and released it on time, and they worked closely, hand in hand, and helped us meet all the milestones on time. It is a great partnership. So from here, where are we going? Amazon is innovating fast, and at the same time, at Fox, we are embracing those innovations just as fast. Vidya was just mentioning all those new features; we are already working with them and releasing them, and in fact we have released a few of them into our production systems. We upgraded to DC2 and saw a great performance improvement, about 50 percent or more. We also turned on short query acceleration, and we are seeing great performance for our MicroStrategy use cases, where our list-of-values and prompt queries come back in sub-seconds. And Spectrum took our data lake to the next level: we now separate our compute needs from our data needs entirely. We no longer need to put our cold data into Redshift; Spectrum handles it under the hood, blends it nicely, and provides the end result. Query result caching we are still evaluating; we are seeing promising results and looking forward to implementing it. We are also expanding our data lake to our other business units, based on the experience we got from our studio implementation, and we are working on a project to provide real-time analytics solutions using Kinesis. We are excited to extend our data lake to provide deeper insights and predictive insights using machine learning on AWS. I hope this helps you. Thank you, thank you for your time. [Applause]
Info
Channel: Amazon Web Services
Views: 6,705
Rating: 4.4545455 out of 5
Keywords: AWS re:Invent 2017, Amazon, Analytics & Big Data, ABD327, AI, Redshift, Migration
Id: 3Xg3yu5xnMY
Length: 43min 35sec (2615 seconds)
Published: Wed Nov 29 2017