Best Practices for Data Warehousing with Amazon Redshift

Video Statistics and Information

Captions
In this session we're going to talk about data warehousing best practices using Amazon Redshift. Quickly, I want to go over Amazon Redshift's history and development, get into the cluster architecture, talk about some concepts and terminology, do a storage deep dive, and then talk about the new features we launched over the last six months. There were some significant features launched in that time, and I'll also cover what is coming in the pipeline.

So, history and development. How many of you are using Redshift today? Okay, not many. What we did is we took a very familiar database, PostgreSQL, wrapped it with an MPP, columnar engine, added all the dependent AWS services to it, like KMS, VPCs, and Route 53, and gave you Redshift as a package. Amazon Redshift is a collection of all these services for you; it is not Postgres in itself, it just exposes something like Postgres, and it's MPP and columnar throughout. Since launch (we launched it on Valentine's Day 2013) we have fallen in love with it; it has been one of our fastest growing services in the AWS ecosystem. We have introduced around a hundred significant patches to Redshift and around a hundred and fifty significant features since that time. Each of these patches is automated: you don't have to take downtime to install them, we automatically apply them during maintenance windows for you, we upgrade your cluster, and we keep putting in smaller patches and optimizer fixes as we go.

So let us look at the cluster architecture and dive deep into how the cluster works and what it looks like. The cluster is split into two pieces: one is the leader node, and the others are the compute nodes. There are one or more compute nodes attached to a leader node. The leader node is the SQL endpoint, so your Tableau servers, your Aginity Workbench, all these clients connect to the Redshift leader node, and the compute nodes are where all your data is stored.

Going deeper into the components of the leader and compute nodes: the leader node does the parsing and execution planning of your queries. As soon as a query is submitted, the leader node converts it into C++ code and pushes it down to the compute nodes to execute. The leader node also exposes pg_catalog tables; if you are familiar with Postgres this will look very familiar, because PostgreSQL exposes pg_catalog tables so you can understand its inner workings, and we do the same thing on the leader node. On the compute nodes there is the query execution processing engine, along with backup and restore processes; all backups go directly out to S3, and we continuously back up your data into S3 for you. There is also a replication process that happens between the compute nodes, which I'll talk about in a bit. Finally, there is local storage attached to the compute nodes, which contains your disks, your slices, tables, columns, and blocks. In this talk we're going to go deep into these aspects.
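To make the leader node's catalog support concrete, here is a minimal sketch of inspecting a table's layout through the PG_TABLE_DEF catalog view that the leader node exposes; the table name "sales" is an assumption for illustration.

```sql
-- PG_TABLE_DEF only returns tables in schemas on the search_path,
-- so make sure the relevant schema is included first.
SET search_path TO '$user', public;

-- Hypothetical example: show each column of a table named "sales",
-- its type, compression encoding, and whether it is a dist or sort key.
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'sales';
```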
So, a few concepts and terms before we get into the storage layer: nodes, blocks, slices. Redshift has been designed to reduce I/O. I/O is your biggest performance concern in a database: if you can reduce your I/O, you are more performant.

For starters, Redshift is columnar storage. With row storage, you need to read the entire row before you can get to a particular column, and wasted I/O means higher latencies for you. Columnar storage lets you read the data for a particular column from that column's blocks alone; you only scan the blocks that are relevant for the query, you don't scan everything.

Compression. Compression reduces your I/O as well, and your storage overhead. If you look at the ENCODE clauses at the end of the column definitions, they say what kind of encoding, what kind of compression, Redshift is applying to that column. Each column grows and shrinks independently of the others, so you can say "I'll start with run-length encoding on this one and then move to, say, ZSTD or LZO", and we will still do that for you, because each column compresses individually.

The third important thing for I/O reduction is zone maps. Zone maps are in-memory block metadata: they contain per-block minimum and maximum values. For each block we store the minimum and maximum value it contains, so when a query runs through the blocks it can skip blocks that cannot contain the data it is interested in, which avoids unnecessary I/O for you. It's very important that you define encodings and sort keys for your table, because once your data is sorted your zone maps work far better.

Slices. A slice is a virtual compute node; think of it that way. Each node has 2, 16, or 32 slices depending on the instance type you choose for that node, so if you have a two-node cluster of dc1.large you will have four slices, two per node. A slice is where your data is stored, and a slice only processes its own data; it doesn't borrow data from another slice to process.

Data distribution. Very important, as important as sort keys. Redshift is a massively parallel processing system, so we need to distribute the data across the cluster. There are three ways of distributing data in Redshift: you can distribute the data using a key that you define, you can distribute the data to all the nodes in the cluster, or you can ask us to evenly distribute the data across the nodes for you. On the key side of things, you need to ensure that your key distributes the data evenly; if you have lopsided distribution you will see that your cluster doesn't perform optimally.
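As a sketch of how these ideas come together in a table definition, the following hypothetical DDL spells out per-column encodings, a distribution key, and a sort key; the table, columns, and encoding choices are made up for illustration and are not taken from the talk.

```sql
-- Hypothetical fact table: compression per column, data distributed by
-- customer_id, rows kept sorted by sale_date so zone maps prune well.
CREATE TABLE sales (
    sale_id     BIGINT        ENCODE zstd,
    customer_id INTEGER       ENCODE zstd,
    sale_date   DATE          ENCODE delta,
    region      VARCHAR(16)   ENCODE bytedict,
    amount      DECIMAL(12,2) ENCODE zstd
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
```

A small, frequently joined dimension table could instead use DISTSTYLE ALL, and DISTSTYLE EVEN is the fallback when no column distributes the data evenly; these correspond to the three distribution styles mentioned above.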
Disks. Each node contains locally attached, ephemeral disks, and we provision 2.5 to 3x the capacity that we advertise on the nodes. So when we say the node capacity is 2 terabytes, we have close to 6 terabytes that we use internally. What do we use it for? We use it for storing data from remote nodes: because these disks are ephemeral storage, we take a part of the data from each node and duplicate it on another node as well. So there is local data storage accessed by the local compute node, and remote data storage accessed by the remote compute nodes. We have internal views that expose this; you can look at those views to understand how much of a remote node's data is stored on your local compute node.

Blocks. Blocks are the immutable 1 MB blocks that we use in Redshift; all data is stored in blocks. Each block carries metadata: the zone map information, and also information regarding MVCC. Because blocks are immutable, when you do an update we don't go and change a record inside a block; a new block has to be written. So you need to be careful about doing many updates on Redshift: you need to do a vacuum or a deep copy right after an update that spans many rows, because you'll have ghost rows inside the blocks. A full block may contain anywhere between 16 and 8.4 million values, so if I can pull one single block of data into memory I can read up to 8.4 million values for that table.

Columns. Columns are logical structures accessible through SQL. The properties of a column include the distribution key, sort keys, and the compression algorithm being used. Columns shrink and grow independently: one column may have run-length encoding while another column has a different encoding, and even within the data itself one column may use run-length encoding for a hundred rows while the next hundred rows are not run-length encoded. As for system columns, there are three system columns per table per slice for MVCC.

So, a few of the new and upcoming features on Redshift. How many of you have heard of Spectrum? Anybody attend the earlier talk with Adrienne? Adrienne was mentioning that S3 is where you keep all your data, and that is the philosophy that Amazon is going with: S3 is your data lake. You put all your data in S3 and then query it using your Redshift cluster, query it using Athena, query it using EMR. Spectrum allows you to query your data stored in S3 without having to load it into the Redshift cluster. It is fully SQL supported and ANSI SQL compliant. It pushes the SQL predicates into the Spectrum layer so that your cluster is not overburdened with that work: instead of pulling, say, a terabyte of data in and applying predicates to it, Redshift can push the predicates into the Spectrum fleet, do all the predicate processing there, and only bring back the amount of data needed for your query. You can join your Redshift tables with your Spectrum tables, which are external tables, which effectively means I can start rolling data that is not accessed so frequently out of Redshift and keep it in S3. I can also have multiple Redshift clusters hitting the same data on S3 through Spectrum, which means I increase the concurrency for my workload: instead of going with fifteen concurrent queries, I can have two Redshift clusters looking at the data and thirty people who can run queries against it.

The architecture remains the same: we still have a leader node and a few compute nodes. Whenever you submit a query that involves data from an external store like S3, Redshift pushes that part of the work out to the Spectrum nodes; they scale out based on the number of nodes on your Redshift cluster, and the predicates are pushed down to them as well. The Spectrum nodes gather the data from S3; remember, they don't load the data into the Redshift cluster, so your cluster still remains a small, agile cluster. They read the data on S3, process it, apply the predicates, and give the results back to your compute nodes, which can do further processing by joining that data to existing tables on the Redshift cluster and providing you the results.
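As a hedged sketch of the Spectrum workflow just described, the following statements register an external schema against a data catalog, define an external table over files in S3, and join it to a local table. The database name, IAM role ARN, bucket, and all table and column names are hypothetical.

```sql
-- Register an external schema backed by a Hive/Glue data catalog.
CREATE EXTERNAL SCHEMA s3_archive
FROM DATA CATALOG
DATABASE 'sales_archive'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files sitting in S3;
-- no data is loaded into the cluster.
CREATE EXTERNAL TABLE s3_archive.sales_2015 (
    sale_id     BIGINT,
    customer_id INTEGER,
    amount      DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/2015/';

-- Join cold data in S3 with a hot table that lives in the cluster
-- (assumes a local "customers" dimension table exists).
SELECT c.region, SUM(s.amount) AS total_amount
FROM s3_archive.sales_2015 s
JOIN customers c ON c.customer_id = s.customer_id
GROUP BY c.region;
```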
The data catalog for Spectrum comes from the Apache Hive metastore: if you are using a Hive metastore you can reuse it, and if you are using Glue, Glue integrates with Redshift Spectrum as well. There's a paradigm shift enabled by Redshift Spectrum. Earlier you used to say, "I will only be able to analyze or visualize a small subset of the data, because my data warehouse cannot hold all of it." With Spectrum you can put all your data in S3 and query it through your Redshift cluster using proper SQL statements.

A few of the recently released features on Redshift; I'm not covering everything here. Performance enhancements: our vacuum speeds are 10x faster now, our snapshots and restores are 2x faster, and our queries are up to 5x faster. We recently introduced a capability called query monitoring rules. Query monitoring rules watch your in-flight queries and ask: is this query taking too much memory, is it taking too much CPU, is it returning more rows than needed? They allow you to kill the query, which basically means your data scientists cannot run runaway queries on the Redshift cluster. They can still run queries and pull data out, but they have to be conscious that other people are using the cluster as well. You define the rules, and we execute them behind the scenes for you. We have enhanced VPC routing, so we support S3 access through a VPC endpoint today: you can put access policies around those S3 buckets and restrict who can unload data from Redshift into the buckets or load data from the S3 buckets into Redshift.

IAM authentication. We pre-announced this earlier, when we had a few customers do a private beta with us, and we had partners come in and help develop the JDBC drivers for it. It uses a custom JDBC/ODBC driver that we provide today; that driver can talk to your AD federated logins and create appropriate temporary logins within Redshift for you to access the data. If you're a Redshift user on Tableau Desktop, Tableau have released their own connector for this as well and it works with Tableau. It's very easy to set up; we went GA two weeks back. If you are using AD federation such as Okta, ADFS, or Ping Federate, it's very easy to set it up and start using Redshift in AD-authenticated mode. If you're interested in doing this, please reach out to your account team; we are more than happy to explain how the whole thing works and help you configure it as well.

Lots of things are coming; these are the top requested features at this time. Automatic and incremental vacuum: today a lot of customers spend time doing vacuums. Vacuum is their biggest bottleneck in an ETL process: because they update and delete records from the database, they want to vacuum, and they also want to sort the data being inserted into the Redshift cluster, so they run vacuums, and vacuums take a lot of time. When they do continuous ETL, with no predefined window, the vacuum process starts eating into their ETL time, which affects your business users. We are working on an automatic and incremental approach to vacuum, where we vacuum for you; you don't have to do anything, leave it with us and we will take care of it. This is very similar to the PostgreSQL autovacuum feature, although PostgreSQL doesn't have an incremental vacuum; the incremental part is something that we are developing within Redshift.
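Until automatic and incremental vacuum arrives, the housekeeping the speaker describes is done by hand. A minimal sketch, assuming a table named "sales" that just took a heavy batch of updates and deletes, might look like this.

```sql
-- Reclaim space from ghost (deleted) rows and re-sort the table,
-- then refresh the statistics the query planner relies on.
VACUUM FULL sales;
ANALYZE sales;

-- Cheaper variants when only one aspect matters:
-- VACUUM DELETE ONLY sales;  -- reclaim space without re-sorting
-- VACUUM SORT ONLY sales;    -- re-sort without reclaiming space
```

A deep copy (recreating the table and reinserting its rows) is the alternative the speaker mentions when a vacuum would take too long.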
Short query bias: we can look at queries and say, hey, this query will probably run faster than a query that is ahead of it in the queue, and we may move that query to the top of the queue. We keep monitoring the queries, we know their execution plans, we know how much time each of these queries took, so we can move them up in the queue so that they execute faster and your users don't have to wait for them.
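To see why short query bias matters, one way (a sketch, with arbitrary thresholds) is to look for short queries that spent longer waiting in a workload management queue than they spent executing, using the STL_WLM_QUERY system table.

```sql
-- Hypothetical diagnostic: short queries (< 5 s of execution) that
-- queued for longer than they ran. Times in STL_WLM_QUERY are in
-- microseconds, hence the conversions below.
SELECT query,
       service_class,
       total_queue_time / 1000000.0 AS queue_seconds,
       total_exec_time  / 1000000.0 AS exec_seconds
FROM stl_wlm_query
WHERE total_exec_time < 5 * 1000000
  AND total_queue_time > total_exec_time
ORDER BY total_queue_time DESC
LIMIT 20;
```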
Info
Channel: Amazon Web Services
Views: 18,355
Rating: 4.8232045 out of 5
Keywords: AWS, Amazon Web Services, Cloud, cloud computing, AWS Cloud, AWS Canberra Summit 2017, Data Warehouse, Amazon Redshift, UTS, University of Technology Sydney, AWS Public Sector Summit Canberra 2017
Id: ZvAqJtZ5JRs
Length: 18min 3sec (1083 seconds)
Published: Mon Sep 11 2017