AWS re:Invent 2017: Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum (ABD304)

Video Statistics and Information

Captions
My name is Tony Gibbs, I'm a data warehousing solutions architect with Amazon Web Services, and today we're going to be going through best practices on Amazon Redshift. So let's get started. I always like to ask this one just to get a feel for where you guys are all at: how many people here use Amazon Redshift on a regular basis? Wow, that's awesome. How many of you are evaluating Redshift; you don't actually use it today but you're maybe looking at it? Okay, this is perfect, this is exactly what I was hoping I'd see. How many people have never used Redshift and are just interested in it? Okay, a couple of you. So this talk is mostly geared for the first group, but for the rest of you I have tried to make it in such a way that if you're familiar with data warehousing and databases, you should be able to get something from this, enough to at least get you started on Redshift.

So let's get started. We're going to quickly move through this first part, the history and development of Redshift: where it came from and the last four years of development that we've done. Then we're going to move into concepts and table design; some of the concepts are mainly to make you aware of how we talk about Redshift, and then we'll go through table design. The next part is best practices on ingestion, data loading, ELT, and that sort of thing. Then we're going to wrap up; the last part will be on cluster sizing, so that one will be fantastic for the few of you who have never used Redshift and are maybe going to give it a try, or if you're working on a migration or something like that, that section will be good for you. And then we're going to finish with whatever Q&A time is left and you guys can ask me all sorts of stuff. I will also hang out after the presentation, outside or wherever, and you can continue to ask me questions; I'll stay for as long as you guys keep having questions.

So let's get moving on with the history of Redshift. It really starts out with Postgres. If you connect to Redshift for the first time, or for all of you who have used it, you will notice that it returns back a Postgres connection string. We obviously rewrote the entire storage engine; it's a columnar storage engine, and we made the system MPP, which is why you can scale it horizontally up to 128 nodes. Then we added a lot of analytics functions: approximate count distinct, approximate percentiles, and all these other kinds of functions that you typically use in an analytics data warehouse. We wrapped all of it up in AWS: there's integration with S3 for backups, restores, loading, and all kinds of stuff; KMS for encryption; IAM for authentication and connecting to S3 and such. It really is the combination of all of these components that makes Redshift what it is. We launched Redshift almost five years ago; it went GA on Valentine's Day, February 2013, and since that time we have added a lot of features and functionality; we've really been innovating and adding to the product. We do patches typically on a two-week cadence. It is a fully managed service, so that means you set a 30-minute maintenance window and we roll out those patches continuously, security updates and all that kind of stuff, and it's not something you really have to worry about; just check the patch notes and you get some cool new features. So let's move on to the concepts and table design.
I'm going to start out with the architecture of Redshift. The top part up here is where you connect from the SQL clients and BI tools. We actually supply JDBC and ODBC drivers, but Redshift also supports the use of Postgres drivers; say you're wanting to connect from Python, for example, you can use psycopg2, which is the Postgres Python driver, and you can connect to Redshift with that. Your connection connects to that top blue box there, which we call the leader node. The leader node does all of the query parsing, query coordination, and all of that kind of stuff. It also stores the PG catalog, so if you're familiar with Postgres you'll see that the PG catalog in Redshift is intact. Behind the leader node sit up to 128 compute nodes; this example has three compute nodes, and you can go from 2 to 128 of those. That's where the data actually resides in Redshift, and that's also where all of the heavy processing and heavy lifting on the data happens. The one thing to take away is that they do everything in parallel, all of them together, and that's what we call a massively parallel architecture: every compute node works on a single query all at the same time. These compute nodes also talk directly to S3; this is typically how you load data in and out of Redshift, and we'll go much deeper into that. Backups and restores also go directly from the compute nodes to S3. Earlier this year we released what we call Amazon Redshift Spectrum. This gives Redshift the ability to query S3 directly, and we do this by provisioning extra compute: there's another layer of compute that sits in between, an elastic layer, and that's what actually does the querying of S3 and passes the data back up to your Redshift compute nodes and then up through the leader node to you. You can also load data through the Spectrum layer, and we'll talk about that in the ingestion section as well.

So the first piece of terminology is columnar. Redshift is a columnar data warehouse, and what that means is we store data on disk column by column rather than row by row. The reason we do this is that the queries you typically run against a data warehouse, like sums and averages and various other aggregate functions, typically only operate on a subset of the columns. So when you're only querying some of the columns across millions or billions of rows, a columnar architecture greatly reduces I/O on those sorts of operations. Just to illustrate that, suppose we have this deep dive table, very simple, with only four rows in it, and we have a SQL query where we're going to select the minimum date out of that table. In a row-based database like Postgres, for example, assuming there are no indexes or anything like that, you would end up having to scan through the entire table to find that minimum date, whereas in a columnar data warehouse like Redshift we only need to read the data for that date column, and thus we reduce I/O.

The next piece is compression. Compression does two things in Redshift. One, it allows you to store significantly more data in your cluster, up to about four times more data. It also improves the performance of Redshift, and the reason why is that we actually reduce I/O: we've stuffed more data into our data blocks and we're able to read more off disk quickly. There's a lot of CPU available, and I/O is still the bottleneck, so that's why it improves performance.
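To make that deep dive example concrete, here is a minimal SQL sketch; the talk only mentions the values (SFO, JFK, dates, an audience ID), so the table name, column names, and types are assumptions:

-- hypothetical shape of the deep_dive table used throughout the talk
CREATE TABLE deep_dive (
  aid INT,      -- audience id
  loc CHAR(3),  -- location, e.g. 'SFO', 'JFK'
  dt  DATE      -- event date
);

-- a columnar engine can answer this by reading only the blocks of the dt column
SELECT MIN(dt) FROM deep_dive;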
We do our best in Redshift to figure out the optimal compression for you: the first time you load data into a Redshift table we will try to figure out that compression. We also have a utility, or command, built into Redshift called ANALYZE COMPRESSION that will find the optimal compression for you. Just to quickly give an example of how compression works in Redshift: same table, same data. If we modify the DDL with these ENCODE statements, that's how compression is applied in Redshift. So we're going to add ZSTD (Zstandard), which is a new compression type we added earlier this year, in the first example; BYTEDICT and RUNLENGTH are just a couple of the other encoding, or compression, types that we have.

So, the best practices on compression in Redshift: the first is to try to apply compression to all tables. If you run the ANALYZE COMPRESSION command you may notice sometimes you get RAW, which means no compression, for some columns or maybe for all of them. If that's coming back, it's because the table is so small that there's not really a benefit to adding compression to it, so don't be alarmed if you see that. We also have a GitHub link here; we have a ton of scripts on GitHub and you're going to see these links throughout the presentation, but this is the first one. This utility on GitHub is one that we built and posted that will help you migrate tables that maybe don't have optimal compression in your Redshift cluster; it'll rewrite the table and copy the data over. That SQL query there uses one of the PG tables that we've actually added into the PG catalog schema, and that would be an example of how you'd find the compression on an existing table.

The next concept to talk about is what we call blocks in Redshift. This is basically how the columns are constructed: they're one-megabyte immutable chunks that we store on disk for each column. Every block will have one of eleven different encodings, with one of those encodings being RAW, which means uncompressed. When you factor compression in, one of our data blocks can store millions of values, so when you think about that, a single block in Redshift holds the values of millions of rows for that one column. The next concept is zone maps, and this really does relate to blocks, because what they are is metadata about the blocks. The main piece of metadata is that we store the minimum and the maximum values within each of those one-meg blocks, and the reason we do this is that when a SQL query comes in, we can check this in-memory data structure and we can prune data out.

The next concept is data sorting. This is physically sorting data on disk in Redshift, so this is you basically picking one or more columns and sorting the table by those, and the point of this is to make the zone maps on the previous slide more effective; that is the primary purpose of a sort key. Typically they're put on the columns that you're filtering on, so if you have a WHERE clause, essentially a predicate, and you're saying where this date is between these values, for example, that's where you'd want to place your sort key. Obviously sort keys depend on your query patterns and your business requirements and that sort of thing, but a lot of times they will land on the columns that you primarily filter on in your SQL queries.
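The ENCODE DDL and the catalog query are shown on slides rather than spoken, so here is a hedged sketch of what they might look like for the same hypothetical deep_dive table; the encoding choices just mirror the three named above:

-- compression is declared per column with ENCODE
CREATE TABLE deep_dive (
  aid INT      ENCODE zstd,
  loc CHAR(3)  ENCODE bytedict,
  dt  DATE     ENCODE runlength
);

-- ask Redshift to recommend encodings based on a sample of the table's data
ANALYZE COMPRESSION deep_dive;

-- inspect the encodings on an existing table via the catalog
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'deep_dive';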
Just to give a quick example of how sorting is done in Redshift: same table, same data we've already been working with; we modify the DDL, and in this example I'm adding a sort key on first the date column and then the location. If I were to sort this table now, the data would end up like this, sorted first by the date and then by the location, and the point here is to make the zone maps more effective. To illustrate the two working together, suppose we have four data blocks here, we have the zone maps, which are the metadata held in memory in Redshift, and we have a SQL query where we're just going to count the number of records on a certain day. What Redshift is going to do is check the zone maps first and go: I only need to read these three data blocks off disk, I can skip reading this one, which reduces I/O, and that's exactly what the zone maps do. If we take that same table and we sort it by the date, now those zone maps are in a more optimal condition for the SQL query and we reduce I/O further. So the main point is that sort keys are primarily about making the zone maps more effective. Usually it will end up being a timestamp, because in data warehousing you end up having some sort of temporal column that you're searching or filtering between, but that's not always the case. If you have more than one column in your sort key, ideally you want the lower-cardinality columns first. So maybe there's an organization ID and then a timestamp, for example, and that organization ID only has a dozen values: that would be a case where you would change the order and not have your date column first. We do have a couple of GitHub scripts: if you have an existing workload you can run them against your Redshift cluster, and based on the logs in your cluster and the queries you typically run against those tables, they will look at that and return back what you're usually filtering on. One thing to make a note of: sort keys on really small tables are not really necessary. If you have a table of, say, 10,000 rows, you don't really need to sort it, because chances are the table only has one block anyway and there's not really anything to optimize in its zone maps.

The next concept is slices, which is a really important concept in Redshift. This is how we get parallelism within a Redshift cluster: every single one of our compute nodes is divvied up into 2, 16, or 32 slices, we store the data physically per slice, and each individual slice will only operate on the data that belongs to it. So again, it's how Redshift gets its parallelism within a single node. So how do we distribute the data amongst these slices? This is where the distribution keys come in. The first one is what we call DISTSTYLE KEY. What essentially happens when you use this distribution style is you pick a column and assign it; the value for that row in that column gets hashed, the hash corresponds to one of the slices in your cluster, and that's where the data ends up landing. The next distribution style we have is EVEN, and what EVEN essentially means is we're just going to round-robin the data for you on your behalf; it's the default choice in Redshift. The last choice is distribution style ALL, and what that essentially means is we take an entire copy of the table and we write it out onto each node in the cluster, so the whole table exists on every single node.
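The DDL for these choices is again on slides; a hedged sketch of how the sort key and each distribution style might be declared, still using the hypothetical tables and columns from above:

-- compound sort key: date first, then location (swap the order if a lower-cardinality
-- column such as an organization id is what you usually filter on)
CREATE TABLE deep_dive (
  aid INT,
  loc CHAR(3),
  dt  DATE
)
DISTSTYLE KEY DISTKEY (aid)   -- rows are hashed on aid to pick a slice
SORTKEY (dt, loc);

-- the other two distribution styles, for comparison
CREATE TABLE deep_dive_even (
  aid INT, loc CHAR(3), dt DATE
) DISTSTYLE EVEN;             -- round-robin, the default

CREATE TABLE small_dim (
  dim_id INT, name VARCHAR(32)
) DISTSTYLE ALL;              -- a full copy of the table on every node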
The reason for that is typically dimension tables, small tables that you're joining to, where you're trying to reduce the network broadcast when you do joins on those tables.

So I'm going to illustrate how each of these distribution styles works, starting with EVEN. Same deep dive table I've been talking about this whole night, and we're going to insert these four rows into it. We start by inserting the first one there, SFO, and it lands on the first compute node, on the first slice. The second row, JFK, lands on the second slice of that first compute node, then the third one and the fourth one; it's just round-robining the data through the cluster. Pretty simple. So let's pick a key. Let's use DISTSTYLE KEY in our second example, and I'm picking the location column here, which corresponds to those values SFO, JFK, SFO, and JFK. If I take that first row, maybe it lands here on this slice, and the second row lands on that slice; what's going to happen is SFO hashes back to that first slice and JFK lands on that second slice. Makes sense. However, this is an example of a bad distribution in Redshift. Remember I talked about MPP, massively parallel processing: what's going to happen here is if you execute a query against this cluster, your second compute node doesn't have any data in it at all, it does absolutely nothing, and so you have one node doing all of the work. That's what we call row skew. So let's pick a better column to distribute by. Now I'm going to pick the aid, or audience ID, which looks like a primary key. That first row might end up hashing out to that slice, the second one over there, the third one there, and the fourth one over there. Now, I obviously baked this example so it worked out perfectly; statistically, if you have a large number of rows, it will work out fairly well. So that's the idea of how distribution style KEY works. Let's talk about distribution style ALL. What we do here is, as each row comes in, we write it to the first slice on each node, so that there is an entire copy of the table on each node in the cluster. That's how DISTSTYLE ALL works.

So what are the best practices? Use distribution style KEY primarily for optimizing join performance. If you have two tables which you join, and the ON clause has the two columns that you're joining on, ideally you want to make that the distribution key for both of those tables. The other primary reason for using distribution style KEY is if you're copying data from one table into another using an INSERT INTO and a SELECT FROM the other table: if those two tables share the same distribution key, you'll find that that insert is significantly faster as well. Those are the primary reasons for it. Now, I mentioned skew; how do you figure that out? There's a query on the system table SVV_TABLE_INFO, and the column is called skew_rows. Ideally that value should be somewhat close to one. What it means is, if we take the slice with the least amount of data and the slice with the most data, that number is the ratio between those. Ideally we want it to be something like 1, 1.1, 1.2; those are totally fine, but once you start getting off into the twos and threes, that's a big difference in data and your query performance will slow down. So that's where you want to check.
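A hedged sketch of that skew check against SVV_TABLE_INFO (the view and its skew_rows column are as described in the talk; the ordering and extra columns are just one way to look at it):

-- tables with the worst slice-to-slice data skew first
SELECT "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;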
Distribution style ALL is for optimizing join performance between your tables, usually your fact tables and your dimension tables, and as a rule of thumb it's safe to set that distribution style on a table if it has three million rows or less. There are cases when it can be more, but as a general rule of thumb three million rows or less is a good number. There is one caveat: you also shouldn't be writing frequently to that table, because obviously a write to that table means every node in the cluster is rewriting the same data. If the two distribution styles above, KEY and ALL, don't apply, or if you're really not sure, just use distribution style EVEN. It won't do the wrong thing, it won't muck performance up; you will get good performance using distribution style EVEN. Think of the other two as ways to optimize and make things better.

So this is sort of a summary, plus a few extra points. The first point is to materialize frequently queried columns into your fact table. What I mean by this is that a lot of times we'll end up having a dimension table, we join it to the fact table, and we end up filtering primarily on that dimension table and using the join to filter our fact table. This is sometimes done with time dimension tables, and while it works in Redshift, the reason it doesn't perform as well is that we're not able to leverage the zone maps as effectively on your fact table and reduce the I/O there. So if there is a column that you're typically filtering on, materialize it into your fact table and filter on it there. It's a columnar data warehouse, so adding a couple of extra columns isn't going to reduce performance. The next is calculated values. These are cases where maybe you have a WHERE clause and you're wrapping a column in a function to extract a value out, and that's part of your WHERE clause. If you can instead extract that value out, write it into a column, and actually query off that, then the zone maps can actually be used; there's a sketch of this after the summary. One that I do unfortunately see in some customer clusters, when I'm helping customers troubleshoot performance issues, is temporal columns as a distribution key. Think of these as dates, or say the number of months in a year; that's a good example of how not to do this: if you only have 12 values for the months of the year, that does not make a very good distribution key, because there are only 12 unique values to distribute the data across the cluster. So as a rule of thumb I usually say simply avoid temporal values for distribution. VARCHAR, CHAR, NUMERIC: those are data types where you can actually set the length; keep them as narrow as you can. Redshift does a great job and will store them efficiently; if you declare a VARCHAR(1000) or a VARCHAR(2) it's going to take up the same amount of space. However, there's a little bit of query overhead: we use up more memory in Redshift if you make those really large. So if you're storing an abbreviation that's only a couple of characters, don't declare it as a VARCHAR(1000); declare it as a VARCHAR(2) or 5 or 10 or something small and you'll save memory in your Redshift cluster at query execution time. The last two I did talk about: the ANALYZE COMPRESSION command, definitely run that, and also make sure sort keys are on the columns you primarily filter on.
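To illustrate the calculated-values point, a hedged before-and-after sketch; the fact table and column names are made up for illustration:

-- zone maps can't prune blocks here, because the filter is on a function of the column
SELECT COUNT(*)
FROM fact_events
WHERE DATE_TRUNC('month', event_ts) = '2017-11-01';

-- materialize the derived value during load, then filter on it directly
-- so block-level min/max pruning applies
SELECT COUNT(*)
FROM fact_events
WHERE event_month = '2017-11-01';   -- event_month is a DATE column populated by the ELT process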
So let's move into data ingestion. The first piece we're going to talk about is disks in Redshift. One thing that a lot of people don't realize is that the disks are actually much bigger than what we give you, two and a half to three times the size. That's because the advertised space that we put on the pricing pages and all that stuff is the space that you get to use to write your data; obviously we need a lot of other space. We store mirrored, redundant copies of data, which I'll talk about on another slide, there's an operating system, there's scratch space, there's all sorts of stuff like that, and that's why the disks are significantly larger. The important thing, if you're comparing it to an on-premises system and you have a data warehouse on site, is that's why our disks might seem smaller than they are; they actually are quite large.

Moving on to data redundancy, which I've mentioned: we store two copies of the data in a Redshift cluster. We typically call these the local and the remote, and when a commit happens, or finishes, in Redshift, your data has been safely written to two of the compute nodes, with the exception of temporary tables. Temporary tables are very important to use because they write twice as fast, because they skip that second copy. So if you are doing things that are scratch space in nature and you don't need to persist them, use a temporary table; they write twice as fast. We also back data up asynchronously in Redshift. This kicks off every five gigs of changed data or eight hours, whichever happens first. You can also take a backup at any time, a manual snapshot, which is a good thing to do if you're going to be making some drastic schema changes, for example. You can also disable backups to S3 on a specific table; there are some cases for that, for transient tables that do need to be permanent and can't be temporary tables.

Redshift is an ACID-compliant, fully transactional data warehouse. The isolation level we use in Redshift is serializable; you can try to change it, and it'll say it did, but it won't: every transaction is serializable in Redshift. There are two phases to our commit: there is a local commit, which happens at the slice level, and then there is a global commit, which is basically the coordination of all the slices in the cluster to make sure that everything is finished. Commit statistics can sometimes be important to take a look at; there's a GitHub script for that. And that's because Redshift isn't a transactional database; it's meant for data warehousing, not a high throughput of transactions. The one design consideration, or issue, I'll sometimes see customers have is a workflow that's creating a ton of tables and maybe loading little things, and they don't wrap that workflow in a transaction. What ends up happening, a lot of times, is you'll be running with implicit transactions turned on or something like that in your client tools, and you'll end up with a ton of transactions. It's a really easy fix: just wrap the workflow in a transaction. Chances are you probably want it in a transaction anyway, because if it fails halfway through you don't want it half done. So it's a pretty easy fix and it's probably the right thing to do.
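A minimal sketch of those points, wrapping the workflow in one transaction, using a temporary table for scratch data, and skipping the S3 backup on a transient permanent table; the table and column names are placeholders:

BEGIN;

-- scratch data: temporary tables skip the mirrored second write, so they load about twice as fast
CREATE TEMPORARY TABLE scratch_events (
  event_id BIGINT,
  payload  VARCHAR(256)
);

-- a transient table that must be permanent but doesn't need to be backed up to S3
CREATE TABLE transient_events (
  event_id BIGINT,
  payload  VARCHAR(256)
) BACKUP NO;

-- ... loads and transforms go here, all inside the same transaction ...

COMMIT;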
So let's talk about how you actually get data into a Redshift cluster. The primary way of doing that is with a COPY statement. The COPY statement works primarily against S3; that's how most customers load data. In this example here I have one of our dense compute dc2.8xlarge clusters, it has 16 slices, and we have a single file; we'll just say it's a one-gig file. What would happen if we executed a COPY statement and loaded just that single file into Redshift is that the first slice in the cluster reaches out, grabs that file, downloads it, parses it, and does all the work it needs to to get it to the rest of the slices, and it's not really going to run that fast. If however we took that same one-gig file and split it into 16 chunks, because we have 16 slices, this is going to run 16 times faster, and that's because every single slice in the cluster reached out to S3 in parallel; it's an MPP, shared-nothing architecture, so they all download, parse, distribute, and write that data across all the slices in the cluster. Our recommendation on these files is that ideally they should be one megabyte in size or larger, and no larger than one gig, after gzip compression.

A few more best practices on copying data: I recommend using delimited files. Yes, we support Avro, yes, we support JSON and fixed-width formats, but delimited files are easy to work with and they're also very fast in Redshift; they're pretty much the fastest file type we have to load. I always recommend picking a simple delimiter, pipe or comma or tab or whatever; don't pick some crazy UTF-8 character. I've seen that happen; you can make it work, but it'll make your life a little more difficult. Pick a simple null character, use a carriage return as your end-of-line character, things like that. A few COPY options, if you haven't used Redshift before, and this is for the few of you who put up your hands: MAXERROR can be a useful option if you're just getting POC data in; it'll drop bad records. ACCEPTINVCHARS helps if you have invalid UTF-8 characters in your CSVs. So just a handful of useful options.

So what about Spectrum? Spectrum allows you to create an external table against S3, and you can query that external table with a SELECT statement. You can also do an INSERT INTO with a SELECT FROM those external tables. If you have a data lake, and you have files in your data lake, which will probably be in Parquet or ORC, you can load those directly into Redshift through Spectrum. Another cool thing you can do is aggregate data as it's coming in; you can select a subset of columns or transform the data with scalar functions and various other things, so you get a little more flexibility on the data coming into your cluster if you use Spectrum. Another thing Spectrum does is offload some of the workload out of your Redshift cluster. Typically you want your cluster available for querying and servicing reports and such; if you use Spectrum for this, you have more resources available, because you're leaning on that second, Spectrum layer.
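A hedged sketch of that kind of COPY, pulling together the options mentioned above; the bucket, prefix, and role ARN are placeholders:

-- load gzipped, pipe-delimited chunks that share the prefix 'deep_dive/part_'
COPY deep_dive
FROM 's3://my-example-bucket/deep_dive/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER '|'
GZIP
MAXERROR 10        -- tolerate a handful of bad records while loading POC data
ACCEPTINVCHARS;    -- replace invalid UTF-8 characters instead of failing the load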
Redshift is a petabyte-scale data warehouse; it is designed for large writes. We talked about those one-meg chunks, which I mentioned were immutable. If, for example, you come along and write some insert statement, just a single row, and I do it all the time, I'll be in my client writing test data and I add some row to a table: we don't ever want to fragment the blocks, those one-meg chunks, in Redshift. So what we actually do is pick up the last block in each column, clone it, read it, stuff that new record into the block, rewrite it back to disk, and throw away the block that was at the end of the column; that becomes the new block. If you think about it, that's a lot of work: reading a one-meg chunk and rewriting a one-meg chunk for every single column, for that one row. Redshift can do it, it will work, but it's not what it's optimized for, and that's why I will sometimes say that a small write costs roughly the same as writing a hundred thousand or more rows to Redshift. So Redshift really is designed around batch, bulk loading. Updates and deletes: for a delete we just mark the data as deleted, we don't actually delete it when you execute a DELETE statement; that's why deletes kind of seem fast. Also, Redshift uses MVCC, multi-version concurrency control, and that's why, if you have a transaction running, you will continue to see data exactly how it was when your transaction started. Updates are essentially just a delete and an insert in Redshift.

So let's talk a little bit about a workflow. It's very common, and this is one of the most commonly asked things customers ask me, so I really want to make sure to get this in here: how do you do an upsert, and how does this logic work in Redshift? We have those four rows we've been working with all night, and we have a CSV file sitting on S3 with these four rows in it: we want to update that one, we want to update that one, and we have two more rows that we want to add to the end of the table. How do you do that in Redshift? The workflow is: we load that CSV file into a staging table, then we delete from the production table all of the data that matches the staging table, and then we insert all of the data over into the production table.

So how does that work? The first thing we do is create a transaction; the reason why is that we want to reduce the number of transactions in Redshift, we talked about that, and it is very important. Next, we create the staging table, and it's going to be a temporary table. The LIKE clause, or keyword, pulls over the distribution key from the production table to make the write from the one table to the other faster; it also has the good consequence of pulling over the compression settings from the production table as well, which is really important. That's because when we execute this COPY statement to load the data in, you'll notice I have COMPUPDATE OFF there: we're telling Redshift, hey, don't try to do the right thing in this particular case, which would be figuring out the compression for me; I don't want you to do that. The reason why is that typically this is going to be run repeatedly, over and over again, maybe every 5 to 10 minutes, and we don't want Redshift figuring out the compression every five minutes on the exact same table; that's why we copied it over from the production table. Then we delete from the deep dive table all of the rows that match the staging table, and then we can just safely insert all the rows over, drop the staging table, and commit the transaction. This is the best way to do this logic.
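Putting that workflow into SQL, a hedged sketch; the S3 path, role ARN, and the join key (aid) are assumptions carried over from the earlier sketches:

BEGIN;

-- temp staging table inherits dist key, sort keys, and encodings from production
CREATE TEMPORARY TABLE stage (LIKE deep_dive);

-- load the incoming CSV without re-analyzing compression on every run
COPY stage
FROM 's3://my-example-bucket/incoming/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER ','
COMPUPDATE OFF;

-- remove the production rows that are being replaced
DELETE FROM deep_dive
USING stage
WHERE deep_dive.aid = stage.aid;

-- move everything (updated and brand-new rows) into production
INSERT INTO deep_dive SELECT * FROM stage;

DROP TABLE stage;
COMMIT;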
So to summarize the best practices on ELT: make sure you wrap your workflows in a transaction; explicitly create a transaction. If you are going to be doing deletes, try to use DROP TABLE or TRUNCATE instead; they won't leave behind ghost rows, phantom rows, which need to be vacuumed up, which we'll talk about on the next slide. If you're using staging tables, which at some point I'm sure you will be, try to use a temporary table if you can; they write twice as fast. If you do need to use a permanent table, consider turning off backups so that the data is not being synced to S3. Keep the same dist keys on both tables for faster performance. On compression, set COMPUPDATE to OFF, or make sure your table actually has compression settings baked in. The symptom you will see in your Redshift cluster if you are figuring out the compression over and over again: I have a lot of customers who tell me, yeah, my cluster runs at 80 percent CPU usage all the time, and it's because their cluster is figuring out compression for all these tables over and over again. That's the symptom: really high CPU usage that doesn't really make sense when you just have a couple of COPY statements running. And the last one: make sure you move all the rows over. We also have a command called ALTER TABLE APPEND; it's essentially a move operation. It's a little-known feature in Redshift, not too many people know about it, and it essentially moves data from one table to another. It's a really good operation to use if you're moving large numbers of records; if you're only moving a small number, like low millions, just do an INSERT ... SELECT.

So let's talk about vacuum and analyze. Vacuum serves two functions in Redshift. One is to remove the ghost records, the deleted records, out of Redshift; that's the first thing it does. The second thing it does is globally sort the table. When a COPY, or any INSERT ... SELECT statement for that matter, executes in Redshift and that batch of new rows comes in, we sort that batch of incoming rows locally, within the batch, and it's written to the end of the table, but that doesn't mean the table is globally sorted. That's when vacuum needs to run. That's also why some customers don't really need to vacuum: if you have a single sort key and it happens to be a timestamp, for example, and you're loading data essentially in sorted order, you probably don't need to run vacuum, because we're locally sorting and the table ends up being globally sorted as a result. The next piece is the ANALYZE command, also very important. What it's for is collecting statistics on the tables in your Redshift cluster; this is for the query planner so it can pick optimal plans. I'd say most customers run vacuum and analyze nightly; that's a typical thing. It really only needs to be run as frequently as it needs to be, but nightly is pretty common, especially if you have a little bit of downtime in the evening; some customers only run vacuum weekly, for example. And we do have a really good utility on GitHub that a lot of customers use; it's essentially a Python script and it will vacuum and analyze your tables.
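A hedged sketch of those maintenance commands against the hypothetical tables used above; note that ALTER TABLE APPEND works between permanent tables with matching column definitions:

-- reclaim deleted (ghost) rows and restore the global sort order
VACUUM FULL deep_dive;

-- refresh planner statistics for the whole table...
ANALYZE deep_dive;

-- ...or only for the columns your predicates are on
ANALYZE deep_dive (dt, loc);

-- move rows wholesale from a permanent staging table into the target without rewriting them
ALTER TABLE deep_dive APPEND FROM perm_stage;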
So that's usually what I recommend doing: grab that utility and vacuum and analyze on some sort of cadence that makes sense. If you have a table that's changing very frequently and maybe your query plans aren't always coming out the way you want, one thing we will recommend is to run the ANALYZE command just on the columns that your predicates are on. So if you're filtering on a column, you can run ANALYZE all the way down to just a subset of columns, and for some customers we'll end up recommending that they put that into their load cycle. So if you're loading on, say, a five-minute cadence and the data is continually changing and thrashing about, run ANALYZE after that load, just on those columns.

So let's move into node types and cluster sizing. We have two different node types in Redshift: our Dense Compute (DC2) and our Dense Storage (DS2). Technically there was a first generation of both of those, but we are on to the new generation of both. The main difference between these two platforms is that one is solid state and one is magnetic disk, so they have slightly different pricing profiles; the dense compute nodes obviously perform quicker, and like I mentioned, DC2 we actually just released about two months ago, so it's a new platform.

On cluster sizing: production workloads should ideally be run on a multi-node cluster, two nodes or more. We do allow what we call single-node clusters in Redshift; those are fantastic for POCs, playing around, dev work, QA work, all that sort of stuff. I run a single-node cluster myself, but if you have a production workload, use a multi-node cluster, because you get that mirrored copy of the data for safety. The other thing is that we give you the leader node at no additional cost, so as soon as you hit a two-node cluster you technically get a third node for no extra cost, and that extra node actually does do some of the work: all the query coordination and final aggregation and such. The other thing, and I really hope there's no one here that has this, but if you have a Redshift cluster that is still running in EC2-Classic, please move it out into a VPC. There is no downside to doing this, and we typically see around a 2x across-the-board improvement on queries just by moving from EC2-Classic to a VPC. One customer I had, they had their nightly ETL workload, and they called and we were talking, and they said, yeah, our nightly load is now a 17-and-a-half-hour nightly load, which was unacceptable because it was rolling into their daytime business hours and querying. We looked at their cluster and I said, you know you're running in EC2-Classic, can you switch out to a VPC? With just that change alone, their nightly load went from 17 and a half hours down to 4. So if you happen to be a straggler running on EC2-Classic, definitely move to a VPC.

Now the sizing piece in Redshift: when I first have a conversation with a customer and they ask how many nodes they should use, we look at how much data they have, usually uncompressed. We assume about a three-times compression ratio; that's a safe rule of thumb, or generalization, if you're applying compression correctly. Usually customers will know whether they're going solid state or magnetic disk, and we size it for the amount of data that they have.
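As a purely illustrative, hedged example of that rule of thumb (the figures are made up, not from the talk), combined with the free-space guidance that follows:

12 TB uncompressed / 3 (assumed compression ratio)  ≈ 4 TB stored in the cluster
4 TB / (1 - 0.20 free space)                        ≈ 5 TB of usable capacity to provision

You would then map that usable capacity onto a DC2 or DS2 node count using the advertised per-node storage.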
You also want to maintain at least 20 percent free space, or two and a half to three times the size of the largest table in your cluster. For example, if the largest table in the cluster is going to take up 10 percent of the cluster, you leave about 30 percent free space, and that's so that that table can be properly vacuumed.

The next piece is DC1 to DC2. Like I said, we just recently released DC2. Most customers, especially if you're running on the 8XL, are seeing around a 2x performance increase, and it's the same price as DC1. So if you're not on RIs (Reserved Instances) you can move over very easily; just do a simple resize, it's a few clicks in the console: move from DC1 to DC2, get the performance boost, there's no extra cost. If you have RIs, please contact us; we will work with you, there are things we can do if you're looking to migrate your RIs over from DC1 to DC2. It's not a straight-across migration, but there are things we can do, so reach out to me after the presentation or contact support. The last point is Spectrum queries. If you're doing a POC and you're working with Spectrum, and this one comes up a bit, customers assume that Spectrum performance will be exactly the same on a single-node cluster, a little tiny dc1.large, as it will be on a large cluster. It won't be. Performance in Spectrum scales depending on how large your Redshift cluster is: we will provision up to 10 Spectrum nodes per slice in your Redshift cluster for each query, so there is a correlation between the performance you get in Spectrum and the size of your Redshift cluster.

So, additional resources. I've talked a lot about these GitHub scripts; all of this stuff is on GitHub, and I'll just walk through them quickly. The admin scripts: those are scripts from our solutions architect team, the team that I'm on, and our database engineering team, which is part of the Redshift team; they're the ones that you may interact with. These are scripts that we've all found really useful when we're working with customers, diagnosing issues and that sort of thing, and we put all of those on GitHub. The same with the admin views; they're very similar to the admin scripts, but they end up being views. I've talked about the vacuum utility and the column encoding utility; those are also on GitHub, you saw the links earlier, so you'll find those there as well. I highly recommend checking those out. A couple of blog posts worth calling out; I think these are the three top blog posts. The top one there was written by one of our database engineers; it's a series of five blog posts that goes through table design and how things like sort keys and distribution keys work. I highly recommend checking that one out; if you understand everything that he's written, you should be in good shape with Redshift. And then the top ten performance tuning techniques: the theme for that was, what are the top ten things that we see customers get hung up on when configuring a Redshift cluster, so that one is totally worth checking out. And then the same kind of theme on the Spectrum one. I think we have about ten minutes left, so we can do Q&A.
Info
Channel: Amazon Web Services
Views: 19,022
Rating: 4.9583335 out of 5
Keywords: AWS re:Invent 2017, Amazon, Analytics & Big Data, ABD304, Big Data
Id: Q_K3qH5OYaM
Length: 49min 44sec (2984 seconds)
Published: Tue Nov 28 2017