AWS re:Invent 2019: Deep dive and best practices for Amazon Redshift (ANT418)

Video Statistics and Information

Captions
All right, let's get started. My name is Tony Gibbs, I'm a database specialist SA here with Amazon Web Services, and I'm here with Harshita Patel. We're going to take you through a deep dive and best practices session on Redshift, so thank you very much for coming; I know it was a last-minute addition to the catalog. Before we get started, I always like to see where you're all at. How many people here are using Redshift today, day in and day out? Okay, a good chunk of you. How many of you aren't using Redshift but maybe are starting a proof of concept or some sort of bake-off, and are considering Redshift going forward? Okay, that's almost the other half. How many people here don't use Redshift at all, don't really have any intention to, and just came to learn a little more about it? A few, but not very many.

This session is designed primarily for people who use Redshift; it's a 400-level session. That said, the architecture and concepts portion is a bit of a review, so that section will be really good for those of you who aren't using Redshift yet, are doing a bake-off or POC, or just want to learn more. After that we'll jump deeper into data ingestion, ELT, and that sort of thing. I'll then hand it off to Harshita, who will walk you through workload management and cluster sizing, touch on a recently released feature called Amazon Redshift Advisor, and close out with some extra learnings. I don't anticipate a lot of time for open Q&A; if there is, we'll do it here, and if not we'll do it in the hallway. There's a nice big space out there and we'll answer questions for as long as you have them. I think last year I was kept around for almost an hour and a half, so if you want, we'll answer as many questions as you have.

So let's get started. Redshift really started life from Postgres. If you connect to Redshift, depending on the client you're using, you might even notice a Postgres connection string come back. We entirely rewrote the back end: it's entirely columnar, so the storage engine is nothing like Postgres, and it's MPP, so Redshift has the ability to scale out horizontally up to 128 compute nodes, or 8.2 petabytes of storage. We added a lot of OLAP functions: linear regression, window functions, approximate functions. We also recently added geospatial support, with over 40 new functions and a new geometry data type, if you have geospatial needs. We wrapped everything up in AWS integration with KMS, S3, and IAM, and it's the combination of all of these pieces that makes Redshift what it is.

We launched Redshift on Valentine's Day 2013; it was originally announced at re:Invent 2012, and since that time we've continued to innovate. We typically roll out a patch every two weeks. You just set a 30-minute maintenance window, we roll it out, and we take care of all the maintenance of both Redshift and the operating system underneath it; it's a fully managed system. The last 18 months have been particularly exciting, with more and more features being added. I won't talk too much about them (there's a what's-new session tomorrow that I recommend checking out), but in the last 18 months alone we have rolled out more than 200 new features and enhancements to Redshift.

So, Redshift is an MPP, shared-nothing, columnar architecture. I'll start at that top green box in the diagram, the SQL clients and BI tools: that's whatever tool you happen to like to use to connect to Redshift.
You connect to Redshift with either the JDBC or ODBC drivers that AWS supplies, or, if you're using say .NET or Python or Ruby or some other development language or tool that needs a different driver, you can connect with the open-source Postgres drivers as well; Redshift still works with the latest open-source Postgres drivers. What you connect to is that one blue box at the top labeled leader node. It's a query coordinator: it holds a lot of metadata and performs the final steps in query processing, such as final aggregations, ORDER BYs, and that sort of thing. Behind the leader node are between 2 and 128 compute nodes; in this example I have three. When data is stored in Redshift, in theory it should be stored evenly across all of those nodes, and every single node in the cluster executes the query against the data that resides on it. Each node then passes its results up to the leader node, which stitches things together and gives you back the answer.

Underneath the compute nodes I have S3. You typically load data into Redshift through S3, and you can unload data out of Redshift to it as well. Backups and restores also work through S3; those are completely seamless and taken care of for you. Backups happen automatically in the background, a restore is just a few clicks, and you can launch an additional cluster off of a backup. Then, a little over two years ago, we launched something we call Spectrum. Spectrum is a layer of compute that sits between your cluster and S3 and is provisioned dynamically at query time. What it does is expose your raw data on S3, hopefully sitting in a data lake in formats like Parquet, ORC, CSV, even JSON, as an external table in Redshift. You can then query that external table with the same SQL syntax you use for any other table in Redshift, and you can even join the data in S3 to local tables stored in Redshift.

But the Redshift architecture is evolving. All of the same pieces I just talked about are still there, but yesterday we announced Redshift managed storage. This effectively separates storage and compute and allows you to provision each of them independently. You purchase the number of compute nodes you want, each compute node can scale its storage up to 64 terabytes, and it's completely taken care of for you: it's a mixture of SSDs and S3, and Redshift moves blocks back and forth at the block level based on whatever makes sense for your query patterns. This is how Redshift can now scale to over 8 petabytes of raw storage. We also announced AQUA in the keynote yesterday. AQUA is essentially AWS-designed analytics processors that sit close to that storage, in the Redshift managed storage layer, and take care of things like encryption, decompression, filtering, and some aggregates. So it's hardware-accelerated silicon that can speed up those operations and also limit the amount of data movement across the network with Redshift managed storage.

Previously we had two compute node types: DC2, our dense compute nodes, which are SSD-backed with heavy amounts of compute, and our dense storage node type, which uses magnetic disks. Customers had to pick between those two depending on their storage and compute needs.
We now have the new RA3 instance, and that's what changes things a bit, because now you can pick the number of compute nodes you want, store whatever data you need, and let Redshift scale the storage out automatically behind the scenes. We launched the ra3.16xlarge yesterday and it is GA; the ra3.4xlarge, a smaller version of it, will be coming sometime next year.

So let's jump into columnar storage. This is a pretty basic concept, but I'll cover it because some people may not know what it is. In a data warehouse like Redshift, where you're typically running analytics queries, you're usually selecting only a subset of the columns in a table, and you want to perform operations across many millions of records, or in certain cases even billions. A columnar architecture helps us reduce I/O and achieve better performance. To illustrate how this works, say we have a very simple table, call it deep dive, with a handful of sample rows on the right, and I have a very simple SQL query that just selects the minimum date out of the table. In a row-based database, say Postgres, I would need to scan through every record in the table to find the answer, assuming I don't have any help from an index. In a columnar data warehouse like Redshift, I can just read the data in that date column and get the answer back. That's one of the ways Redshift achieves the performance it does.

The next piece is compression. We typically want to compress data in Redshift, and Redshift will do its best to apply compression automatically. The reason we do this is, first, it allows you to store more data in your cluster, which reduces your cost, and second, in almost every case it also increases query performance, because you're reducing the amount of I/O you need to do. As an example, take the same deep dive table and the same sample rows, with one of 13 different encoding, or compression, types (I'll use the two terms somewhat interchangeably) and the blocks shown for that table. If I modify the DDL for this table (you can see the ENCODE clauses in red), each column shrinks independently from the others. The compression is applied column by column rather than across the whole table, and we're able to do that because it's a columnar storage engine.

A couple of months ago we released a new compression type that we call AZ64. It's an encoding type we built in-house, and we set out with a couple of goals: the main one was to increase the compression ratio, and we also wanted to increase performance, so these were pretty lofty goals. The numbers down below are the performance we've achieved over the two most common compression types used in Redshift, LZO and Zstandard, which is what most customers were using across the board. AZ64 is able to beat both of them, especially on the performance side, and it matches Zstandard on compression footprint. It works for all of the integer-style data types: integers big and small, dates, timestamps, and so on. The nice thing about having this new encoding type is that I can pretty much give you a blanket recommendation now: if you have one of those data types, you almost always want to use AZ64.
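As a minimal sketch (the column names and types here are assumptions for illustration, not the exact DDL from the slides), per-column encodings and a quick way to inspect them look like this:

    -- Hypothetical deep dive table with explicit column encodings
    CREATE TABLE deep_dive (
        aid BIGINT  ENCODE AZ64,   -- integer-style types (ints, dates, timestamps): AZ64
        loc CHAR(3) ENCODE ZSTD,   -- character columns: LZO or ZSTD
        dt  DATE    ENCODE AZ64
    );

    -- Check which encodings a table currently uses
    SELECT "column", type, encoding
    FROM pg_table_def
    WHERE tablename = 'deep_dive';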
For varchars and chars today, you'll probably want to stick with either LZO or Zstandard. So the encoding choices are pretty settled now. If you do want to find the absolute smallest footprint, to squeeze a table into the least amount of storage space, the ANALYZE COMPRESSION utility built into Redshift will analyze your table: it picks up a sample of the table, compresses it with every encoding type we have, and gives you back the smallest. I say smallest, not most performant: it does not factor in performance at all, it just gives you the smallest. And the little SQL snippet on the slide is how you can check which encoding types you currently have on a table.

So let me introduce you to blocks. Blocks in Redshift are immutable, meaning we never go back and change an existing block. They're also very large, one megabyte, and they're encoded with one of those 13 encodings. I have sometimes seen blocks containing millions of values; as a rough number, a single block usually holds about 300,000 or so records, but in certain cases, if you get really good compression, it can be in the millions.

Next, zone maps. Zone maps are an in-memory data structure in Redshift; they're essentially the minimum and maximum values for each block. Redshift automatically keeps track of these, so you don't have to do anything; this is more so you're aware of how Redshift works. We use these zone maps to filter out which blocks we read off disk at query time; that's their purpose.

The next piece of terminology is data sorting. Data sorting is meant to optimize the zone maps; that's what 90% of customers will use data sorting for. You want to make the zone maps more effective and reduce I/O further. In most cases in Redshift the sort key will be on some sort of temporal column, usually a timestamp or a date, something you're frequently filtering on. This is data warehousing, so there's almost always a timestamp in your fact table. A simple example of how data sorting works: the same deep dive table, the same four rows. If I manually modify the DDL for this table and apply a sort key that sorts the table first by the date, then by the location, we end up sorting first by the date; then we have a tie, so it goes over to the location, which puts JFK before SFO.

Tying all of these concepts together (data sorting, blocks, and zone maps), say we have four blocks, with the zone maps, the minimum and maximum values, printed out beside them, and a really simple SQL query that just counts the number of records on a particular date. What Redshift does is check the zone maps, and it knows that it doesn't need to read data out of one of those blocks, so we've reduced I/O. Now, if I take the same four blocks and sort the data in them, I might end up with zone maps that go in sequential order. If I run the same query now, I know the data for this count falls in only one block, and I've reduced I/O even further. That is the main purpose of a sort key in Redshift.
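As a sketch, on the same hypothetical deep dive table, that compound sort key would be declared like this:

    -- Sort first by date, then by location
    CREATE TABLE deep_dive (
        aid BIGINT,
        loc CHAR(3),
        dt  DATE
    )
    COMPOUND SORTKEY (dt, loc);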
A bit of a summary on sort key best practices. It's usually a temporal value. If you frequently filter on something else, say a location ID or an office ID, a really low-cardinality value, then the lower-cardinality values go first, so that might be a case where you place that column ahead of the timestamp. If you have a high-cardinality leading value, say a timestamp that goes all the way down to seconds, it does not make sense to add additional columns to your sort key after it. I also usually recommend keeping the number of columns in a sort key somewhere between one and three; once it reaches four, five, or six it's all diminishing returns, and it means the sorting takes longer. We have a couple of scripts on GitHub that can give recommendations for sort keys if you have an established workload.

Let's talk a little bit about materializing columns, now that you understand zone maps. I usually see this come up in migrations from existing legacy systems. In a lot of legacy systems, especially those coming from row-based databases, we typically normalized our tables very heavily. Redshift being columnar, and with zone maps working the way they do, it actually makes a lot of sense to denormalize quite a bit. In the example in the middle of the slide there are two SQL queries. In the first, I have a date dimension table, which probably only has ten thousand or so records, being joined to the fact table. What's happening is that I'm running the filter on that dimension table, then taking the values and probably doing a hash join back to my fact table, and I'm not able to leverage the zone maps very effectively. In the second query, I've taken my dimension table and materialized the values into my fact table, and now Redshift is able to fully leverage the zone maps and reduce I/O. That's something I recommend a lot when I'm helping customers with migrations.

The second example applies to pretty much every database: if you're doing calculations on a column (in this case I'm extracting the epoch out of a timestamp column), then instead simply materialize that value and write it out into its own column. That will also greatly reduce work, especially if you're moving from legacy systems.
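Here's a rough illustration of both patterns, with made-up table and column names; the point is simply that a filter on a value stored directly in the fact table can use that column's zone maps:

    -- Filter lands on the dimension table; the fact table's zone maps don't help much
    SELECT COUNT(*)
    FROM fact_sales f
    JOIN dim_calendar c ON f.calendar_id = c.calendar_id
    WHERE c.calendar_date BETWEEN '2019-12-01' AND '2019-12-07';

    -- Materialize the date onto the fact table; zone maps on sale_date prune blocks directly
    SELECT COUNT(*)
    FROM fact_sales
    WHERE sale_date BETWEEN '2019-12-01' AND '2019-12-07';

    -- The same idea applies to derived values: populate an epoch column at load time
    -- instead of computing EXTRACT(EPOCH FROM ...) in every query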
This next concept is really important in Redshift: slices. Slices are how we get parallelism within each compute node. The easiest way to think about them is as virtual compute nodes: every compute node is divided up into either two or sixteen slices, depending on its size. When you write data out to Redshift, it's essentially spread, sharded, across all of these slices, and that's how we're able to spread the data out and do everything in parallel.

So let's talk about how the data actually gets spread across the slices. We have three, kind of four, ways of doing that. The first is distribution style KEY. You put a distribution key on one of the columns, and for every row in the table we take the value in that column, hash it, run a modulo by the number of slices in the cluster, and that's where the entire row goes; I have an example of that in a moment. DISTSTYLE ALL is kind of a special case: it makes a complete copy of the table on every single node in the cluster. It's typically used for really small tables, dimension tables, and we define small as roughly three million records or less. The next one is DISTSTYLE EVEN, which is kind of like saying, "Redshift, I'm not really sure what to do; just round-robin the data across the cluster for me," and that's exactly what it does. DISTSTYLE AUTO, today, combines EVEN and ALL: if you have a small table it starts out life as DISTSTYLE ALL, and as the table grows and reaches a certain threshold it automatically converts into an EVEN table.

To illustrate these, say we have the same deep dive table with the same sample rows, and underneath I have two compute nodes, each with two slices, and I'm going to distribute those four records with DISTSTYLE EVEN. Like I mentioned, it just round-robins the data, one slice to the next; really simple. Now DISTSTYLE KEY: I'll pick the location column, with the values SFO, JFK, SFO, JFK. We pick up that first SFO and maybe it hashes out to one slice, and JFK maybe hashes to another; the second SFO goes back to that same slice zero, and the second JFK goes to that same slice one, because equal values always hash to the same place. This is an example of a poor distribution key, because if I execute a query against this cluster, it only executes on node 1 and node 2 does none of the work. We can do better: I'll pick the aid column, which looks a bit like a primary key. Distributing by that, 1 might go to slice 1, 2 to slice 2, 3 to slice 0, and 4 over to slice 3. I rigged the example so it works out perfectly, of course, but if you have millions or billions of records and a high-cardinality value like a primary key, the data will be spread mostly evenly across the cluster. And what about DISTSTYLE ALL? Like I mentioned, we write every row to both nodes in the cluster; in this case it lands on the first slice of each node.

Summarizing: DISTKEY is typically used for joins. If we have two large tables, we want to distribute both tables on the columns in the ON clause of your SQL statement; that co-locates the data onto the same slices and makes the join really fast, and we want these co-located joins. The other common use case, which comes up a lot in ETL, is inserting data from one table into another with an INSERT ... SELECT; if both tables have the same distribution key, that operation is much faster. DISTSTYLE ALL is again primarily used for joins, but with small tables, and there's a bit of nuance in that, for small tables, it can in certain cases actually reduce your disk usage as well.
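As a minimal sketch with hypothetical tables, the three explicit distribution styles look like this in DDL:

    -- KEY: hash-distribute on a join column so matching rows co-locate on the same slices
    CREATE TABLE deep_dive (aid BIGINT, loc CHAR(3), dt DATE)
    DISTSTYLE KEY DISTKEY (aid);

    -- ALL: copy the whole table to every node (small dimension tables, roughly 3 million rows or fewer)
    CREATE TABLE dim_location (loc CHAR(3), loc_name VARCHAR(64))
    DISTSTYLE ALL;

    -- EVEN: round-robin rows across slices when there's no good join key
    CREATE TABLE click_staging (event_id BIGINT, payload VARCHAR(256))
    DISTSTYLE EVEN;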
So if you have these small tables, which we usually define as three million rows or less, it makes sense for them to be DISTSTYLE ALL. EVEN, like I said, just spreads the data evenly across the cluster, and AUTO, which is the default in Redshift, is a combination of EVEN and ALL. The summary slide repeats a lot of what I've just said; it's mostly in here because these slides will all end up online. One extra point I put in there: distribution keys on temporal columns are something I pretty much never recommend, even when they don't cause skew and all the data is evenly spread. Say you pick something like a date (one value per day) and you have seven years of data spread evenly across the cluster. You might run a SQL query whose predicate says "just this one day," and now only a single slice in your cluster executes against that day. It's a form of in-flight query skew that happens at runtime, so in pretty much all cases I steer people away from distributing on anything temporal. Other than that, the rest of the slide is review.

So let's move into data ingestion. First, redundancy. Redshift uses a two-phase commit, which I'll talk about on the next slide, but when a commit happens, the global commit, all of the data is written to at least two places in the cluster. We do this for redundancy and safety. We also back all the data up to S3 asynchronously; that happens automatically, today every 5 gigabytes of data or eight hours, whichever comes first. The important reason to understand these multiple copies of the data is that there's an exception: temporary tables. Temporary tables only write the data once; there's no second copy of them, and that has an important impact on performance. They're around twice as fast to write to as permanent tables.

Let's talk about transactions. As I mentioned, there's a two-phase commit: what we call a local and a global commit. Redshift is ACID-compliant and fully transactional, and because it's MPP we only implement the serializable isolation level. You can ask for read committed or repeatable read and it will say, sure, okay, but you'll still get serializable isolation. The thing to be aware of is that commits are a bit expensive; Redshift is a data warehouse, not an OLTP database. Sometimes customers don't realize that DDL is transactional in Redshift, so if you have workflows where you're modifying a whole bunch of tables, maybe creating a table, renaming a table, swapping things around, each of those statements implicitly creates its own transaction if you don't explicitly wrap the workflow in one. So it's a best practice in Redshift, particularly in ETL, to wrap the whole workflow up in a transaction.
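As an illustration of that best practice, here's a hypothetical rebuild-and-swap workflow wrapped in a single explicit transaction instead of letting each statement commit on its own (table names are made up):

    BEGIN;
    CREATE TABLE deep_dive_new (LIKE deep_dive);   -- DDL participates in the transaction
    INSERT INTO deep_dive_new SELECT * FROM deep_dive WHERE dt >= '2019-01-01';
    ALTER TABLE deep_dive     RENAME TO deep_dive_old;
    ALTER TABLE deep_dive_new RENAME TO deep_dive;
    DROP TABLE deep_dive_old;
    COMMIT;                                        -- one global commit for the whole workflow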
So let's talk about how you get data into Redshift. There are a couple of different ways; the primary way, the way everyone loads data, is the COPY statement. In this particular example I have an ra3.16xlarge, which isn't technically a valid cluster configuration, but we'll go with it. Its 16 slices are represented up there, I have one file on S3, and I execute a COPY statement against it. What happens is that Redshift pulls that one file up and then distributes and spreads it out across the cluster, and that's not really all that efficient. If instead we break that single file up into 16 pieces, all 16 slices in the cluster reach out to S3 and pull the data, and this runs 16 times faster than the previous example. So ideally you want as many files as you have slices in your cluster, or a multiple of that; 32 files that are all the same size would work just as well. Our rule of thumb is that files in S3 should be somewhere between 1 MB and 1 GB after compression, and most customers compress the data with gzip.

These are a handful of recommendations I've put together over the years from what I've seen customers run into. Keep your files simple. I always recommend delimited files: tab-delimited, comma, pipe, whatever it happens to be, but keep things simple. Don't pick crazy UTF-8 or non-printable characters as delimiters; it makes life painful. Pick a simple character for NULL. You can always wrap strings in double quotes, and escaping and all of that kind of thing works in Redshift. And if you want to know how many slices your cluster has, a simple query will return the number of slices.
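A sketch of both pieces, with a hypothetical bucket, prefix, and IAM role: the COPY picks up every gzipped part under the prefix so all the slices have work to do, and the system-table query returns the slice count.

    -- Load all files matching the prefix, e.g. part_00.gz through part_15.gz
    COPY deep_dive
    FROM 's3://my-bucket/deep_dive/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER '|'
    NULL AS '\N'
    GZIP;

    -- How many slices does this cluster have?
    SELECT COUNT(*) FROM stv_slices;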
Another way to ingest data into Redshift is through Spectrum. I mentioned Spectrum earlier and how it's typically used for external tables, but you can also create an external table and do an INSERT from that external table into a local table. This is really nice if you want to aggregate data off of S3 as part of your ingestion step, or select only a subset of the columns, or do some row filtering, or if you happen to have data in a format that the COPY statement doesn't support.

Now let's talk about how Redshift is designed around big data. Small writes, which are typical in OLTP, are about as expensive in Redshift as ingesting hundreds of thousands, or in some cases, depending on the size of your cluster, millions of records. Redshift is not designed for small updates and inserts, and the reason is those really large one-megabyte immutable blocks. If you want to do an update in Redshift, even of a single record, we have to find that record first, which isn't a big deal, then go to the blocks that hold that row, read the values out of them, mark that record for deletion, and append the new version to the end of the table. It's an expensive operation; it technically works, but it's not something you want to do repeatedly. If you need to do it once in a while, it's totally fine. Deletes in Redshift, on the other hand, are actually quite fast: they simply mark records for deletion. We now have auto vacuum delete, which automatically takes care of deleted records, so if you are deleting a lot of rows the auto vacuum process will come along and clean them up. There are some other best practices around vacuum, which I'll cover at the end of the section.

Now, a very frequent ask I get from customers: deduplication, or upserts. How do you do that? I have the same four records in the deep dive table that I've been using all session, and a CSV file sitting on S3 somewhere. I want to update two of those records and add two new records to the end of the table. How do you do that with a COPY statement, which is append-only? The workflow is: load the CSV into a staging table with a COPY, delete the duplicate records out of the production table, and then insert everything from the staging table into the production table. The SQL goes like this. I start a transaction (remember what I said about wrapping workflows in a transaction to reduce the number of commits), so it begins with a BEGIN statement. I create a temporary table, because temporary tables are twice as fast, which makes sense for a staging table, and I use the LIKE keyword, because that copies over the distribution key from my prod table, and it also copies over the compression settings, which is nice; the staging table now has the same distribution key as the prod table. I then COPY the data into the staging table, DELETE the duplicates by joining on the aid column to remove those two records, INSERT all of the records from staging, DROP the staging table, and COMMIT the transaction. This workflow, or variations of it, is the optimal way to do this upsert/dedupe logic.

To summarize the staging-table guidance: use temporary tables if you can. If you can't, there's a BACKUP NO option; all it does is disable the asynchronous backup to S3, so you can turn that off. Like I mentioned, keep the same distribution keys; creating the table with the LIKE keyword copies over distribution keys and compression settings. There's also a command in the documentation called ALTER TABLE APPEND. If you need to move very large volumes of records, hundreds of millions or even billions, in your workflow, then instead of an INSERT INTO with a SELECT you can use ALTER TABLE APPEND. It's a DDL command that takes the blocks in the staging table and appends them onto the columns of your prod table, so it's a metadata operation and it lets you move data almost instantly. It's a really useful command if you're doing operations like that on very large sets of data.
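Pulling the upsert workflow above together into one sketch (hypothetical object names, assuming the same aid key):

    BEGIN;

    -- Temporary staging table: written only once (about twice as fast), and LIKE copies
    -- the distribution key and compression settings from the production table
    CREATE TEMPORARY TABLE staging (LIKE deep_dive);

    COPY staging
    FROM 's3://my-bucket/deep_dive/increment_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER ',' GZIP;

    -- Remove the rows being replaced, then append everything from staging
    DELETE FROM deep_dive
    USING staging
    WHERE deep_dive.aid = staging.aid;

    INSERT INTO deep_dive SELECT * FROM staging;

    DROP TABLE staging;
    COMMIT;

    -- For very large volumes, with a permanent staging table, ALTER TABLE APPEND moves the
    -- blocks as a metadata operation instead of rewriting them (it can't run inside a
    -- transaction block):
    -- ALTER TABLE deep_dive APPEND FROM staging_perm;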
Let me talk a little bit about vacuum. Vacuum does two things in Redshift: it removes the deleted records, and it globally sorts the table. Auto vacuum delete has been out for some time now, so deleted records should just be taken care of automatically. There's also what we call auto table sort. It's not quite the same as vacuum, but it also runs in the background: it looks at the hot portions of the table that you're typically querying, and if those sections aren't sorted, it sorts them; it's a bit like a lightweight auto vacuum sort. If you do need to do a manual vacuum, and you're able to run it in the middle of the night or some off-hours window, there's VACUUM with the BOOST option. With boost, the vacuum runs significantly faster than without it, but you won't really be able to do much else on the cluster, like running other queries, so consider tacking that extra keyword into your workflow when you can afford it. The last step is ANALYZE, which is largely taken care of for you by auto analyze. In certain ETL cases it can make sense to analyze a specific column if you're frequently filtering on it. There's also a utility on GitHub that manually vacuums and analyzes all the tables in the cluster; I think it's pretty redundant now that we have auto vacuum, auto analyze, and auto table sort.
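For reference, a minimal sketch of those manual commands against the hypothetical deep dive table:

    -- Manual vacuum with BOOST: faster, but expect little spare capacity while it runs
    VACUUM FULL deep_dive BOOST;

    -- Refresh statistics for just the column you filter on most
    ANALYZE deep_dive (dt);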
And with that, I'm going to hand it off to Harshita.

Hello. Now we'll dive into workload management, WLM. Picture an environment with mixed workloads: one workload loading data and performing ETL and ELT, coexisting with dashboard and reporting queries being run by your MicroStrategy and Tableau users, and in the mix your ad hoc workload run by your data science or data analyst teams. The purpose of workload management is to provide resource separation between these different workloads. Say that during a given business day you want to give priority to your dashboard and reporting queries and throttle down your ELT workload, which is lower priority; that is the purpose of workload management in Redshift. What we have are queues: when you execute a query in Redshift, it executes in a queue based on the user or user group you belong to, or on a query group you set in the session, and that is what determines which queue the query runs in. A queue is divided into slots, each slot gets a percentage of memory, and the number of slots determines how many concurrent queries can run in that queue. We also have a concept called short query acceleration (SQA), which you can enable or disable; I recommend enabling it. The way the feature works is that when queries start queuing, Redshift detects that a particular queued query is going to execute in a few seconds and tags it as a short query. Instead of letting short queries wait in the queue, it routes them to a special short query queue, so your short queries continue to run even when the system is fully loaded.

Concurrency scaling is a significant feature, released on March 27th of this year, that allows you to handle your spiky workloads. Let's look at how it works, using the dashboard workload as the example. We're submitting queries to Redshift, and as soon as Redshift detects that queuing is occurring, it takes an automatic incremental snapshot, which is very quick, pushes it down to S3, spins up a transient secondary cluster, and routes the queued queries to that secondary cluster, shifting the workload from your main cluster to the transient cluster. As you submit more and more queries, it can spin up additional concurrent clusters. From detecting that queries are being queued, to spinning up transient clusters, to routing queries, to relinquishing those transient clusters when the queue depth goes down: all of this is fully managed for you by the service. From your perspective, you choose whether to turn concurrency scaling on or off, and when you turn it on you choose how many transient clusters you want Redshift to spin up, from one to ten, and ten is a soft limit, not a hard limit. For every 24 hours of usage of your main cluster, you accrue one hour of concurrency scaling capacity for free. So if your cluster is up and running for 30 days in a month, you accrue a total of 30 free hours for that month. We believe that 97% of customers who turn on concurrency scaling will be able to use this feature completely for free, so leverage the free capacity for your spiky workloads.

Let's marry these concepts with WLM using a very common scenario we've seen with customers: there is a need to continuously ingest data throughout the day; during business hours, reports and dashboards are running, with peak hours within the day, say a spike in dashboard queries being submitted in the morning or afternoon; and the nightly batch is where the heavy ingestion and ETL workload runs. Let's map out how we would configure WLM for this use case. We start by creating different queues (the font size is small here), and we always ensure short query acceleration is turned on, so short queries continue to run even when the cluster is fully loaded. We add an ingestion queue and allocate it two slots, meaning two concurrent queries can run, with 20% of memory, so each query gets 10% of the memory. We add a queue for the dashboards; here we increase the concurrency to 10 and give it a memory allocation of 50%, so each query gets 5% of the memory. The dashboard queries are typically expected to run in under two minutes, so you can also set a query timeout, and because this is a spiky workload we turn on concurrency scaling so that Redshift handles a sudden spike in the workload for you. Finally, there's the default queue, which is the catch-all for the ad hoc workload; the ad hoc queries are data-intensive, so we set it to a concurrency of three with 30% of memory, so each query gets 10%. We also have a special superuser queue, which is not visible on the console. If a person is an admin, they can set that query group in their session, and the queries they submit are routed to the superuser queue. Its purpose is to run administrative tasks, say dropping a table or cancelling a query, and I would recommend not using the superuser queue for long-running queries.
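As a small sketch, this is how an admin routes a statement to that queue from their session (the process ID is hypothetical):

    -- Route the following statements from this session to the superuser queue
    SET query_group TO 'superuser';
    CANCEL 12345;          -- e.g. cancel a runaway query by its process ID
    RESET query_group;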
Now, the dynamic attributes of workload management. What we just looked at was a configuration for a given business day. Certain attributes in WLM, for example the number of query slots, the percentage of memory, and whether concurrency scaling is on or off, are dynamic in nature. That means you can change them with a simple API call; it does not require a cluster restart. Let's take exactly the same example, but toggled for a nightly window where we want to give higher priority to the ingestion workload and throttle down the dashboard and ad hoc workloads. For the ingestion queue we set the concurrency to 5 and the memory allocation to 80%, shifting resources to the ingestion workload, which is the higher priority during the nightly batch, and we throttle down the concurrency and memory for the dashboard and ad hoc queues.

What we've talked about so far is manual WLM. The goal of automatic WLM is the same, to allow you to prioritize your workloads, but the main objective of auto WLM is to simplify the configuration: Redshift automatically manages the memory allocation and the number of concurrent queries that run. This is a screenshot of the default setup of auto WLM, where both memory and concurrency on the main queue are set to auto. You're not configuring percentages or providing slots; Redshift manages the memory as well as the concurrency automatically, and you can still choose to enable or disable concurrency scaling.

In every data warehousing environment you will find some poorly written queries that can be very resource-intensive. A commonly asked question is how to detect poorly written queries and, on detection, automatically terminate them. This brings us to query monitoring rules (QMR), which complement workload management. With QMR you can set up rules to detect those poorly written queries, say a Cartesian join, or a query returning billions of records to a desktop, and automatically take an action to abort or terminate them when detected. The second common use case we recommend to customers is to log your long-running queries; this lets you go back and review them, look at the profile of each query, and make adjustments to make them performant. Here is a simple example of how you can set up QMR and stitch rules together; the one I want to highlight is a return row count of more than 100 million records. A typical case is somebody trying to extract data from Redshift: when you extract, say, a hundred million records with a plain SELECT, it all goes through the leader node. If you change that query to use an UNLOAD command instead, you push the data to S3 and are able to leverage the parallelism of the compute nodes and the slices working in parallel, which will be dramatically faster.
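A sketch of that pattern with hypothetical names, unloading the result set to S3 in parallel rather than dragging it through the leader node:

    UNLOAD ('SELECT * FROM deep_dive WHERE dt >= ''2019-01-01''')
    TO 's3://my-bucket/exports/deep_dive_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    GZIP
    PARALLEL ON;           -- files are written directly from the compute node slices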
To summarize the WLM best practices as a key takeaway, in case you forget everything else from this session: consider using auto WLM, and if you are still using the old default queue, please switch to auto WLM. If your workload is very predictable and you have a need to toggle WLM between business hours and the nightly batch, use manual WLM; I would recommend starting with three queues and keeping the total concurrency across the slots to 15 for query throughput.

This brings us to cluster sizing. A common question is how to right-size a Redshift cluster, either for a POC or for a new workload. We start by estimating the uncompressed data set size. When data is ingested into Redshift it typically gets 3x to 4x compression; to be conservative, we assume 3x. On top of that, keep 30 to 40% of additional disk capacity, and please ensure that once your cluster is live, its disk utilization stays below the 80% watermark. Based on your workload and performance requirements you can choose dense compute, dense storage, or now the RA3 nodes. With this recipe, let's map it out: if we start with a 20 TB uncompressed data set, 3x compression gives you about 6.67 TB, and depending on your performance requirements that lands on roughly 4 dc2.8xlarge or 5 ds2.xlarge nodes, or now the RA3 instances, where you get elasticity of the compute and the storage independently.

Say you started off with a number of compute nodes based on that recommendation, and after ingesting the data you got better compression and want to change the number of compute nodes: use resize. Or you started off with DS2, it isn't meeting your performance requirements, and you want to switch to RA3 or DC2: again, use resize. There are two types of resize in Redshift. The first is classic resize, the old resize, where Redshift provisions a new cluster and transfers the data from the old cluster to the new one; with classic resize you can also choose to enable or disable encryption on the cluster. The second is elastic resize, where compute nodes are added to or removed from the existing cluster. Here's how classic resize works. We start with 3 dc2.8xlarge compute nodes, each with 16 slices. When you initiate a classic resize, an entirely new cluster is spun up and the primary cluster goes read-only; the data is transferred from the old cluster to the new one and redistributed across the compute nodes and slices, and then the DNS is switched to point to the new cluster. So with classic resize we started with 3 compute nodes and 48 slices and resized up to 4 dc2.8xlarge, for a total of 64 slices. Now the same example with elastic resize, where we want to add capacity. When you initiate an elastic resize, a new node is added to the existing cluster, an incremental snapshot is taken and pushed down to S3, and then the metadata, that is the slices, is moved onto the new compute node. That stage takes about four minutes; in-flight queries are paused during it, and if you have a write query in an open transaction, it is going to roll back. Once that stage is complete, the data gets hydrated from S3 onto the new compute node. Redshift is aware of the frequently accessed blocks, termed hot blocks, and those are the blocks that get hydrated first, and during the entire hydration the cluster is available for both reads and writes.
With elastic resize there are limits and ranges on how far you can resize the cluster up or down from your initial configuration: roughly 2x up or half down. If you're using one of the smaller instance families, which have fewer slices, and you started with 4 compute nodes, you can go up to 8 or down to half, which is 2. If you're using one of the larger instance families, which have more slices, you have more options: starting from 4 compute nodes you can go all the way up to 8, or down to half at 2, and to every intermediate point in between, so from 4 to 5, 6, 7, or 8, or from 4 down to 3 and 2.

So when do you use elastic resize versus classic resize? One common use case for elastic resize is a known heavy workload: you might have a weekly refresh, or month-end or quarterly reporting, which are demanding workloads. Customers resize the cluster up to shrink the duration of those heavy workloads and resize back down when they're done; and with the Redshift scheduler, which was just announced last week, you now also have the ability to schedule your elastic resizes on a cadence. Choose classic resize when you want to change the instance family. The primary distinction between the two resizes is duration: elastic resize means roughly four minutes of paused connections, while classic resize depends on the amount of data it needs to transfer, so consider classic resize an overnight operation.

Putting all of this together, I recommend that for your production cluster you use two or more compute nodes, because you want the redundancy of an additional compute node and the mirrored data copy for fault tolerance. Keep 20% of your disk space free. If you're using DS2, consider migrating to RA3, because you'll get better price performance; Redshift can now do cross-instance restore, so if you have a snapshot from DS2 you can restore it directly to DC2, RA3, or any other instance type. If you're using DC1, please upgrade to DC2: DC2 gives you twice the performance, and the upgrade is free of cost.

This brings us to one of my favorite features, Amazon Redshift Advisor. We've talked about a lot of best practices: table design, COPY best practices. But what if all of these best practices were provided to you, recommended on a daily basis, based on your workload and your cluster configuration? That's what Redshift Advisor does for you. It's available on the console on the left-hand side; it scans through the metadata every day with the lens of best practices and makes observations and recommendations that are actionable and of high impact. To list a few of the recommendations (not all of them are here): Tony mentioned the COPY command, so to make it performant, split your files and compress them so all the compute nodes and slices can work together, and skip the COMPUPDATE step, which analyzes compression while you're doing the COPY. These recommendations are captured and available to you on the console for your workload. If you're using manual WLM and your queues aren't being fully utilized, it tells you that the queues are underutilized so you can make a change, and it also provides you a query that gives insight into how your WLM queue setup is performing. If you're running a cluster and not really utilizing it, it will make that observation and recommend reducing the size, or taking a snapshot and deleting the cluster for cost savings. The same goes for enabling short query acceleration. And now that Redshift lets you alter the distribution key with an ALTER TABLE statement, Redshift Advisor looks through your query patterns and recommends a distribution key that is going to give you optimal query performance, providing an ALTER statement you can copy and execute.
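The Advisor hands you the exact statement to run; purely as an illustration (hypothetical table and column), changing a distribution key in place looks roughly like this:

    -- Convert the table to KEY distribution on the recommended column
    ALTER TABLE deep_dive ALTER DISTSTYLE KEY DISTKEY aid;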
If you go into Redshift Advisor and don't see any recommendations, you get a thumbs-up and five stars; it means you're implementing all the best practices.

To leave you with some additional resources: there's the Redshift GitHub toolbox, which has been developed internally by us and is available for you to use. The admin scripts and admin views in particular are the ones we use internally, and a lot of our customers use them to gain insight into Redshift queries and how the workload is performing. Here is the list of blogs. The first, the advanced table design playbook, is an in-depth, excellent series that walks you through advanced concepts and topics. The top performance tuning techniques for Amazon Redshift covers a lot of the best practices we went through today, but because this is only an hour, there are additional tips and tricks you'll find in that post. And if you're using Redshift Spectrum for your workload, the last blog covers Spectrum best practices; it's a great post that will give you insight into how to optimize your queries and how to calculate the cost of Spectrum queries. If you're interested in certification and training, or in building out your toolkit, we also have the new database specialty beta exam available that you can take.

I want to thank you all for taking the time to join us today; I hope you're all enjoying re:Invent. Thank you.

Thank you, everyone. I'm not sure if there are any microphones (I think we have two minutes left), but I don't see any set up for this, so we'll just do Q&A out in the hall and clear out so they can get the next session in here. Thank you.
Info
Channel: AWS Events
Views: 32,775
Rating: 4.9367948 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, ANT418, Analytics, Amazon Redshift
Id: lj8oaSpCFTc
Length: 57min 37sec (3457 seconds)
Published: Thu Dec 05 2019