AWS re:Invent 2019: [REPEAT 1] Deep dive on Amazon EBS (STG303-R1)

Captions
All right, good morning, everyone! It's Wednesday at re:Invent, so welcome. This morning you are in the EBS deep dive. With me on stage: I'm Marc Olson, a principal engineer with EBS. I've been here for eight years, and I hope you're all having a great re:Invent this year.

And I'm Ashish Palekar, a product manager on the EBS team. Our world revolves around customers, so just by show of hands, how many of you have used EBS? Everyone? Good, thank you. As you can see, we have coverage across a wide number of verticals and a wide number of use cases, so we're going to do our best to give you the full gamut of what EBS does and how it does it.

Here's our agenda for today's session. Marc and I have done this deep dive for, I think, three years running, and based on your feedback from last year we're going to change things around a little bit: we're going to focus on specific areas and hone in on the improvements we've made, but also on our best practices and recommendations in those areas. The feedback we get is that a lot of your developers want to use these tips and tricks in your environments, and that's what we're going to focus on. I'll start off with security and encryption, then Marc will dive into performance, availability, and durability. I'll come back to talk about saving cost, and then we have a real-world use case with our customer Teradata to walk you through how they put all of this together for their solution. We have a ton of content to get through, so let's get started. Marc and I will be here after the session, happy to answer any questions you might have.

So with that, let's talk about security and encryption on EBS. When we think of security, EBS integrates with the AWS Key Management Service (KMS): we use AES-256 encryption and we use customer master keys (CMKs). There are a few things that are important to understand. When an EBS volume is encrypted, it means that data at rest inside the volume is encrypted, data moving between the volume and the instance is encrypted, snapshots created from that volume are encrypted, and volumes created from encrypted snapshots are also encrypted. In other words, that encryption boundary lets all of those other capabilities flow directly from it.

How do you encrypt? If you go to the create-volume console, you can see the options right there, and one of them is to enable encryption and select a master key. How do you select a master key? You go into the KMS console, where you can create a new KMS master key for EBS. Creating the master key lets you define the key rotation policy, enables CloudTrail auditing, controls who can use the key, and controls who can administer the key. Once you've defined that key, you can select it in the master key setting in the create-volume console, and from that point on, as you add storage, you just select the encrypted option and pick the key. If you're like me and you like CLIs, you can do it from the command line: use the run-instances command, specify block device mappings, and in the mapping JSON specify the key ID that you want to use.
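As a rough sketch of what that CLI call might look like (the AMI ID, subnet, device name, size, and key ARN here are placeholders to replace with values from your own account):

  # Launch an instance with an additional EBS data volume encrypted under a specific CMK.
  aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type m5.large \
    --block-device-mappings '[
      {
        "DeviceName": "/dev/xvdb",
        "Ebs": {
          "VolumeSize": 100,
          "VolumeType": "gp2",
          "Encrypted": true,
          "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/your-key-id"
        }
      }
    ]'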
One question we get asked often: if we use encryption, does my EBS-optimized performance get reduced? The answer is straightforward: no. The rated performance on our 4-series and 5-series instances, that is C4, M4, R4, C5, M5, R5, is exactly the same whether you're encrypted or not. In other words, encrypting an EBS volume does not reduce your rated performance on those instance families, so that trade-off doesn't exist. Later on Marc will talk about why that is, but it's important to understand that you're not making a performance-versus-encryption trade-off.

Part of the EBS journey is also snapshots. We won't go into depth on snapshots specifically, but snapshots of encrypted volumes, as we talked about, are fully encrypted, and volumes created from encrypted snapshots are also encrypted. You can encrypt an unencrypted snapshot by copying it, and you can re-encrypt a snapshot while you're copying it as well.

So why are snapshots different? This is hugely important because of the security implications. Snapshots can be shared across accounts, they can be copied across accounts, they can be copied within accounts, they can be copied across Regions, and snapshots are used to create AMIs. It's super important to realize that this changes some of the security posture around snapshots.

Let's look at sharing of snapshots. When sharing snapshots, you can use the modify-permissions option in the snapshots console, or, if you like CLIs, you can use describe-snapshot-attribute and modify-snapshot-attribute with the createVolumePermission attribute. That permission actually includes the option of sharing a snapshot publicly, and I'll go into that a little later. Here's what it looks like on the console: you have two options, public and private, and with private you can share with a specific account. Be especially cautious about public sharing. When thinking about sharing snapshots and AMIs, public sharing is a reasonable use case for AMIs, think about Marketplace AMIs, but be super cautious about why you're sharing them; in almost every other case you want to share with a specific account. And if you want to launch a volume from a shared snapshot, one thing we see often is that customers don't copy the snapshot into their Region; you need the snapshot in-Region in order to create a volume from it, so be aware of that.

One question I get asked often is: how do I check what my snapshot sharing permissions are? There are two ways. On the console, within the snapshots console, there is a permissions tab; click on it, and as an example, here's a snapshot that I'd made public. From the CLI, if you describe the createVolumePermission attribute, you can see that the group is "all"; again, that's a publicly shared snapshot, and there are very few reasons to actually do that, so make sure you understand why and how you're doing it.

All right, we talked about copying snapshots and how you can encrypt and re-encrypt; let's see how you do it. You go into the snapshots console, choose the copy-snapshot action, and within the copy you can specify the encryption. The CLI works the same way: if it's an unencrypted snapshot you can encrypt it, and if it's already encrypted you can change the key.
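Here is a rough sketch of those checks and the copy from the CLI (the snapshot ID, Region, and key ARN are placeholders):

  # 1. Check who a snapshot is shared with; a group of "all" means it is public.
  aws ec2 describe-snapshot-attribute \
    --snapshot-id snap-0123456789abcdef0 \
    --attribute createVolumePermission

  # 2. If it was shared publicly by mistake, remove the public permission.
  aws ec2 modify-snapshot-attribute \
    --snapshot-id snap-0123456789abcdef0 \
    --attribute createVolumePermission \
    --operation-type remove \
    --group-names all

  # 3. Copy the snapshot and encrypt (or re-encrypt) it under a CMK of your choice.
  aws ec2 copy-snapshot \
    --source-region us-east-1 \
    --source-snapshot-id snap-0123456789abcdef0 \
    --encrypted \
    --kms-key-id arn:aws:kms:us-east-1:111122223333:key/your-key-id \
    --description "Re-encrypted copy"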
What's a use case for that? This is a pattern we've discussed with multiple customers and we see it a lot: a customer has snapshots, they copy them across Regions, and then in the second Region they use the resource-level permissions that are supported on snapshots to take those permissions away and lock the copies down. Why do they do this? First, multi-Region gives them protection against regional events. But the locking down of permissions is so they can protect against malicious or unintentional deletion of data. We see this pattern a lot in the copy-snapshot use case.

Earlier this year, in May, we launched three features, guided by feedback from our customers, to make encryption for snapshots easier. I'm going to walk you through those three features and how they're useful to you. They are: encrypted volumes from unencrypted snapshots or AMIs; sharing snapshots encrypted with custom CMKs across accounts; and encryption by default for an account within a Region, with a single click. Let's walk through each of these.

First, encrypted volumes from unencrypted snapshots or AMIs. Previously, if you had an unencrypted snapshot or AMI, you would copy and encrypt it into a new snapshot or AMI and then launch an encrypted EBS volume from that copy. Show of hands, how many people have done this? Quite a few of you. A bunch of you complained to us about it and said: why do we have to do this? It takes time, it takes effort, and really we don't want to do it. Now, with the new feature we launched, you can take an unencrypted snapshot or AMI and, with a key, directly launch an encrypted volume; no more copying the snapshot and re-encrypting it for this use case. You can do exactly the same thing if you want to change the key. How do you do it? Remember that create-volume console we were talking about: same thing here. As you're launching a volume from a snapshot, you now have an additional option to encrypt the volume. And if, like me, you like the CLI, you can use the create-volume command and specify the --encrypted flag, and that's it; the volume will be encrypted even though the snapshot was not.

Second, sharing an encrypted snapshot or AMI across accounts. Here's what you did previously: if you had a snapshot or AMI encrypted with a custom CMK, you had to copy that snapshot across accounts, then create volumes from those copies to end up with an encrypted volume in account two. By the way, encrypted snapshots could only be copied across accounts; they couldn't be shared. What we've now done is that snapshots or AMIs encrypted with custom CMKs can simply be shared across accounts, and by sharing them you can launch the encrypted volume in the other account. So in a single step you can launch an encrypted volume in account two from a snapshot in account one. One note here: this only works for snapshots encrypted with custom CMKs; it is not currently supported for snapshots encrypted with the default CMK. How do you do this? Step one is to share the key: you go into the KMS console, pick the key, and give permission on that key to the account you want to launch the encrypted volume in. From that point on, you can create a volume in the target account by specifying the key.
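A rough CLI sketch of that flow, run partly in the source account and partly in the target account (account IDs, the snapshot ID, and key ARNs are placeholders; granting KMS access through the key policy instead of a grant is another option):

  # Source account: share the CMK-encrypted snapshot with account 444455556666.
  aws ec2 modify-snapshot-attribute \
    --snapshot-id snap-0123456789abcdef0 \
    --attribute createVolumePermission \
    --operation-type add \
    --user-ids 444455556666

  # Source account: allow the target account to use the custom CMK
  # (a KMS grant is one way to do this; editing the key policy is another).
  aws kms create-grant \
    --key-id arn:aws:kms:us-east-1:111122223333:key/your-key-id \
    --grantee-principal arn:aws:iam::444455556666:root \
    --operations Decrypt DescribeKey GenerateDataKeyWithoutPlaintext ReEncryptFrom ReEncryptTo CreateGrant

  # Target account: create an encrypted volume directly from the shared snapshot,
  # specifying the key to encrypt it under.
  aws ec2 create-volume \
    --availability-zone us-east-1a \
    --snapshot-id snap-0123456789abcdef0 \
    --encrypted \
    --kms-key-id arn:aws:kms:us-east-1:444455556666:key/a-key-in-the-target-account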
And that's it: you can take an encrypted snapshot in one account and launch an encrypted volume in a completely different account, all in one step. These features are independently useful, but they're even more useful together. How so? Let's say you had an unencrypted snapshot or AMI in account one. Previously, you had to copy that snapshot and re-encrypt it in account two, and then launch the encrypted volume in account two. With these two features, you can take the unencrypted snapshot or AMI in account one and launch a fully encrypted volume in account two, all in one step. Again, our goal here is to make encryption simple.

While that is simpler, you still gave us feedback that it was not simple enough, and so we launched a third feature, which is account-level encryption by default, at a Regional level. What's the problem? Many of you told us: my business needs to make sure that all my EBS volumes are encrypted. The way administrators used to achieve this was to set IAM policies within the account, so if an end user launched an unencrypted volume, the launch was blocked. The second way administrators managed this was by taking the unencrypted volume, creating a snapshot, copying and encrypting the snapshot, creating an encrypted volume, and attaching it back to the instance. Pretty complicated stuff, right? Part of what we didn't like about this is that it's punitive from a usage standpoint; it effectively prevents customers from launching volumes when they should be able to. So the team worked hard to figure out how to solve this use case, and what we launched is a single account-level, Regional setting. With that one setting, all of your new volumes in that account, and the snapshots taken from them, are encrypted from that point on. There is no change to your existing workflows; in fact, even when you don't check the encrypted box, your volumes still come out fully encrypted. So how do you do this? You'd think it's complicated, right? It turns out you go to the EC2 settings, and there is now a setting that says encryption by default. You can also specify the default encryption key, and that's it: from that point on, all of your new EBS volumes in that account, in that Region, will be encrypted. Pretty cool to actually see how that works for you.

Before I hand off to Marc for performance: if you remember nothing else from the security piece, there are two things I want you to take away. One, please monitor your access; make sure you're not sharing snapshots with other accounts, or publicly, if you don't have to. And two, account-level encryption by default is literally a checkbox: use it.
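From the CLI, that same setting might look roughly like this; it is per Region, the key ARN is a placeholder, and without a custom key EBS falls back to the AWS-managed default key:

  # Turn on encryption by default for this account in a given Region.
  aws ec2 enable-ebs-encryption-by-default --region us-east-1

  # Optionally choose which CMK new volumes are encrypted under by default.
  aws ec2 modify-ebs-default-kms-key-id \
    --region us-east-1 \
    --kms-key-id arn:aws:kms:us-east-1:111122223333:key/your-key-id

  # Verify the setting.
  aws ec2 get-ebs-encryption-by-default --region us-east-1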
Okay, over to Marc. So now that we know how to secure your data and your customers' data, we're going to talk about building your application, and we're going to start by understanding your mission. It's super important to understand what you're doing before we talk about how you're going to do it. Let's take an example that isn't storage or compute or technology: say I'm planning a trip from Las Vegas to London Heathrow, which is about 8,500 kilometers. I give you two choices of aircraft: the Queen of the Skies, the Boeing 747, which cruises at about 900 km/h, or the de Havilland Beaver, which cruises at about 200 km/h. If that's all the information I gave you, I'd bet that most of you in this room, unless you're an aviation geek, would choose the Queen of the Skies. But what I didn't tell you is that there's actually a Beaver convention in London, and with that extra data we realize we're going to take the Beaver instead; we'll take our time getting over there, probably a couple of days, but that's more effective for our mission. Much like this trip, you all have different types of workloads, so we're going to walk through a few broad categories to help you understand what might fit where, while recognizing that not everything fits into these nice, neat little boxes. One of the things EBS enables is the ability to pick and choose, and we'll walk through how you do that.

EBS has two main families of volume types today: our SSD-backed volume types and our hard-drive-backed volume types. On the SSD side we've got gp2, general purpose, and our provisioned IOPS product, io1; on the hard drive side, st1 and sc1 are available. So which one do you choose?

Let's start with database workloads, because this is a lot of what we see: most applications have some sort of data store behind them, and this is typically where your performance requirements come into play. It's mostly random I/O; most databases, whether NoSQL or SQL based, have some sort of write-ahead log or journal, and that part is mostly sequential, but the real application pattern is highly workload dependent: what your customers are doing often drives the requirements of your database. Typically we see customers have the best experience on SSD volumes here, but sometimes the hard-drive-backed products work well too.

So let's dive into the two SSD products, starting with gp2. This is a volume where we spent a lot of time analyzing data, customer workloads, and patterns, and we developed it to hit the right performance sweet spot for about 70 to 80 percent of workloads; that's why we gave it the general purpose name. If you don't know where to start, I highly recommend you start with this one. Its performance scales with the size of the volume; our data told us that the more data you have, the higher your performance needs, so it scales at 3 IOPS per gigabyte, and smaller volumes get a burst capability of 3,000 IOPS. For throughput, that's up to 250 MB/s, and we help you get there with a logical merge: if your I/Os are sequential, we track and merge them up to 256 KB so you can achieve that throughput. You can provision these volumes from one gigabyte all the way up to 16 terabytes, and they're good for boot volumes, low-latency applications, bursty databases, just about anything, really.
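As a quick sketch of that sizing rule (the Availability Zone is a placeholder): since gp2 delivers 3 IOPS per provisioned gigabyte, a 1,000 GiB volume gives you a 3,000 IOPS baseline.

  # A 1,000 GiB gp2 volume: at 3 IOPS/GB that's a 3,000 IOPS baseline,
  # so at this size the baseline matches the 3,000 IOPS burst rate.
  aws ec2 create-volume \
    --availability-zone us-east-1a \
    --volume-type gp2 \
    --size 1000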
We talked about burst, so let me give you a quick intro to how it works, because a couple of our products have this notion of burst. On gp2 you have a baseline performance, and that's your 3 IOPS per gigabyte. It's always accumulating: every second of every day that your volume is provisioned, we deposit into a bucket, and when that bucket reaches 5.4 million credits you can't accumulate any more; everything just spills over. Then, as you read or write to the volume, you can spend those credits at up to 3,000 per second. That's how burst works. The baseline, as I said, scales with the size of the volume: we start at 100 IOPS for all volumes, so at 33 gigabytes or smaller you get 100 IOPS, scaling all the way up to 16,000 IOPS at just over a 5-terabyte volume, and you always get that baseline. For volumes up to 1,000 gigabytes, you can burst up to 3,000 IOPS, and it's that 3 IOPS per gigabyte that is depositing into the bucket. Take the example of a 300-gigabyte volume: you get a baseline of 900 IOPS, but you can burst to 3,000, so when you hit peak workloads it's there for you. A question I often get asked is: how long can I burst? With 5.4 million credits and 3 IOPS per gigabyte refilling the bucket, it's a nonlinear curve, and the bigger the volume, the longer you can burst. That 300-gigabyte volume can burst for about 43 minutes; go a little larger to 500 gigabytes and you've got about an hour of burst; and as you get bigger and approach the 1,000-gigabyte mark it gets much longer: 950 gigabytes gives you ten hours, or most of the day.

Our other SSD-based product is provisioned IOPS, io1. The thing that's unique about provisioned IOPS is that you can provision the performance separately from the amount of storage. If you need a lot of space and not a lot of performance, you can do that: 16 terabytes with only a thousand IOPS. Or the opposite, at up to a 50-to-1 ratio: 100 gigabytes with 5,000 IOPS on that volume. These are provisionable from 4 gigabytes to 16 terabytes, with that same 256 KB logical merge, and there's no burst bucket on provisioned IOPS. The other thing about io1 is that it has a tighter latency consistency profile, so it's typically good for mission-critical applications, things with sustained, less bursty workloads; we'll talk more about what we mean by critical in a little bit.

Moving on to media: this is your rendering farms, your transcoding, any sort of streaming product. Typically you have higher throughput requirements, mostly sequential, and pretty sustained, especially when you're talking about big render jobs. For these workloads, our throughput-optimized HDD product, st1, might be a good fit; and, not called out here, if st1 doesn't have enough throughput for you, provisioned IOPS with its 1,000 megabytes per second of throughput might be helpful. Most of the time, though, st1 is good enough. Much like gp2, st1 has a baseline that scales with the size of the volume, but the scaling is different: we use throughput instead of IOPS on st1, so you get 40 megabytes per second per terabyte, up to 500 megabytes per second, plus a burst that's designed to allow you to scan the entire LBA range a few times a day, also up to 500 megabytes per second.
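Those burst durations fall out of a simple calculation: the bucket drains at the burst rate minus the refill rate. A quick sketch with shell arithmetic, using the numbers quoted above (worth checking against current documentation for your volume type):

  # gp2 burst duration: 5.4M credits drained at (3,000 - 3 * size_in_GB) IOPS.
  SIZE_GB=300
  BASELINE=$(( SIZE_GB * 3 ))                          # 900 IOPS baseline
  BURST_SECONDS=$(( 5400000 / (3000 - BASELINE) ))     # about 2,571 seconds
  echo "A ${SIZE_GB} GiB gp2 volume can burst for roughly $(( BURST_SECONDS / 60 )) minutes"
  # Integer math prints 42; the exact value is just under 43 minutes.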
I don't have a slide dedicated to this, but I do want to cover the difference in burst on our hard-drive-backed products quickly: on gp2 the burst bucket is a fixed size, while on the hard-drive-backed products the bucket actually expands with the size of your volume, to give you that multiple-scans-a-day rate. st1 volumes are not designed for boot volumes, you have a minimum capacity of 500 gigabytes, and you can go up to 16 terabytes, with a one-megabyte logical merge. What I mean by logical merge is that we don't actually hold on to your I/O and wait for the next one to come in before we complete it; we just keep track of where it was, and if the next one looks like it's in the same place, sequential, the next LBA range over, then we count it as part of the previous I/O. So st1 is good for large-block, high-throughput, sequential workloads.

Data and analytics is another common application: log analytics that you want to do with Kafka, Splunk, or Hadoop, maybe data warehousing, which is a different type of database pattern. Typically these have higher throughput requirements, usually sequential I/O, but they're not really sustained; there might be some daily, hourly, or weekly periodicity to them. These are good for one of our hard-drive-backed products, either st1 or sc1, which is a little bit colder. sc1 is designed for one complete scan a day, so the baseline throughput is lower, the burst throughput is lower, and it comes at a lower cost as well. Everything else is similar to st1: the one-megabyte logical merge, 500-gigabyte minimum capacity, and up to 16 terabytes.

File sharing is another common workload; this could also be web servers and the like: CIFS, NFS, maybe a nearline archive. Often these have super low throughput, very bursty and unpredictable, but when requests do happen there's not a whole lot of traffic, and you're designing with cost sensitivity in mind. For this workload, sc1 is actually a super compelling choice.
So how do you know, if you don't fit neatly into one of these buckets, what your application is actually doing? The first thing I do is fire up one of my favorite tools; I use it all the time to get a quick glimpse of what's going on with an application: iostat, a Linux utility. Here I had a simulated workload against one EBS volume. On the read side, r/s is read requests per second, and next to it is the read throughput: I'm doing about 25,000 IOPS and getting about a thousand megabytes per second of throughput, which comes out to about 40 KB per request, under that 256 KB merge limit. On the write side the I/O size is much smaller: I'm only doing 6 megabytes per second across about 1,500 writes, so this might simulate a database workload with a journal doing small writes. On average, across the entire block device, I'm getting about 39 kilobytes per request, which is just the average of the 40 and the 4; note that iostat reports request sizes in 512-byte sectors instead of kilobytes, so you have to divide by two.

What I don't know from those numbers is whether my workload is sequential or random, and whether I'm just doing 40K I/Os or have some larger ones thrown in. For that we've got a Swiss Army knife in Linux called blktrace: if you really want to know more about your workload, you can run blktrace and capture exactly what's going on. There are a number of stages an I/O goes through in the kernel, and blktrace captures all of them: from when it's submitted to the queue, to when the queue picks it up, to when the device posts the completion, to when the application picks up the completion; it records the time of every single step. That's a lot of data, so there are tools to parse it: blkparse takes those binary dumps and turns them into a human-readable format, and the btt program, also a standard Linux utility, lets me extract the interesting parts of that data. For this use case I just wanted the offsets and the sizes, so I had it write out the block offsets file, and I wrote a quick little Python script to take that file, analyze the sequentiality of my workload, and bin the I/O sizes, so I'd have an idea of what my application was doing.

I ran that against my application, and it showed that my reads were pretty random, just a tiny bit sequential, with a mix of I/O sizes, which is kind of interesting: 8K, 32K, and 64K I/Os. The writes were mostly sequential 4K. Since these were intermixed, iostat didn't show any merging or sequentiality, but with this analysis I can identify that maybe there's a part of my application that is sequential and can be carved off. A lot of databases will let you put the journal file onto a different device, so this could be a case where I put the read workload, my data tables, on a gp2 or provisioned IOPS volume, and put the journal on an st1 volume, since it's a sequential workload. EBS allows you to mix and match like that: you can take an instance, attach any number of volumes of any shape, any size, any performance characteristics, and really fine-tune for what your application is doing.
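The tracing workflow described above might look roughly like this on the command line (device names, durations, and file names are illustrative, and the final analysis script is your own):

  # Quick glimpse: extended statistics in MB, refreshed every 5 seconds.
  iostat -xm 5 /dev/nvme1n1

  # Capture 60 seconds of block-layer events from the device.
  sudo blktrace -d /dev/nvme1n1 -o mytrace -w 60

  # Turn the per-CPU binary dumps into a human-readable log,
  # and also emit a combined binary file for btt.
  blkparse -i mytrace -d mytrace.bin -o mytrace.txt

  # Extract per-I/O block numbers (offsets) with btt; the resulting files can be
  # post-processed with a small script to measure sequentiality and bin I/O sizes.
  btt -i mytrace.bin -B mytrace_offsets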
So let's talk a little bit about instances now, because instances are also important to the mix. With the combination of volumes and instances that you can attach and mix and match together, there's a really big number of combinations you can come up with, but I'm going to simplify it down into a few categories. I don't have any of the accelerated instance types up here; some of you may be doing GPU or FPGA workloads, and those are available as well. For the broad category of applications, I recommend you start with one of our general-purpose instances: M5, T3, A1, even the upcoming M6g. These have a pretty balanced ratio of CPU to memory. If your application is more compute intensive, so you need more CPU threads or higher frequency, we've got C5, C5n, and z1d in that category, or even the C6g. If you need to go the other way, and a lot of databases do, we've got our higher-memory instances with much larger amounts of memory: R5, R6g, even X1e and the high-memory U-series metal instances with terabytes of memory.

If you've been around AWS and using EC2 for a while, you'll notice that I'm focusing mainly on Nitro instances, and I think it's really important that if you're building an application today, and even with your existing legacy applications, you migrate to Nitro as fast as you can. There are a number of reasons: we've done a lot of work to understand workloads, understand how our data centers work, understand how customers use their applications, and we've added things that make the platform more efficient. The Nitro System really boils down to three parts. We'll start in the center with the Nitro security chip: it gives us a hardware root of trust, lets us know that the hardware is what we expected, increases the security of our platform, and also allows us to offer bare-metal instances. Now, I'm not going to do a super deep dive on Nitro today; there are other sessions for that, and I highly encourage you to attend them if you want to learn more. Next, our Nitro hypervisor: we built it from the ground up; we started with KVM, but it doesn't look anything like upstream KVM, because we've highly customized it for our infrastructure. One of the things we recognized is that if we offload all of our devices, we don't need any of the device emulation code, which greatly simplifies the hypervisor and removes a lot of surface area for security. The way we did that was by building Nitro cards: VPC networking offloads, EBS offloads, instance storage, as well as our main system controller are all offloaded onto Nitro cards.

Since this is an EBS talk, I'm going to focus on the EBS Nitro card. I mentioned earlier that we can offer encryption for EBS volumes with no performance impact on our 4-series, 5-series, and upcoming 6-series instances, and that's because of the Nitro card. We've had these cards in our instances since C4, but we didn't expose them directly, and one of the reasons is that the NVMe interface we present wasn't mature back in 2014 when we launched C4: the driver stacks in the operating systems weren't mature enough. Two years ago we noticed that the drivers were getting a lot more mature, with a lot of bugs fixed and a lot of performance improvements, in Linux and Windows as well as the BSD platforms, so we were able to take that one last remaining thing we weren't presenting into the instance and move it into a PCI device in your guest. On the EBS back end, the Nitro card provides that encryption offload; it's a hardware-assisted offload, and all the crypto keys are stored in that module, not accessible by anybody. What we implement today is effectively NVMe over Fabrics, and we've been doing it for a while: we present NVMe to the guest and have a fabric back end; it might not look exactly like you would expect, but it really is NVMe over Fabrics. This allows us to do it super efficiently over our own dedicated uplink, which gives us the ability to do EBS-optimized by default. EBS-optimized means a dedicated uplink that gives you dedicated EBS bandwidth, up to 14 gigabits per second, or 1,750 megabytes per second. On smaller instances with Nitro, one of the things we were able to do is provide a burst capability, so you can burst up to a higher amount for some period of time; I'll go into what those numbers are later, but it gives you the ability to use a smaller instance size, perhaps. And one thing I love is that we keep iterating: these Nitro cards have enabled us, with some software improvements, to now give you up to 19 gigabits per second of EBS-optimized throughput per instance. This is available today on C5, M5, and R5 instances, and it's coming to the rest of the Nitro family over the next few weeks.
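If you want to compare the EBS-optimized bandwidth of different instance types when choosing where to run, something like the following may help; this is a sketch, and the exact field names under EbsInfo are worth verifying against the current CLI documentation:

  # Compare EBS-optimized bandwidth limits across a few instance types.
  aws ec2 describe-instance-types \
    --instance-types c5.large m5.4xlarge r5.12xlarge \
    --query 'InstanceTypes[].[InstanceType, EbsInfo.EbsOptimizedInfo.BaselineBandwidthInMbps, EbsInfo.EbsOptimizedInfo.MaximumBandwidthInMbps]' \
    --output table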
So when you're building your application, it's really important that you don't just stay stagnant; continue to experiment. What you have today, and what your application needs today, may change: your performance needs may change as you scale and onboard more customers. So keep experimenting, and use the scientific method when you do it, so that you have repeatable processes. If you're using benchmarks, that's a great first step; they give you an idea of the performance of a given combination, but nothing shows you performance like your real-world workload. So if you have the ability to do A/B testing, or to model your actual customer traffic, maybe in a one-box environment within a larger system, that's a great way to see whether your changes have actually improved things.

You can monitor your EBS volumes with CloudWatch, and one thing that's super interesting is that we have a burst balance metric, which tells you, for gp2, st1, and sc1 volumes, how your burst bucket is doing and whether you're utilizing it. I've got a workload here on a 500-gigabyte volume, and you'll see that for the 35 or 40 minutes that I'm getting the full burst, the burst bucket is depleting (in blue) and I'm getting the volume performance of 3,000 IOPS (in orange); then, when my burst bucket depletes, my IOPS trail off; I stop my workload and the burst bucket comes back. It's a really easy way to see whether you're using your burst bucket and whether you need to scale your volume or your performance. You can also combine metrics with CloudWatch: you can monitor a RAID set if you want, or other metrics from your application, and get a higher-level view.

On scaling, you can use elastic volumes to get to the right type or the right performance level. Maybe you started with an io1 volume and you realize that's more than you need: you can go to a gp2 volume; with gp2, just make sure the size gives you enough performance for your workload, so you may need to scale the size up depending on where you're going.
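A rough sketch of pulling that burst metric from the CLI (the volume ID and time window are placeholders); BurstBalance is reported as a percentage of the bucket remaining:

  # Average burst balance for one volume, in 5-minute buckets over a 6-hour window.
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS \
    --metric-name BurstBalance \
    --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
    --start-time 2019-12-04T00:00:00Z \
    --end-time 2019-12-04T06:00:00Z \
    --period 300 \
    --statistics Average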
In EBS we actually think about availability and durability separately. This is a little different from how most people think about storage, but if you think about EBS as more of a distributed system, it makes sense, and we really do think of it as a distributed system. EBS is designed for five nines of service availability. What do we mean by availability? It's the ability for the instance to get to the hosts that store your back-end data, so it accounts for the actual EC2 hardware, the network, and the physical hardware that stores the data. Once we get there, we think about durability: the ability for us to actually get to your data, or to store your data. EBS is designed for an annual failure rate of 0.1 to 0.2 percent, and the way to think about this is that, on average, if you have a thousand volumes over the course of a year, you can expect one to two of those volumes to fail. Knowing that failures happen, it's really important to think about how to design to accommodate them.

I mentioned earlier that I'd talk about what critical databases and systems are. At AWS, and across Amazon more broadly, we've built a ton of distributed systems, and we think about these systems in one of two ways. The way we bucket them is by asking this question: would my customers or my business be impacted by degradation or an outage? If the answer is yes, we call it a tier-one, or critical, system. Now, why bucket this at all; why not build everything as a tier-one system? With tier-one systems there's going to be a higher cost involved: either you're going to have an EBS volume with higher performance characteristics, maybe scaled for your peak instead of your average, maybe an active-active or active-passive or some other clustered solution. There's also a human cost: five nines of availability or higher is very little downtime, so you have to think carefully about your deployment strategies and how you're going to operate the system. And then we have the everything-else bucket for things that might not be business impacting: analytics, ETL jobs, maybe a finance pipeline that can run at night so it's ready during the day, where an hour of downtime at midnight isn't noticed by anybody except the engineers who have to deploy the changes in the middle of the night.

When your volumes fail, it is important to be able to recover your data, and for that we've got EBS snapshots, which are a point-in-time backup of the modified volume data. Those changed blocks are stored in S3, a service with eleven nines of durability. EBS snapshots are incremental and crash consistent, and one of the neat things we released this year is the ability to take a crash-consistent snapshot of all of the volumes attached to your instance: previously you had to take a snapshot of every attached volume one by one, and now, with one API call, you can take a consistent snapshot across all of them. With Amazon Data Lifecycle Manager you can automate this, and reduce the amount of data you'd lose in a volume failure, or improve your ability to recover from it; it automates your snapshot lifecycle, integrates with CloudFormation, and works at the EBS volume or instance level. And once you have those snapshots: just a couple of weeks ago we released Fast Snapshot Restore, which lets you enable a snapshot so that volumes created from it, up to ten volumes at a time, get near-expected performance right away. You just enable it on the snapshot, we hydrate it in the background, and you get near-real-time, near-expected performance as we hydrate it.
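A rough sketch of both of those from the CLI (instance ID, snapshot ID, and Availability Zone are placeholders):

  # Take a crash-consistent snapshot of all volumes attached to an instance
  # with a single API call.
  aws ec2 create-snapshots \
    --instance-specification InstanceId=i-0123456789abcdef0,ExcludeBootVolume=false \
    --description "Crash-consistent multi-volume backup"

  # Enable Fast Snapshot Restore on a snapshot in the AZs where you will
  # create volumes from it.
  aws ec2 enable-fast-snapshot-restores \
    --availability-zones us-east-1a us-east-1b \
    --source-snapshot-ids snap-0123456789abcdef0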
And with that, we'll talk a little bit more about ways to save cost on EBS. Thanks, Marc, for the discussion on performance, availability, and durability. One of the things we get asked by customers is: okay, so how do I save cost on EBS? As Marc alluded, we have four volume types, and selecting the right volume for the right workload is key, because each of these volumes comes at a different price point: you have gp2 at ten cents per gigabyte-month all the way to sc1 at two and a half cents per gigabyte-month, and you can mix and match these volume types, based on your workload and on what's important, to meet your business needs. That is a huge cost lever when thinking about volumes.

Marc mentioned elastic volumes: you can use elastic volumes to size your volumes correctly. You don't have to size up front for what you ultimately think you'll need. If your workload is going to scale and grow, you can start small: start with an io1 volume with a limited number of IOPS, and then grow that volume's capacity and IOPS as your needs grow, rather than paying for it all up front. Again, that changes your overall cost profile, and making the change is as simple as a single command: you can modify volumes and increase the size and the IOPS as your business needs it. One thing when you do increase the size, and I see this quite often: make sure you expand your file system to take advantage of that new capacity. That's something I end up working through with customers a lot.
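A rough sketch of that grow-as-you-go pattern (volume ID, sizes, device names, and filesystem type are placeholders; the filesystem commands assume ext4 on Linux, and XFS would use xfs_growfs instead):

  # Grow the volume and raise its provisioned IOPS in one call.
  aws ec2 modify-volume \
    --volume-id vol-0123456789abcdef0 \
    --volume-type io1 \
    --size 2000 \
    --iops 16000

  # Watch the modification progress.
  aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0

  # On the instance: grow the partition (only if the volume is partitioned),
  # then grow the filesystem so it can actually use the new capacity.
  sudo growpart /dev/nvme1n1 1
  sudo resize2fs /dev/nvme1n1p1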
Another common pattern we see is selecting the right instance size to match your needs. In this example, a customer has a c4.2xlarge connected to a provisioned IOPS volume with 10 terabytes and 16,000 IOPS. What's wrong with this picture? The 2xlarge can do 8,000 IOPS, which means it is mismatched to the volume it's connected to. The ten terabytes might be fine, but the c4.2xlarge cannot take advantage of the 16,000 IOPS provisioned on that io1 volume. Here's one way to fix it: go to a c4.4xlarge, which does 16,000 IOPS, and now the instance matches the IOPS provisioned on your volume. Or you could go the other way: keep the c4.2xlarge and reduce the IOPS on your io1 volume to match the instance. Both options are available, but keep tabs on which instance type you're matching to your storage needs.

We talked about EBS-optimized burst earlier as a benefit, but how does it work and what does it mean for you? We've enabled it on the Nitro family: the 5-family, i3en, p3, z1d. If these instances have sizes smaller than 4xlarge, they get this new capability we call EBS-optimized burst: your burst IOPS and throughput can run for up to 30 minutes every 24 hours, because we find that a lot of workloads need a short spike of IOPS or bandwidth and can benefit by reducing the instance size. To give you a sense of what that looks like: if your sustained IOPS are 4,000 but you need a burst capability for half an hour, you can use a c5.large and it can give you the performance you need, whereas without the burst you would potentially have to go up to a much larger instance size, because 20,000 IOPS is what your peak requires. In other words, by balancing and understanding your workload and what that spike looks like, you might be able to go to a smaller instance size, again helping you save money.

A third pattern is tagging volumes and snapshots on create. One of the things we see is that customers create volumes, spin up instances, and create snapshots; the instances go away, the volumes and snapshots remain, and then everybody is left asking: what was that volume doing, and what was that snapshot connected to? Both volumes and snapshots, as you can see from the commands, support tagging on create. Use tags on create to make sure you understand what the purpose of a volume or a snapshot was, so that as your needs change you have trackability and traceability. You can also use cost allocation tags on your snapshots to keep tabs on how your costs change for your customers.

Show of hands, how many people know about delete on termination? A few, but not enough. DeleteOnTermination is a flag on a volume: on boot volumes it's set to true by default, and on data volumes it's set to false by default. Here's what it means: if DeleteOnTermination is set to false, when an instance goes away the volume stays behind; when it's set to true, if the instance goes away, the volume also goes away. So if you have a workload in which your instance lifecycle matches your storage lifecycle, set DeleteOnTermination to true; but use it with care, because you may delete volumes that you actually need. It's super important in this case that you understand your use case and your lifecycle needs.

Data Lifecycle Manager is something we launched last year to help customers keep track of their snapshots and set policies on them. You can set policies that take snapshots at a periodic interval and keep a certain count of them. Here's an example of a policy I set: it takes snapshots every two hours and keeps 24 of them, which gives me two days of snapshots; that way you have a fixed lineage with an upper bound on how long your snapshot lineage grows. Customers came back to us and said: we want snapshot retention based on the time that the snapshot exists, and that's precisely what we've launched. This is now available: we have time-based Data Lifecycle Manager policies, where you can set snapshot retention in days, weeks, and months, which lets you meet your business needs. In the same place you select policies, you can now select time-based policies and set the retention count in days, and that sets the upper bound on how long a snapshot is retained.

Another thing you can do is lower the cost of the volume type when the volume is not in use. If the instance dies, or you terminate it, you can use elastic volumes to move your gp2 volume down to sc1 and keep it there, and when it's time to connect an instance back up, modify it back to gp2. If you do this, there are a couple of things to keep track of: there's a six-hour limit between volume modifications, plus the time the modification itself takes, so this is especially helpful when you have a predictable schedule, weekends or month-end, something you can plan for.

The last piece: delete volumes that you don't need. I always put the use-with-care label on this, because you need to understand your use case and your patterns, but if you have a volume that's detached, you can generally use CloudWatch Events or tags to figure out how long it has been detached, set an explicit organizational policy for how long you keep detached volumes, and we strongly advise that you take snapshots of those volumes; then you can proceed to delete them. These are all techniques and patterns we see customers employing to get to lower cost points on their EBS bill.
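A rough sketch of that cleanup pattern (volume and snapshot IDs are placeholders, and whether to delete is a decision your own policy should make):

  # Find volumes that are not attached to anything.
  aws ec2 describe-volumes \
    --filters Name=status,Values=available \
    --query 'Volumes[].[VolumeId, Size, CreateTime]' \
    --output table

  # Snapshot a detached volume before removing it, per the advice above.
  aws ec2 create-snapshot \
    --volume-id vol-0123456789abcdef0 \
    --description "Final snapshot before deleting detached volume"

  # Once the snapshot has completed and your policy allows it, delete the volume.
  aws ec2 delete-volume --volume-id vol-0123456789abcdef0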
So, to put all of these best practices together, and to give you a sense of how they use EBS, up next is Teradata, with Vinod and Mahesh.

Thank you, Ashish. Hey everyone, I'm Vinod Raman, a product manager for Teradata products on AWS. To give you a quick overview of Teradata: Teradata pioneered the data warehouse 40 years ago, and we've been a leader in this space ever since. Right now we are focused on delivering Teradata products as a service on AWS to our customers, because that's what our customers want, and we have multiple production customers running on AWS today. Our customer base spans multiple verticals, and the common characteristic is that many of our customers have very demanding needs: high availability, resiliency, and performance at scale.

Our core product on AWS is called Teradata Vantage. The Vantage value proposition is pretty simple: we want our customers to use any tool and any language of their choice to access 100 percent of their data. At the core of Vantage are three engines: the advanced SQL engine, which is our Teradata database, plus a machine learning engine and a graph engine we've added. The heart of the system is the data store; it plays a central role because all three engines can hit the same data store, so you can unlock more value out of that data. The use cases we see for Vantage on AWS: production analytics, where we have, at scale, millions of queries, hundreds or thousands of users, and hundreds of applications; discovery analytics, where data scientists want to explore data located in their S3 buckets; test and dev systems, where lines of business want to spin up systems and quickly try out new things; and disaster recovery is a huge use case for us as well: as customers migrate from on-premises to AWS, they want business continuity, and having a second system on AWS is a great way to achieve it.

Coming down to the technical details: our Vantage software runs on EC2 instances. To emphasize the point, Vantage doesn't leverage just one storage type but multiple storage types; by that I mean that, depending on the need, it can use memory, EBS, or S3. For frequently accessed data that needs to be reached quickly, where performance is very critical, memory is leveraged, including local memory plus the NVMe instance storage if the instance supports it. EBS we use for persistent data that is updated quite often but still needs performance SLAs. And S3-based data lakes are for data scientists and others who want to query large amounts of unstructured data.

A couple of points I want to emphasize that were made earlier, starting with the right choice of instance type. We choose our instance types based on a few factors: how many vCPUs the instance type has, how much memory per vCPU is available, and how much EBS throughput or bandwidth is available on a per-vCPU basis. We originally started on the 4-series of instances, and we've recently switched pretty much all our deployments to the 5-series, where we see a big difference in throughput and performance. It also sets us up to get the latest and greatest as AWS innovates and we partner with them, including the recently announced 19 gigabits per second of EBS throughput; by choosing the Nitro family of instances, we're positioned to receive the latest from AWS.

With that, I'm going to hand it over to my colleague Mahesh. Thanks, Vinod. Good morning, my name is Mahesh Subramanian, and I manage the engineering and operations for Teradata Vantage as a service on public cloud, AWS specifically. I'm going to be talking about how we chose the EBS volume type, and also about some of the EBS capabilities we use within Vantage. But first I want to start with a recap: Teradata Vantage is a very critical system for our customers. It's a source of truth; most of the business-critical data is stored in the SQL engine, and the analytics is performed on that data, so the system is very critical. Therefore, for availability, the way we solve it is to keep multiple copies of data on both EBS and S3, to cover both local and zonal failures.
To understand how we chose the EBS volume type, we need to understand the workload characteristics. Teradata Vantage workloads tend to be CPU intensive, memory intensive, or storage-throughput intensive. A use case for CPU intensive would be IoT sensor-related workloads; memory intensive would be machine-learning-related workloads; and complex SQL queries are typically the storage-throughput-intensive workloads. The key here is the workload profile: Teradata Vantage workload profiles tend to be about 90 percent read and 10 percent write, and the most important part is that we look for a balance, the key word being balance, of high throughput and IOPS. Based on those considerations, we went with gp2 SSD volumes for Vantage, for the best price and performance for our customers. If we compare that to some of the other volume types that were discussed: it's important to note that with Teradata Vantage the I/O size is typically 96 KB, with a range of 4 KB to 512 KB, which means the I/O is typically small and random in nature. Because of that random I/O, the st1 EBS volume does not have the necessary performance for us. io1 is very performant, but the price and the consistency it provides are not required for a system like Vantage, and therefore, again because of the balance, we went with gp2.

Now I want to talk about a couple of EBS capabilities that we use. The first one is expanding EBS volumes using the API. This has been a very useful feature for us, letting us expand our customers' volumes without any downtime; a bit earlier we had to redeploy clusters for them, and we don't have to do that anymore because of this feature, and we've been an early adopter of it with AWS. More recently we have been working with the crash-consistent, EC2-wide (multi-volume) snapshots; they help us with faster and smoother backups, and early tests have shown that we can offer a better RPO for our customers. What I've liked about this capability is the other use cases it enables, specifically around disaster recovery, to meet specific RTO and RPO SLAs for our customers. Finally, I want to invite you all to our booth: the Teradata booth is number 405, on the left-hand side as you get into the expo. We have Teradata Vantage presentations and demos there, and of course swag. Thank you. Over to you, Ashish.

All right, we're almost at the end of our journey. There is training and certification available for you: there are free digital courses on storage and the entire storage family, so do take them. Thank you for being patient through the session. Marc and I and the Teradata team will be around here, so please stop by with any questions. We take your feedback very seriously, and we put a lot of effort into shaping this session based on it, so please complete the survey in the mobile app. We look forward to your feedback and your questions. Thank you for being EBS and AWS customers. Thank you. [Applause]
Info
Channel: AWS Events
Views: 7,826
Rating: 4.84 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, STG303-R1, Storage, Teradata, Amazon EBS, Amazon EC2
Id: wsMWANWNoqQ
Length: 61min 19sec (3679 seconds)
Published: Thu Dec 05 2019