Deep Dive on Object Storage: Amazon S3 and Amazon Glacier (126393)

Captions
I know we're next to last, I think, so thanks for coming and sticking around for this. My name is Mark and I'm in technical business development for storage services at AWS; in other words, I'm essentially an extension of the storage team, out working with customers like yourself. We're going to save questions until the end, but I'm happy to have conversations afterwards, and we also have some engineering leads and product managers here for Q&A if you like.

How many of you are familiar with or using S3 today? Okay, good — that helps me tailor the content. I'm going to give a little bit of an overview of S3, and then, since the title is "deep dive on S3," I'm going to go through a number of areas including performance, management, and some security considerations, as well as new capabilities and tools we have not only in S3 but in our entire object storage portfolio, including Glacier.

For the 20% of you that didn't raise your hands, I'm going to spend a minute on AWS storage. Hopefully over the last couple of days you've heard about AWS compute and the entire family of 70-plus services that we offer, but I wanted to do a slightly deeper dive on the different storage offerings. For those of you familiar with traditional enterprise storage like SAN and NAS: we have block storage, that's EBS, directly attached to EC2, and we have instance stores. For those of you familiar with NAS, we have EFS. How many of you are using EFS today? Okay, probably a smaller amount. This is our managed NAS; we launched it last year, it's NFS 4.1, and it's great for any application that needs a POSIX-compliant file system. If you have multiple EC2 instances, multiple compute instances, sharing common datasets, that's exactly what EFS was designed for. Also, a lot of customers I talk to have older applications they don't want to re-architect for an object store; EFS lets you lift and shift those applications and take advantage of a file-system-based solution. So those are the traditional block and file equivalents from the enterprise.

Today's session I wanted to focus mostly on our object stores. We have S3 — I think everybody's familiar with that. Anybody shout out: there's one bullet missing here, what goes between S3 and Glacier? S3 IA, right, S3 Infrequent Access. So S3 really has two pieces: a top tier called S3 Standard, which has been around a while — we just celebrated our 11-year anniversary — and S3 Infrequent Access, which we introduced in 2015 for less frequently used data. Then finally, for the colder data (hence the name), we have Glacier. Those are the object storage components.

One of the big things we talk about is that data has gravity: it takes a lot to get it in, and once it gets there it's hard to move it out. Seismic data obviously, but also genomics data and, increasingly, video — going from hi-def to 4K to 8K, there's a lot of density. What we're seeing in general is that it takes a long time to move data from point A to point B. We've been working with a lot of customers — you might have heard DigitalGlobe, who have spoken about moving 50-plus petabytes of data, plus FINRA, who are moving a lot of data in — and that data takes a lot to move over a traditional wire.
So we've been working with customers to move that data in, and we'll go through a few of those solutions to expedite that move, either continuously, meaning in an online mode, or via a batch mode. Snowball is an example of a batch, Snowmobile is an example of a very big batch, and things like Storage Gateway and Transfer Acceleration are examples of the more real-time options where you want to maintain that connectivity.

So, basically, I usually feel that if you get one out of five it's a good session — depending on how much experience you have with AWS, hopefully you'll take at least one of these five bullets, one nugget, away from the session. First, we want to make sure you understand the different storage classes; there are some misnomers out there, and we're also continuously innovating, so there are nuances of the different storage tiers we want to get into. Even if you're using S3 today, maybe you'll think about using S3 Infrequent Access or even Glacier. Automation: we've added a lot. I joined AWS a few years ago, and one of the things that was awesome was that S3 just worked — but it was also a black box. What we heard from customers a lot was: more visibility, more automation, more optimization, to really help them with their workloads, workflows, and lifecycle management. So automation and optimization were things we really focused on, and we introduced a number of things at the last re:Invent to address management as well, which goes with visibility. And finally migration, tiering, and bursting — moving data in and out. [Clicker trouble — okay, we'll get it sorted.]

So first of all, we have S3 Standard, that's for your active-use-case data; we have Standard Infrequent Access, for infrequently accessed data; and then we have Glacier. As you go down the curve there are really three different use cases to think about, and we've mapped both commercial and public sector customers and use cases to them. S3 Standard is the one that's been around for a while. You can think of that as fairly hot data, data you're going to access regularly. In what we call US major regions, and also in Dublin in Europe (it's a little different in GovCloud), it's priced at about 2.3 cents per gig per month, and with that you can access all of the data as much as you want. So if you're doing big data analysis where you're touching the data quite a bit, content distribution like streaming (think Netflix), or static website hosting — any time a good percentage of the data is touched on a regular basis, S3 is the perfect, very highly durable, highly stable platform for that.

Many other types of data don't fall into that kind of data set, though, because only a small percentage might be accessed in a particular month. It doesn't matter what kind of data: if you have, say, a petabyte, and in any given month maybe only 10 percent of it is accessed — ten percent one month, another ten percent the next month, maybe fifteen percent the month after — that's when you want to start considering S3 Infrequent Access.
People say, "oh, I'm still accessing the data, I've got maybe 10 files that are super hot" — but step back a little, look at the overall data set, and ask: is it 5 percent, 10 percent, 20 percent? Anything under — up to almost a hundred percent — you're going to benefit from savings with S3 Infrequent Access, or S-IA. S-IA is priced at 1.25 cents per gig per month, a little over half of what S3 Standard is, but every retrieval — every time you touch an object and pull it out — we charge a penny per gig. If you do the math quickly: if you access 100% of your petabyte every month, it's 1.25 cents plus 1 cent per gig, so 2.25 cents per month at 100% access. You actually break even at about 100 percent access. It's a tier we introduced that maybe has a little less awareness, but it's a great tier for backup and archive, DR, file sync and share, and any longer-retained data. Pinterest is a good example: they have a lot of images that are accessed very frequently and then quickly become not that interesting anymore, so they move down the stack and are touched less frequently. Hudl is another customer of ours with video — anybody doing video knows the data is really hot for a period of time and after a few weeks it goes much colder and is accessed less frequently. Anything with a lifecycle like that, consider S3 Infrequent Access.

Now, there's a demarcation with Glacier. If you need data synchronously, in milliseconds, then either S3 or S3 Infrequent Access is the one you want. However, if you can tolerate a little more time — minutes to hours — you can get a lot of additional cost savings with Amazon Glacier. Illumina uses us for genome sequencing and long-term genome data. Sony DADC actually reinvented their entire workflow. Say you're an Indian broadcaster and you want the newest Spider-Man movie in ten different bit rates. What they used to do is go to their tape library, pull the master out, take five days to transcode it into the twenty different formats that broadcaster wanted, and then ship it out either on disk or using Aspera, for example, as the data transfer protocol for broadcasters. Today they've compressed that five days down to 24 minutes — the original target was one hour. They store all of those large assets on Glacier, pull the objects out, transcode them, land the transcoded assets on S3 Infrequent Access, and distribute them out. So you can think about these tiers complementing one another. Glacier is four tenths of a cent per gig — about one third of the cost of Standard Infrequent Access, so you get a lot of cost savings there — and a penny per gig for retrieval, and we'll talk about some other retrieval options as well.
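As a rough sketch of that break-even arithmetic (not from the talk): the per-GB prices below are the 2017 list prices quoted by the speaker, and the retrieval fractions are made up purely for illustration.

```python
# Rough break-even arithmetic for S3 Standard vs. S3 Infrequent Access vs. Glacier,
# using the per-GB prices quoted in the talk (US major regions, 2017).
STANDARD_PER_GB = 0.023       # ~2.3 cents per GB-month, retrievals free
IA_PER_GB = 0.0125            # ~1.25 cents per GB-month
IA_RETRIEVAL_PER_GB = 0.01    # ~1 cent per GB retrieved
GLACIER_PER_GB = 0.004        # ~0.4 cents per GB-month
GLACIER_RETRIEVAL_PER_GB = 0.01

def monthly_cost_per_gb(fraction_retrieved):
    """Cost per GB-month when `fraction_retrieved` of the data set is read back."""
    standard = STANDARD_PER_GB
    ia = IA_PER_GB + IA_RETRIEVAL_PER_GB * fraction_retrieved
    glacier = GLACIER_PER_GB + GLACIER_RETRIEVAL_PER_GB * fraction_retrieved
    return standard, ia, glacier

for pct in (0.10, 0.50, 1.00):
    std, ia, gl = monthly_cost_per_gb(pct)
    print(f"{pct:>4.0%} retrieved: Standard ${std:.4f}  IA ${ia:.4f}  Glacier ${gl:.4f} per GB")

# Even at 100% retrieval, IA is ~$0.0225/GB vs. ~$0.023/GB for Standard,
# which is the ~100% break-even point the speaker mentions.
```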
One thing underlying all of this: compare it to traditional enterprise storage. I have a long history in storage, and if you think about traditional enterprise backup, the number of copies, and DR scenarios, typically you have a primary site, you have tape backups, and you replicate some of the data synchronously and some of it asynchronously to another site. These are pretty common scenarios for the traditional enterprise. So when you're storing a gig, a petabyte, an exabyte of data with traditional enterprise storage, you're storing it once, maybe making another copy, maybe a third copy, and then maybe still making a backup — you're storing it many times.

One thing that underlies all of our storage is this notion of durability, and one of the things we talk about is eleven nines of durability. What the heck is eleven nines of durability? I want to drill into that for a second. We worked with a consulting firm that works with a lot of media companies who have long-term preservation requirements, and those media companies do one of two things: they either make two copies of a new video — so two tapes — or they have a primary site, make a copy, and move it off to a second site and make another copy there. Durability is the mathematical probability that an object will be lost or corrupted in some way. When you do the computation, the first scenario has about four nines of durability and the second about five nines. We have eleven nines, which means we're six to seven orders of magnitude less likely to lose an object — the math says billions of years would have to go by to lose an object, and there's an analysis on our website about exactly the formula we use. That's the durability aspect under the covers. The other thing you get is multi-site durability: every single object we write is stored across a minimum of three different Availability Zones, three different physical data centers. So even if a facility has a power cut, or more likely a connection cut, or a flood, or the whole facility goes away, the data can still be accessed and reconstructed from the remaining two facilities. Durability is a lot more than making copies: when you're storing one gigabyte of data, what you're getting is essentially many copies of that data and a very highly durable version of it. Does that make sense? Okay, great.

So we talked about S3 IA versus S3 — when to use it, when to move to it. We introduced S3 Infrequent Access, and for all of 2016 I spent a lot of time talking to customers who said, "I've got all this data in S3 and I don't know what's hot and what's not." We had some customers who were very advanced and adventurous: they took the log files from S3, ran them through an EMR job, and did an analysis on the output to move data manually into S3 Infrequent Access. Why? Because it's half the cost. So that I'd have to walk through this process a lot less with customers, we introduced a one-button tool called S3 storage class analysis (S3 analytics). It essentially automates what those customers were doing with their EMR jobs: push the button, have it scan through for 30 days, and then it makes recommendations on how much and how often the data should be tiered.
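As a rough sketch (not from the talk), here is roughly what switching on storage class analysis could look like with boto3; the bucket names, prefix, and configuration ID are placeholders. After roughly 30 days of observation it produces the tiering recommendations described above, and the optional DataExport block drops daily CSVs you can pull into Excel, QuickSight, or other tools.

```python
import boto3

s3 = boto3.client("s3")

# Enable storage class analysis for a whole bucket, exporting the results as CSV
# to a separate report bucket (names are placeholders).
s3.put_bucket_analytics_configuration(
    Bucket="example-source-bucket",
    Id="whole-bucket-analysis",
    AnalyticsConfiguration={
        "Id": "whole-bucket-analysis",
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::example-report-bucket",
                        "Prefix": "storage-class-analysis/",
                    }
                },
            }
        },
    },
)
```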
So that was S3. Amazon Glacier is the lowest, coldest tier. It's optimized for very infrequent retrieval, it's even lower cost, and it has the same durability we talked about for S3 and S3 Infrequent Access. A lot of customers — in fact, even here today — who are moving off tape, or traditionally a combination of disk and tape, have looked at Amazon Glacier, especially for that long-term, highly durable retention requirement. The additional thing, especially for public sector customers: we have a feature within Glacier called Vault Lock. Vault Lock satisfies SEC Rule 17a-4 and other compliance requirements. Essentially it provides a retention option: if your retention is one year, two years, seven years, ten years — healthcare for example is seven or 21 years, education differs, and different industries and organizations vary; some just retain forever, and we can do that as well. What Vault Lock does, in addition to storing the data and making sure it doesn't get changed, is let us put a retention period on it and not allow any deletion of that data for that period of time.

The cool new thing we introduced at the end of last year: when we worked with a lot of customers, they said, "S3 Infrequent Access is great and Glacier is great, but Glacier typically takes 3 to 5 hours to get data back. I don't need it in milliseconds — I have an archive, say a tape system with images or documents, and I need to retrieve gigabytes, terabytes, or petabytes of data in hours — but for a few of them, I'd really like to get them back in a couple of minutes; under 5 minutes would be awesome." We heard that a lot from broadcasters and media companies, but probably some of you have this requirement as well. So we introduced expedited retrieval. That lets you store in Glacier at super low cost — still four tenths of a cent per gig — but we will retrieve those files in one to five minutes when you need that emergency access. We actually have a couple of customers responding to RFIs and RFPs in public sector where they're using expedited retrieval as the primary retrieval method. We looked at it as quote-unquote emergency, but if you look at the economics and balance it out, in many use cases it could be the primary type of retrieval, depending on your use case. The opposite as well: customers said, "I want to rip through a hundred petabytes of data and do large-scale analysis, or machine learning, or transcode in the broadcast industry from one bitrate to another." So at the same time we introduced expedited (super fast), we introduced a super slow bulk retrieval option, which lets you get your data back in five to twelve hours. Why would you want that? It's a lot cheaper: a quarter of a cent per gig, so you can retrieve an entire petabyte for a little over two grand. So we're trying to match your retrieval SLA and retrieval requirements to your cost with these different options.
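As a sketch (not from the talk) of how those retrieval tiers are selected in practice with boto3: the `Tier` field picks the trade-off just described — "Expedited" (roughly 1–5 minutes), "Standard" (roughly 3–5 hours), or "Bulk" (roughly 5–12 hours, cheapest). The bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Ask for an archived object back from the Glacier storage class.
s3.restore_object(
    Bucket="example-archive-bucket",
    Key="masters/feature-film-uhd.mxf",
    RestoreRequest={
        "Days": 2,  # keep the temporary restored copy around for two days
        "GlacierJobParameters": {"Tier": "Expedited"},  # or "Standard" / "Bulk"
    },
)

# Poll for completion: the Restore header flips to ongoing-request="false"
# once the temporary copy is available for GETs.
head = s3.head_object(Bucket="example-archive-bucket",
                      Key="masters/feature-film-uhd.mxf")
print(head.get("Restore"))
```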
Make sense? Okay, cool. So FINRA — I think you've seen them before — is a great example of pulling all of this together in a great use case, using us for a data lake. A data lake, for those not familiar with the term, essentially takes all of your different silos of data — data that might be stored in a number of Hadoop clusters on premises, new data coming in, log files, different types of raw data. What FINRA was running on premises was a number of different silos, and they were analyzing about seventy-five billion events per day. They moved that entire solution from on-premises into AWS. At a high level — we could go into the weeds, but this is not a big data and data lake deep dive session — the takeaway is: when you have multiple silos of data and multiple analytics tools accessing that data, what happens? First, you have to make three copies of each, and if you have multiple Hadoop clusters you get a multiplier effect, with that data multiplied again and again into multiple data sets. When you consolidate that into a single lake of data, you immediately get the benefit of lower storage cost, because you reduce it by a factor of three or more. The other benefit is the efficiency of sharing that data. Typically in on-premises big data and analytics environments you have a one-to-one setup: one tool accessing one data set. With AWS, you store the data in S3, spin up, say, an EMR cluster or any other analytics service, pull the data out of S3 into it — that might be EMR or Redshift depending on your particular use case — process the data, and dump it back out to S3. That's a very typical pattern: customers get the low cost and high durability of S3, then use analytics on demand, shared across multiple analytics instances and application types. So consolidation, sharing, and then scale — those are the real takeaways, and there's a whole case study on FINRA and the efficiencies that we'd be happy to share with you. We have a lot of customers now looking at doing exactly that, either building a data lake from scratch — log files are a great example where customers are beginning to do that — or migrating existing on-premises siloed analytics. That's the data lake example, and it's one of the big growth areas we see for S3, S3 Infrequent Access, and even Glacier, because you can take the results of that analytics, dump them down to Glacier, and keep them long term at a very low cost in a highly durable tier as well.

DigitalGlobe — Jay Littlepage, I think, has been speaking here; he and DigitalGlobe have been great partners — reinvented their business, so a lot of this is also about digital transformation. Similar to Sony DADC in media, DigitalGlobe shoots a lot of satellites up into the sky and takes a lot of images, and it's an interesting, similar analogy: instead of movie studios or broadcasters being the requesters, their customers are other government organizations, both domestic and international, as well as some commercial organizations.
So, for example, somebody in France might request, "I want to see what happened with ozone depletion over Paris over the last 20 years," and get a time-series analysis of that. They'll put that request into DigitalGlobe, and what DigitalGlobe used to do is go to their tapes, maybe go off-site, have somebody five to ten days later package it up, stick it on a hard drive, and ship it to France. Instead, they automated that whole thing: they moved 50 petabytes of their image library into AWS with a Snowmobile and automated the entire workflow. The expression Jay Littlepage and the team at DigitalGlobe used was that taking that data off of tape allowed them to free their imprisoned digital assets. If you think about it, the further away the data is — especially if it's off-site in a salt mine somewhere — the higher the friction of getting that data back and actually using it. It's certainly possible, but the friction is high. As you move it and make it instantly available, all of a sudden new workloads and new use cases become possible. So now there are all kinds of new things DigitalGlobe is thinking about in terms of workflow, partnering, and distribution of that data, now that it's constantly and instantly available. Keep that in mind: it's not just about the pure cost savings up front — certainly there was that, versus tape, plus the efficiencies in getting the data — it's also about digital transformation. I have a lot of conversations with customers: don't just think about taking what you have and moving it, but also what new use cases might come out of it as you move into the cloud. So those are the storage classes and examples.

Now let's manage the data. I mentioned earlier that one of the new capabilities is the ability to look at a heat map. This is a good example of a customer's S3 data and how much of it is being accessed in any particular month. On the screen you have roughly October 30th to about the end of November — a time series of how much data was retrieved as a percentage of the total storage. You can see data retrieved of around two petabytes against about three petabytes of total size. Based on this we run an analysis — usually we recommend about a 30-day analysis — and it generates a heat map, and we have a capability called lifecycle management that says: after X days, move the data from S3 to S3 Infrequent Access; after Y days, move it from S3 Infrequent Access into Glacier; or after Y days, purge the data — delete it — if it's just scratch data. What this tool, S3 analytics, lets you do is automate that: it's a recommendation engine for your lifecycle management. In this case we looked at how much of the data is infrequently accessed, analyzed over the past 120 days. You can see a lot of it was accessed within 30, 45 days and so on, but there's a huge drop-off between 90 and 180 days, so we come up with a recommendation of what lifecycle policy to set.
Now, you can use our tool, but the cool thing — especially if you have large numbers of objects — is that you can export this into any of our tools or third-party analytics tools. Excel is a really popular choice; pivot tables, I love them and I think many people do, so yes, we support that. You can export it into really any format, including our own QuickSight, but also third-party tools, for that analysis.

Once you have this data, you can plug it into what we call data lifecycle management, which is another capability we have. Lifecycle management lets you set policies to move data from S3 to S3 Infrequent Access and then to Glacier, and you set this by a number of days from the creation date. You can also match on different attributes. You can match by bucket — a fundamental concept of S3 is the bucket of objects — so you can say this policy applies to this entire bucket. You can use a prefix, meaning any object or key name that starts with X gets this policy; everything starting with one prefix gets one policy, everything starting with another gets a different policy. The third thing — and I haven't talked about it yet, but we have a new capability called tagging — is that you can drive lifecycle management off an object-level tag, which is a new capability I'll talk about in a minute. So you can say: this data set is highly used by a certain department, so we want to keep it on S3 forever; but this other one we know is used for about 10 days, so that department wants to move it down very quickly to S3 Infrequent Access, or even delete it after 30 days. Based on tags, or any of these other attributes, you set these lifecycle policies.

The way you do it is very simple; you can do it programmatically or through the console. Here we show an example: you create a rule and say, by prefix — and then by a tag, which you can enter there. You can say take the current version; we also offer versioning, and we'll talk about that — if you turn on versioning and you delete an object and say "whoops," you can retrieve the previous version, so think of it as a recycle bin — and with lifecycle you can act on just the current version or on previous versions as well. Then you simply walk through it and say: after the object was created plus X days, move it to S3 Infrequent Access — in this case 90 days — and after one year move it to Glacier. Really simple to use lifecycle management within AWS S3 to get those cost efficiencies. You can also configure expiration. One thing that's not necessarily obvious: I tend to think of lifecycle management as just tiering, but another part of lifecycle management is deletion — purging or expiration of the data. In this case we say that 10 years after object creation we delete the current version as well as the previous versions of the object. Again, very simple and straightforward to do either via the CLI or via the console.
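As a rough sketch (not from the talk) of the same rule expressed programmatically with boto3 rather than in the console; the bucket name, prefix, and day counts are placeholders mirroring the walk-through above.

```python
import boto3

s3 = boto3.client("s3")

# One rule: objects under a prefix transition to Infrequent Access at 90 days,
# to Glacier at one year, and current plus previous versions expire after ten years.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire",
                "Filter": {"Prefix": "projects/abc/"},
                # A tag-based filter would look like:
                # "Filter": {"Tag": {"Key": "retention", "Value": "cold"}},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 3650},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 3650},
            }
        ]
    },
)
```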
Speaking of versioning: versioning protects against user deletes and application logic failures. People think a lot about malicious deletion, but the reality is we've had a lot of cases where an application accidentally deleted, say, five thousand objects — we had one case of about a million objects. Luckily that customer had versioning turned on, because when you delete with versioning you can go back to the previous version, and they were able to restore those objects. So it's a really good prevention mechanism for accidental or malicious deletion. If you never delete anything, it doesn't cost you anything extra, so as a best practice we always recommend you turn it on. If you do delete, you'll want to keep multiple versions, and certainly there's a cost to store each version, but if your application doesn't intentionally delete — if there's no deletion mechanism designed in for certain buckets — then great, it's basically a way to prevent fat fingers, from either an application or a user, malicious or not. So that's one big tip: turn on versioning whenever you can to prevent accidental deletion. That prevents a lot of egg on the face and a lot of calls to support.

Another best practice: notifications. This is pretty common — we introduced and expanded event notifications last year. In S3 parlance, a put is a write and a delete is a delete, and any time one of these actions is performed against S3 we can trigger an event notification. What does that mean? We can do cool things. The most common use case out there: you put an image up and it triggers a Lambda function. Everybody know Lambda, our serverless compute? You don't have to stand up an EC2 instance; we trigger compute on demand, it runs a function, and it goes away. We can trigger a Lambda function to do, for example, a transcode — a very typical use case: drop in a photo and transcode it to a thumbnail, or take a video and transcode it to a lower-bitrate version. You can also trigger messaging: get notified whenever certain objects have been updated, which can trigger an SQS or SNS message. Lambda is very powerful here because you can build a whole library of functions that are triggered via these events automatically, rather than writing a whole process around it and standing it up on EC2.
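As a sketch (not from the talk) of what wiring that up might look like with boto3: this fires a Lambda function (for example, a thumbnail or transcode job) whenever a new object lands under a prefix. The bucket name, prefix, and function ARN are placeholders, and the Lambda function would already need a resource policy allowing s3.amazonaws.com to invoke it.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="example-upload-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "thumbnail-on-upload",
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:make-thumbnail",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "prefix", "Value": "images/"}]
                    }
                },
            }
        ]
        # SQS/SNS targets use QueueConfigurations / TopicConfigurations instead.
    },
)
```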
Another best practice is cross-region replication. Many customers have a requirement where, for compliance — I used to work a lot with some of the agencies — they're mandated to keep data on both the East Coast and the West Coast. More often, though, what I see is that the users or owners of an application require lower-latency access to the data. In these cases you want to move all or part of that data from one AWS region to another, and we have something called cross-region replication: you just turn it on for any bucket and we replicate the data from that region to another region in the background, automatically. You can do the entire bucket or a prefix, and it's a one-to-one replication between any two regions. We also have some new capabilities coming that will enhance cross-region replication; we'd be happy to talk with you about those after the session. So that's the automation category. Versus a black box that's just storage where you do the heavy lifting, we have a number of solutions to make management easier — whether it's copying data from one region to another, lifecycle policies, events instead of you standing up instances to poll, and, with versioning, easily recovering from accidental deletions.

I know a lot of you are using S3 and some of you might be power users, so that's where this next section comes in: performance. Performance and management go hand in hand. One of the first things: we talked about data having gravity, so how do you get data in faster? We'll cover a couple of other ways in a bit, but one of the primary capabilities of the S3 platform is something called multipart upload. I used to work in networking and WAN optimization, so here's the analogy for any of you network propellerheads: do you ever want to open just one flow over a high-latency link? What happens? It goes really slowly, especially as latency increases, even if you have a really fat pipe. Likewise, S3 is a massively parallel system, with many, many compute nodes fronting the storage on the back end, but if you're uploading one object as a single stream it's basically going to hit one front-end node, and you won't take advantage of the parallelization and massive scale of the system. So the big tip, if you're not using it today — and a lot of the SDKs just incorporate this — is please use multipart upload, especially for larger objects. We'll chunk the object up and upload those parts in parallel. The other benefit is on the wire: especially for high-latency links, you'll be able to saturate the link much more than if you upload one large file as a single stream.

And what goes up comes down — probably more common for you; in many use cases it's 98% down and only a little up. You can also get faster downloads by using, again, parallelization. Parallelization is your friend — I have to repeat that, because you could say parallelization would be paralyzing, but parallelization is your friend. You can parallelize your gets as well as your puts, and the way you do that is with something called range gets. That lets you fetch pieces of an object: if you're doing it from EC2, you can set up multiple instances or multiple threads and each one grabs a part of the object, so you're parallelizing the retrieval. That gives much faster throughput versus the pig-through-the-snake scenario, where you have one big object and you're trying to squeeze it through a single connection — this lets you spread it out.
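As a sketch (not from the talk) of both tips with boto3: the SDK's transfer manager splits a file into parts and moves them concurrently, and a raw range GET shows fetching one byte range yourself. File names, bucket names, part sizes, and thread counts are placeholders.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart, parallel transfers: parts are uploaded/downloaded concurrently,
# which is what lets you fill a fat, high-latency pipe.
cfg = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,   # 64 MB parts
    max_concurrency=10,                     # 10 parallel threads
)

s3.upload_file("big-seismic.dat", "example-bucket", "raw/big-seismic.dat",
               Config=cfg)
s3.download_file("example-bucket", "raw/big-seismic.dat",
                 "big-seismic.copy", Config=cfg)

# A single range GET, if you want to manage the byte ranges yourself:
part = s3.get_object(Bucket="example-bucket", Key="raw/big-seismic.dat",
                     Range="bytes=0-8388607")  # first 8 MB
chunk = part["Body"].read()
```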
Also, a big tip: use CloudFront, especially if you're accessing the data from outside AWS, because we cache the objects at the edge. The user doesn't have to go all the way back to S3 — they can access it at the edge — and it also saves you cost, because you don't pay for that retrieval from S3 again.

This next one is probably one of the most common things I see, especially with customers who have been using AWS for a while. You're coming from a file system and you have a naming convention: say an account name or number plus a date, and the date increments by one for the next object, and the next, and the next. In a file system that's fine; it really doesn't matter. In S3 it matters a lot, because under the covers — getting into the innards of S3 a little — we essentially segment the data out into partitions. If you name things that way, you put all of the energy on one partition, one part of what's called the S3 index, and then you say, "S3 isn't performing, I'm only getting a few hundred transactions per second (TPS), it really sucks." Back to parallelization being your friend: with sequential names you're hitting a hot spot, touching one area of this massive infrastructure, and not taking advantage of the scale. So we recommend the opposite: create as much randomness as you can in that key name and prefix, especially at the beginning. Just four or five characters will actually create enough entropy, and that spreads the load across the S3 index infrastructure so it's serviced by a number of areas within the index and you don't create a hot spot. This is really important for this audience: think about it when you're architecting up front. If you do it the sequential way across all your key names and then call and say, "geez, we're only getting so much throughput," it's really hard sometimes to go back and fix it. But if you're architecting up front — a file server migration or a new application — it's very simple to create some entropy and keep a conversion table, or just flip it around with a reverse timestamp. It's much harder to do later, combing through perhaps billions of objects, so think about it up front.

Summarizing performance — there's a lot more, but these are some of the basics: optimize uploads and downloads with multipart and range gets, and distribute the key names. We didn't talk about SQL query on S3 with Athena; that's another capability we've introduced. You can dump all of your data in its native format and run SQL queries directly on the data within S3 without moving it. That's less a performance gain than a cost and time saving, since you don't have to move the data around — you query the data where it resides.
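As a sketch (not from the talk) of the key-entropy tip: one common approach is to prepend a few characters of a hash of the logical name, which spreads keys across index partitions while staying reproducible. The function name, ID scheme, and hash length are illustrative assumptions.

```python
import hashlib

def randomized_key(customer_id: str, date: str, filename: str) -> str:
    """Prepend a short hash so keys spread across S3 index partitions instead of
    piling up on one sequential prefix. Because the hash is derived from the
    logical name, you can recompute it rather than keep a separate lookup table."""
    logical = f"{customer_id}/{date}/{filename}"
    entropy = hashlib.md5(logical.encode()).hexdigest()[:4]  # 4-5 chars is enough
    return f"{entropy}/{logical}"

print(randomized_key("acct-1001", "2017-06-19", "trades.csv.gz"))
# -> something like "3f7a/acct-1001/2017-06-19/trades.csv.gz" (hash prefix varies)
```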
Then for networking there are some good best practices out there — especially for mobile applications: if any of you are developing mobile apps, it's worth tuning the TCP stack for lossy connections, and for really long fat networks, increasing the window size. Those are standard networking things, but they're common places where you can improve. If I had to pick the two things out of this whole list to remember, they would be: use multipart upload, and use proper prefix design. Those two will save your life and improve your performance as you're deploying on S3.

So I've talked about performance and automation — what about tools to manage your storage? We touched on lifecycle management and storage analytics, but we introduced a whole set of new visibility and management capabilities at the last re:Invent. One of those is object tags. When I first heard about object tags I thought, well, this is interesting, but as I started talking to customers it became really cool, because if you think about your data, what you had before were — let's call them sticks and rocks — crude tools to categorize it: you had prefixes and you had buckets. With object tags — specifically, what we introduced are mutable object-level tags, up to ten per object — you now have an almost unlimited way to categorize your data. Example one: you have highly compliant data and you only want a certain set of internal constituents to access it. You tag it as highly classified or highly compliant, and you can write that tag at the time of object creation or any time afterward. That's the big difference with these object-level tags: before, in S3, once you wrote the data, if you wanted to change that kind of metadata you couldn't — you had to copy and rewrite the object. Now, if you write an object and want to add or update a tag later, you can, because the tags are mutable. That gives you a lot of flexibility, and it enables a few things. One is access control: you can take a security policy and say, with an IAM policy, I only want this set of internal constituents to access this data and nobody else — it interfaces with IAM. Second, lifecycle: you can say, this is really hot big-data data, I don't want it accidentally moved to IA or Glacier after 30 days and have to pull it back — it should always stay in S3 — so you apply a tag that says hot and keep it in S3 forever, or give it no lifecycle policy at all. And the third is storage metrics and analytics. Before, you could do analytics and metrics based on prefix or bucket; now you can do it on tags as well. For reporting: many of you are probably internal service organizations serving other organizations — you might have 10, 50, 100 of them — and you can tag the objects and then very easily run reports and get metrics by those tags, at a much higher level of granularity than you could before with just buckets or prefixes. So that's tagging.
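As a sketch (not from the talk) of tagging an object at creation time and then changing the tags later without rewriting the object, with boto3; the bucket, key, and tag names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Tags written at upload time (URL-encoded key=value pairs).
s3.put_object(
    Bucket="example-bucket",
    Key="reports/q2.parquet",
    Body=b"...",
    Tagging="dept=finance&classification=confidential",
)

# Tags added or changed later, in place — note this call replaces the full tag set.
s3.put_object_tagging(
    Bucket="example-bucket",
    Key="reports/q2.parquet",
    Tagging={"TagSet": [
        {"Key": "dept", "Value": "finance"},
        {"Key": "retention", "Value": "hot"},
    ]},
)

print(s3.get_object_tagging(Bucket="example-bucket",
                            Key="reports/q2.parquet")["TagSet"])
```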
Here's an example of setting user permissions by tag: on a resource — the example bucket — you allow access for this particular tag and that particular tag. Another kind of policy is the opposite of access. A lot of what I hear, especially from financial institutions and regulated industries, is "I really want to ensure that objects don't get deleted." There are a number of best practices for that. One of the most common things we recommend is versioning, because if you version the data and somebody deletes it, you can get it back. Then we have a capability called MFA Delete — multi-factor authentication delete. Versioning combined with MFA Delete means you have to have a device, the device has a number on it, and you have to supply that number to be able to delete the object. Those two things together provide a very high level of certainty that a random user isn't going to delete the data, maliciously or not. It's a best practice I discuss a lot with financial institutions who want to make sure the data stays there, and a lot of government institutions have a compliance requirement to prove they have policies in place to prevent deletion or modification. So it's a good practice to think about as you deploy on S3.

The other piece of compliance: every time someone touches or manipulates an object, you want to be able to monitor that. So we also introduced AWS CloudTrail object-level data events: you create a trail, and it records every single activity performed at a per-object level — object-level requests, enabled at the bucket level — and delivers logs of that activity. So in addition to preventing deletion, if you decided for some reason not to prevent it and somebody did delete an object, you'd have an audit trail of that — and, more importantly, an audit trail of anybody who has touched or accessed an object. Very important for some customers.

So we've covered visibility, monitoring, metrics, and CloudTrail for compliance. Another thing I heard from customers was, "all of a sudden I'm getting latency spikes," or "hey, I'm getting throttled — what's going on with S3?" — and it was a black box. So through CloudWatch we introduced a whole slew of S3 metrics that let you understand, at a very granular level, what's going on: how many GET requests, the different request types — HEAD, POST, and so on — how many of those over a period of time, bytes downloaded, and then the ones that are really interesting: throttling and latency. Here's an example of total request latency plotted over time. You can generate reports from these CloudWatch metrics or export them, so again it provides a lot of visibility.
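As a sketch (not from the talk): request metrics for S3 are opt-in per bucket, and once enabled they surface in CloudWatch under the AWS/S3 namespace. The bucket name and configuration ID are placeholders, and the metric query is just one example of what you might pull back.

```python
from datetime import datetime, timedelta
import boto3

s3 = boto3.client("s3")

# Enable request metrics (GETs, 4xx/5xx errors, first-byte and total latency)
# for the whole bucket; configurations can also be scoped to a prefix or a tag.
s3.put_bucket_metrics_configuration(
    Bucket="example-bucket",
    Id="EntireBucket",
    MetricsConfiguration={"Id": "EntireBucket"},
)

# Read one of the resulting metrics back from CloudWatch.
cw = boto3.client("cloudwatch")
stats = cw.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="TotalRequestLatency",
    Dimensions=[
        {"Name": "BucketName", "Value": "example-bucket"},
        {"Name": "FilterId", "Value": "EntireBucket"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
print(stats["Datapoints"])
```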
How many of you have more than, say, a million objects in S3 today — that you can share? Okay, you don't have to say who you are. If you want to do a LIST on those — a listing of all the objects in your bucket — that can take a long time when you get into millions or billions or trillions of objects, as some of our customers have. What they told us is, "we'd love to be able to do this, but we'd like to get it back within a day or a week or a month, not run it continuously." So what S3 Inventory does is use the power of S3 in the background to create an inventory of a particular bucket and deliver a daily or weekly report, for customers who have many, many objects. It's a really useful capability because you can then compute deltas and perform actions on objects that have changed, and it's a lot less lift for you. If you do the math, especially with billions — or even millions — of objects, it's actually a lot less expensive to run an inventory than to do a LIST.

So, pulling it all together: we have a company called Alert Logic, in the security monitoring space, with a SaaS solution that runs on AWS. They already had some data on AWS, but they migrated off their on-premises NAS solution — multiple petabytes, growing at a couple of petabytes per month, 4,000 customers. They wanted to migrate off their file system but take advantage of all the capabilities of S3, so they needed to use proper prefix design, which means randomness in the keys. The problem was that then they couldn't use lifecycle management, because lifecycle management matches on the prefix — and if the prefix is random, how do you move different objects into different tiers when you can't match on it? So they used tagging to tag objects for each of their 4,000 customers, and now they can use lifecycle management and still get the performance. They're using cross-region replication, they're tiering down to Glacier, and they're retrieving with our retrieval tiers. I kid you not — we came out with these features and they used every single one, day one. They're a really great customer and a great partner, and they exemplify how you can pull together all the different capabilities we just talked about.

So the final thing: we've talked about how to manage the data — how do you get it in in the first place? Data has a lot of gravity, as we said: 100 petabytes over a 1 gigabit per second Direct Connect is about 27 years. I'd be dead by then. For moving data into S3 we have a lot of different online and batch capabilities, and if you're in the more-than-10-petabyte range we introduced AWS Snowmobile. Let me give you a couple of examples in the remaining minutes. If you have multiple locations constantly uploading lots of data online, we have something called S3 Transfer Acceleration. Rather than going over the internet end to end, you upload to a local edge location — I think we're at over 70 edge locations globally now — and our Route 53 DNS service finds the one local to you. Say you're in Singapore: instead of moving the data all the way over the internet to S3, it's uploaded locally and then rides our high-speed backbone the rest of the way. The efficiency gain ranges from not a lot, if you're going Singapore to Singapore, to a lot, if you're going to the other side of the world. The takeaway: for high-latency connections, with lots of people uploading lots of files to one central location, S3 Transfer Acceleration can help a lot.
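As a sketch (not from the talk) of turning Transfer Acceleration on and then uploading through the accelerate endpoint with boto3; the bucket and file names are placeholders.

```python
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Enable Transfer Acceleration on the bucket.
s3.put_bucket_accelerate_configuration(
    Bucket="example-global-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Point a client at the accelerate endpoint so uploads enter AWS at the nearest
# edge location and ride the backbone from there.
accelerated = boto3.client(
    "s3", config=Config(s3={"use_accelerate_endpoint": True})
)
accelerated.upload_file("field-data.tar", "example-global-bucket",
                        "uploads/field-data.tar")
```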
Snowball — I think you've seen it. Very simple: go into your console, connect the Snowball, copy the data to it securely, encrypted with your own key, and your data is moved and appears in S3. We introduced Snowball Edge, which takes that data-shuttle concept and improves it by adding compute and additional capacity. We had a 50-terabyte Snowball when we started, improved that to 80 terabytes, and the new Snowball Edge is 100 terabytes plus the compute equivalent of an EC2 m4.4xlarge instance. So beyond how Snowball was used before for data shuttling, you can now collect data, process it locally, do some pre-processing, and then add that, if you will, to the data shuttle. Oregon State University is a great example: anybody here — probably many of you — who has mobile, disconnected, or really poor connectivity can collect the data and do local analysis, like Oregon State University did on their ships, and then, when you're back to better connectivity, move it into S3 and AWS.

Storage Gateway is the last solution I want to talk about, bridging the gap if you have older applications — or even new applications — that you can't re-architect to be cloud- and S3-native. It's a virtual machine that runs on premises and has three modes: it presents an NFS file interface, an iSCSI volume mount point, or a VTL, meaning it can look like a tape device. You write to it on premises, it translates that, moves the data into S3, and can continue it on to Glacier. A pharmaceuticals customer is using the file mode — the coolest new mode we have — because you can write a file locally on premises; they have lab equipment, and the gateway translates the output of that equipment directly into an S3 object on the other side, where they can process the data with analytics tools. The takeaway is that if you want to keep writing data on premises and kick off a process to analyze that data in the cloud, the Storage Gateway file mode is one of the only solutions — ours or any third party's — that lets you do that and maintain that object transparency. VTL is another mode: it just plugs in — anybody have tape, anybody doing backup here? Okay — it looks like a tape device, and you can back up to the cloud. What Southern Oregon University did was back up to the cloud and also get a DR scenario, because the data wasn't just uploaded to S3, it was uploaded to an S3 bucket on the other side of the country — so they get DR and backup in one shot.

Finally, if we can't do it, there are third parties that can help you get data in. Whether it's primary storage — meaning tiering — backup, recovery, or archive, we have many, many partner storage solutions, and we just launched a new storage competency. That gives you, both for systems integrator partners and third-party ISVs, the trust and quality of knowing they've been vetted by AWS. So thank you — I know I went about a minute over — hopefully you learned one thing, and I appreciate the time. [Applause]
Info
Channel: Amazon Web Services
Views: 6,731
Keywords: AWS, Amazon Web Services, Cloud, cloud computing, AWS Cloud, 126393
Id: bfDpK45Faa0
Length: 52min 12sec (3132 seconds)
Published: Mon Jun 19 2017