AWS re:Invent 2014: AWS Innovation at Scale with James Hamilton (SPOT301)

Video Statistics and Information

Reddit Comments

crap- looks like the time stamp didn't take.

Fast forward to 34:39. You see a tiny photo of a rack with 864 disks weighing 2,350 lbs. 9 JBODs in each rack, so 96 drives each. If you had 3TB drives, that's about 2.5PB per rack.

👍︎︎ 19 👤︎︎ u/coffeesippingbastard 📅︎︎ Jun 25 2016 🗫︎ replies

absolutely amazing scale they got there!

👍︎︎ 3 👤︎︎ u/PulsedMedia 📅︎︎ Jun 25 2016 🗫︎ replies

I don't know what's more impressive: that this is probably outdated and probably several petabytes bigger today, or how relatively easy it is to get a 90-drive server.

👍︎︎ 3 👤︎︎ u/SarcasticOptimist 📅︎︎ Jun 25 2016 🗫︎ replies
Captions
Thanks for coming, and for those joining on the live stream, thanks for virtually being here. It's a fun time. Every time I come to re:Invent I just go, man, this is bigger than I expected, bigger than it was last time. Super exciting.

My goal for today's session is to convince you of a couple of things. One is that the cloud is fundamentally different, and I'm going to give you examples to show why I think it's fundamentally different. If we achieve my goals, what will happen is that every company represented by people in this room will make sure it has, at bare minimum, one production app in the cloud. The reason you've got to have it is that this is real, this is huge, this is the next decade of our industry, and you've got to be getting that experience now. For those of you in the room: good call. For those already running apps: even better. It's really important. If you're not personally involved in writing one of those apps at your company today, do it at home, do it yourself.

Funny story: at a company I worked for some time back, people were having trouble believing this S3 thing — it's impossible, it's so cheap, it couldn't happen. So I wrote a fairly substantial app against S3. I tested the heck out of it, because it got downloaded all over the company, all the way up to the CEO, and CEO demos never go well, so you test and test and test. It went okay. The funny thing is, I got a bill for three dollars and eleven cents. Whoa — this is different. This really is different. It is phenomenally different.

Big transitions do happen. Of the big transitions I've seen, I was lucky enough early on to be involved with the move from mainframes to UNIX servers; I was lead architect on DB2 when we ported it to UNIX. It was a wonderful time, because you get to go to customers and actually help them get better value — it's phenomenal. I got to do it again on SQL Server, helping customers get to x86 servers. These big transitions happen rarely; we were super lucky to have seen those two, and we're super lucky to be here for this one together. It's a big deal. If you can make the call on which of these transitions are real and actually going to matter in a big way, and you're right, then you're on the right side of that wave. It's phenomenal for your career — it's made a huge difference for mine — and it's just so much more fun to be on the growth side of these curves.

What's different about this one is the speed with which it's happening, and in my opinion that comes down to two things. One is great value: if a transition yields great value, it happens faster. Makes sense. Second, there are fewer blockers. In the other transitions you would have to buy a UNIX server, find a way to get Oracle installed on that server, build an app, and get it deployed. That takes months if you're smart and creative, and it might take years at slower companies. The difference is that in the cloud all that friction is gone: you don't have to install software, you don't have to get hardware. You can have Oracle up and running this evening if you choose that stack, and if you choose a different stack you can do that too. Those two things make this one really different.

A couple of metrics on growth, because everything I'm going to talk about today has a foundation in scale.
It's only possible because of scale, so let's get a couple of data points on scale. Amazon S3: 132 percent growth. EC2: 99 percent growth. The overall business: over a million customers. Why do you care about a million customers? If you're on a platform with a million customers, there's a huge ecosystem of providers running on that platform, and if you ask a question, other developers are solving the same problems. If you're piling onto the big ecosystem where everyone else is, it's easier to get your job done fast. What I'm going to show you today is that volume allows us to reinvest in the platform; the reason you're seeing such growth this week in the services we're offering is the support you've provided us over the last several years — thank you — because of that we're able to keep reinvesting deeply into the service and keep innovating.

One more data point: Gartner estimates that the other 14 providers in the cloud industry, in aggregate, have one-fifth the capacity of AWS. It's a pretty phenomenal delta when you think about it, and again, thank you for making that possible.

Let's look at one more data point on scaling. I happened to be in the room when we came up with this one, so I'll give you the background — I find it an amazing number. I met Rick Dalzell and Charlie Bell at the High Performance Transaction Systems workshop in 1999; Rick and Charlie ran amazon.com's infrastructure at the time. We had invited them — it's a small invitational conference held every two years — because the industry as a whole was blown away by the scale Amazon was running at. It was a big e-commerce system in the year 2000. So I was wondering: how often does AWS now bring that much capacity online? If it were every three or four weeks, that would be very notable. I was wrong on two dimensions. First, it's not every three or four weeks — it's every day. Second, it's not Amazon in the year 2000, when it was a three-billion-dollar company; it's Amazon in 2004, when it was a seven-billion-dollar company. Every day, AWS adds enough capacity to support all of Amazon when it was a seven-billion-dollar company.

Think about what that means. All of the component manufacturers have to be geared to our needs; the server and storage manufacturers have to produce the gear and push it into the logistics channel; it has to get from the logistics channel over to the right one of our data centers; it has to arrive at a loading dock; people have to be there to wheel the racks into the proper locations; there has to be power, cooling, and networking ready to go; the app stack has to be loaded up, tested, and released to customers — and then we have to do it again tomorrow. It's amazing. Even if we weren't innovating, even if there were no new services, I'd still want to tell you how we do this; I think it's interesting all by itself. What's changed in the last year? We've done it 365 more times. There's a lot of scale.

OK, I'm going to cover a couple of major areas of innovation, and here's why I've chosen these two. I've chosen networking because it's a problem: networking is a red-alert situation for us right now, and industry-wide there are big cost problems in networking. I'm going to tell you how we dealt with that red alert and what happened, because I think it's big work, I think it's notable, and I think it has great customer impact.
The second thing I'm going to talk about is databases, and the reason is that databases are hard. The database is the most likely reason you'll get woken up in the middle of the night, and the most likely reason your application may not be running. It's probably on the most expensive servers you have, and it's likely the most expensive application you're running. That's where all the interesting, hard issues are — if we didn't have state, we'd all be running huge-scale applications and it wouldn't be that hard. So those are the two areas I'm covering.

Let's look first at networking, because networking, as I said, is a red alert. For those who know me, you know I track, watch, and drive my work with metrics. I especially like financial models, and it sounds kind of boring, but the truth is a financial model is remarkably educational about what's really going on. This model — I won't go into all the details — proves wrong three or four of the most common beliefs in our industry. People speculate a lot; having numbers tells you what's really happening. Here's one that's interesting: look at it over two years and you see it fast, and that is problem number one. The cost of networking is escalating relative to the cost of all other equipment. It's anti-Moore: all of our other gear is going down in cost — we're dropping prices all the time — and networking is going the wrong way. That's a big problem, a super big problem. I like to look out a few years, and over time the size of the networking problem just keeps getting worse. We can't let that happen; we're not going to have the right story for you two years from now if we don't solve it.

The second problem — and it's a perfect storm — is that at the same time networking is going anti-Moore, the ratio of networking to compute is going up. Partly that's because there's more compute in a given server each generation, so more is flowing over the network; that makes sense. Partly it's that as the cost of computing falls, the amount of advanced data analytics being done goes up, and data analytics are networking-intensive: solving big, complex problems over many servers means a lot of network traffic. It's referred to in the industry as the east-west problem, as opposed to the north-south problem. So we've got two things — a perfect storm in networking. What do we do about it?

What we did is a little bit audacious; at least it felt that way when we started, nearly five years ago. We said: what if we built our own networking designs, had ODMs build the routers themselves, hired a team to build the protocol stack all the way to the top, and deployed it all ourselves in our network? At a lot of companies, people would get you a doctor and put you in a nice small room where you're safe and can't hurt anyone. It's a big deal. But the data says you kind of have to do something, so we did it, and today, if you're using our services, in every one of our data centers worldwide you're running on this gear.
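As a toy illustration of the kind of financial model described above — networking cost per server rising while server cost falls, so networking's share of total spend keeps growing — here is a minimal sketch; every number in it is assumed for illustration and is not from the talk:

# Toy cost model, illustrative only: assumed per-server costs and assumed yearly trends.
server_cost = 100.0                          # assumed relative cost of a server
network_cost = 15.0                          # assumed networking cost attributed to that server
server_trend, network_trend = 0.85, 1.05     # assumed yearly cost multipliers

for year in range(6):
    share = network_cost / (server_cost + network_cost)
    print(f"year {year}: networking is {share:.0%} of per-server infrastructure cost")
    server_cost *= server_trend
    network_cost *= network_trend

Under these made-up trends the networking share climbs every year, which is the "anti-Moore" shape the talk is warning about.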
What happened? Well, the first thing we learned from doing this won't surprise you at all: it's a lot cheaper. No surprise — just the support contract for the networking gear had been running tens of millions of dollars, so this is great value. But let me surprise you, because it at least surprised me: the availability went up. How could putting together commodity networking gear be better? We're professionals, we're reasonably good at what we do, and we try hard — but why would the availability go up? How is that possible? Well, dig below the surface and you understand what's going on. The way the networking world works is that enterprise customers give lots of complicated requirements to networking equipment producers, who aggregate all of those complicated requirements into tens of millions of lines of code that can't be maintained, and that's what gets delivered. We don't use all that stuff. You don't use it, we don't use it, nobody uses all of it — but in aggregate it's all in there, and it's hard to get right. The answer to why our gear is more reliable is that we didn't take on as hard a problem, and that's OK — taking the easier path that wins is a good way to win. We took on an easier problem, and it's more reliable. No surprise, if you think about it.

Another one: we love metrics, we love to measure everything, and we have rules that say if a customer ever has a bad experience using our systems, our metrics have to show it. Think about that: if a customer has a bad day, your metrics have to show it. We are very religious about that, and if a customer ever has a complaint while our metrics look fine, you're going to have a discussion with the senior level of the management team — that will happen for sure. Once you have metrics that accurately measure how customers experience your systems, you can set goals against them and relentlessly drive them down every week. Do we wait until release 17.3.2 that comes out every 18 months? Heck no — we look at this every week and we crank the code very frequently, so even if it didn't start off better, it gets better, and it improves faster.

The final thing, which I think is super important, is our ability to test — and it proves how the cloud works. The way all the networking providers I've worked with in past years do it is they buy data centers, install servers, and test that way. Think about it: how often do you release new network gear? Not that often, so that approach is inefficient — it's hard to justify buying big data centers for it. I was nervous, because I can't afford to have this system not work extremely well. So we took three megawatts of capacity — 8,000 servers — as a test environment. That's probably worth $40 million of gear; nobody tests at that scale, so another reason ours is better is that we test it more. And what did that cost? Technically it costs $40 million — but not in the cloud. We rented it. We used it for a couple of months, it cost a couple hundred thousand dollars, the job's done, and we're back to work again. So that's what we're up to.
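Back-of-envelope only, with assumed prices (nothing below comes from the talk beyond the 8,000-server and $40M figures): the rent-versus-buy comparison looks roughly like this, if the fleet is spot-priced and only runs for a few hundred hours of actual testing:

# Assumed numbers for illustration.
servers = 8_000
assumed_price_per_server_hour = 0.05   # $/hour, assumed blended spot-style price
test_hours = 500                       # assumed hours the full fleet actually ran

rental_cost = servers * assumed_price_per_server_hour * test_hours
print(f"Rented test fleet: ~${rental_cost:,.0f} versus ~$40,000,000 to buy the gear outright")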
Let me take you through a little picture of our network, from the very top all the way down to the network interface card. We'll walk through it level by level, look at how the system works, and I'll point out a few things that may be a little unusual.

OK, the worldwide backbone. There are eleven regions worldwide in AWS. You choose regions to get close to your users and customers, or to meet jurisdictional restrictions; having eleven regions is a real asset. The second thing is that we use private links to hook up most of our major regions. Why do we do that? Well, on the public internet, somewhere, somebody is invariably fighting with somebody else over who pays whom for peering and there isn't enough capacity; somebody is not the world's best at capacity planning and has run out; somebody bought a wonderful big smoking piece of gear with huge buffers and it's buffering. Whatever the cause, it's just slower, and no matter what you do it's going to be slower. So the first reason we run a private network is that it's simply faster. The second reason is that we actually can do capacity planning — we do make mistakes, but not that frequently, and we're always trying to do better. What you end up with is a more reliable link, a cheaper link, and a lower-latency link. It's just a happier place to be, so that's where we've chosen to be.

Let's dig down and look at a region. I've selected US East, which is a very large region, and pulled it out: what does US East look like? All of our regions have at least two Availability Zones; US East happens to have five. A couple of things you'll pick out here. The way we wire up our facilities is different from most: instead of a single data center being the region, Availability Zones make up the region — I'll go into detail on why that is. The second unusual thing is that we have separate transit centers, two of them, completely redundant. Those transit centers connect to private connections to customers, to paid and unpaid peering, and to paid transit; that's where our connections to the rest of the world come in. If you lose one of them, it doesn't matter. If one of the AZs ever has a fault of any sort, all the rest of the AZs keep working; if one of the links goes down, it doesn't affect everything else, because we have redundant paths all over the place. The paths I'm showing you are real — that's exactly the way it is — and the reason not every possible line is drawn is simply that that's the way it is. There's a lot of redundancy in these systems, and the links don't run through the same physical spots, so the ditch digger, although they will come, will not come for both at the same time. That's our approach to regions.

Let's look a little more closely at what's in there. The first number I chose is one I think is amazing: there are 82,000 fibers in there — just a phenomenal number. The AZs themselves are less than two milliseconds apart, and mostly less than one millisecond apart, so from a latency perspective they're very close. But if you play around with the speed of light, you'll see that's actually multiple kilometers of separation — quite a distance apart from a safety perspective, fairly close from a latency perspective. Fortunately the speed of light, which gets in our way all the time, isn't that bad here.
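For reference, the region-and-AZ structure described here is visible directly through the EC2 API. A minimal boto3 sketch is below; the talk's figure of eleven regions is from 2014, so the counts returned today will be larger:

import boto3

# List the regions visible to this account, then the Availability Zones in each.
ec2 = boto3.client("ec2", region_name="us-east-1")
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]
print(len(regions), "regions:", regions)

for region in regions:
    zones = boto3.client("ec2", region_name=region).describe_availability_zones()
    print(region, "->", [z["ZoneName"] for z in zones["AvailabilityZones"]])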
So we've got 25 terabits per second of inter-AZ traffic. That's not the traffic inside a data center, and it's not the traffic inside an AZ — that's the traffic between AZs. A wacko number; if you'd told me that five years ago I wouldn't have believed it.

Why do we have AZs at all? At first glance you might think it doesn't make sense: most customers today don't use AZs, most of our competitors don't offer them, and AZs cost money — they're big DWDM (dense wavelength division multiplexing) metro-area networks, they're complicated, and they're an additional layer you have to justify. Here's the justification. The way most customers work — in fact, every customer I've ever talked to except Amazon works this way — is that an application runs in a single data center, and you work as hard as you can to make that data center as reliable as you can. In the end you learn that about three nines is what you're statistically likely to get, over a large number of applications and a long period of time. You might do a tad better or a tad worse, but rough numbers, it's a three-nines problem. As soon as you've got a high-reliability app, you run it in two data centers; everyone knows that. And the way that's done in our industry is that those two data centers are geographically widely dispersed, because it would be insane to put two data centers side by side and call that redundancy — you want some space between them.

Because they're a long way apart, the round-trip time is fairly long: New York to Los Angeles is about 70 milliseconds, while committing to an SSD is about one to two milliseconds. You cannot wait 70 milliseconds for a transaction to commit, so we know one thing right up front: it's not synchronous. It can't be. Nobody runs synchronous replication over that distance, because nobody can wait that long. So the way it works is you commit to a single data center and then, absolutely as fast as possible, you push the changes to the other one. That works pretty well, but what do we know about it for sure? It loses data. It doesn't lose data in the non-failure case, but in the failure case, if you fail over between those two facilities, you do lose data. Now, you don't really lose it — every company has audit logs and lots of other tracks — but it will take a week to get the system back to correct again. What does that mean? You don't fail over unless you absolutely have to. So this design works really well at maintaining availability when something very rare and very bad happens: if a data center bursts into flames and burns to the ground, gets hit by a tornado, gets run over by a truck — if it's gone, this system works splendidly. It's excellent protection against a rare problem, and we like it and use it for that purpose.

However, what happens far more frequently is that somebody makes an operational error, someone's app goes wrong, a load balancer gets sick — something like that. This two-data-center design cannot solve that problem, because when something goes wrong and you've been down for three minutes and you think you'll only be down for ten, do you fail over? Heck no — it's going to be a week's worth of work, it's going to look bad to customers, it's going to cause an interruption. You're thankful you can do it if the building burns, but you do not do it unless you absolutely have to. So what happens? You lose availability, and for all those common events — which happen not all the time, but a few times a year — you have no protection.
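The replication argument here is mostly arithmetic. A quick sketch with rough, assumed figures (light in fiber at about two-thirds of c, the great-circle New York-to-Los Angeles distance, and an AZ spacing of a couple of kilometers) contrasts the cross-country case with the nearby-AZ case described next:

# All figures approximate and assumed for illustration.
KM_PER_MS_IN_FIBER = 200           # light in glass travels roughly 200 km per millisecond
ny_la_km = 3_940                   # great-circle distance; real fiber paths are longer
az_spacing_km = 2                  # order-of-magnitude spacing between nearby AZs
ssd_commit_ms = 1.5                # typical local SSD commit, per the 1-2 ms in the talk

rtt_ny_la_ms = 2 * ny_la_km / KM_PER_MS_IN_FIBER     # ~39 ms floor; ~70 ms observed in practice
rtt_az_ms = 2 * az_spacing_km / KM_PER_MS_IN_FIBER   # ~0.02 ms

print(f"Cross-country RTT floor: {rtt_ny_la_ms:.0f} ms  vs  SSD commit: {ssd_commit_ms} ms")
print(f"Nearby-AZ RTT: {rtt_az_ms:.2f} ms -> synchronous commit across AZs is affordable")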
That's what AZs are all about. In an AZ model, you've got data centers that are a millisecond or two apart, so you can commit to both at the same time, synchronously. That is empowering. It means you can commit to both at any moment; if you virtually pull the plug on a data center, the app keeps running. If you thought there was something wrong with a data center and you failed over and you were wrong, customers can't tell you did it. If you test it, decide you were wrong, and fail back, customers can't tell. It's invisible. That's why I fell in love with amazon.com back in 2000: they described this system to me and I thought, that's phenomenal — it's game-changing for how applications run. Every customer of AWS has the option to run in this model. It's harder to write code to this model than not to, so you won't do it for every app. And there are apps where you're worried that an airplane could touch down, destroy a data center, bounce — very unlucky — land on a second data center and wipe it out too. That's a bad event, and for high-value apps you want to be protected against it, so for those you can run cross-region replication, and we have that capability. Every competitor has that capability; every customer has that capability. But the AZ capability is the one that's fundamentally different. I think it's special, we work very hard on it, and there are some costs associated with providing the option.

Let's keep diving in: what's inside an Availability Zone? Years ago people speculated that an Availability Zone isn't really a data center — that it's probably just two racks in the same data center. Not so much. An Availability Zone is 100 percent, always, without exception, a completely independent building. It is a different data center. So 28 Availability Zones tells you we have 28-plus data centers, and the plus is a good-sized number. Why would we have more than one data center in an Availability Zone, given that we promised the AZs would have independent failure modes? Those data centers do have independent failure modes, so why not just have more AZs? It turns out customers don't want them. If you have two Availability Zones, it's great — and we always have at least two. If you have three, it's even better; four is fine; at five I'm not exactly sure what to do with them anymore — it's more than you need. What customers really want is to know that if they deploy an app in Availability Zone 16, we're not going to run out of capacity in that zone and tell them to go somewhere else; they want to keep adding to their app. So we don't want to run out of Availability Zone capacity, and that means we add data centers. We're taking on the problem of making sure your virtual data center doesn't run out of space, and that's why, believe it or not, we have some Availability Zones with six data centers — and these are fairly substantial data centers as well.
OK, let's have a look at one of those data centers; now we're drilling down inside an Availability Zone to a single data center. How big is a data center? It's pretty big — not as big as it could be, but 25 to 30 megawatts, rough numbers, 50,000 to 80,000 servers. Why don't we build bigger? We easily could; I've been in far larger facilities, and it's totally doable. The thing is the return on largeness: as you build a data center bigger and bigger, the early advantages of scale are huge and the later advantages are small. Going from 2,000 racks to 2,500 racks is only a little better. A tiny data center is too expensive, and a really big data center is only marginally cheaper per rack than a medium-sized one. And as they get bigger, there's a risk: the blast radius. If something goes wrong and that data center is destroyed, the loss is too big. So the value of getting bigger goes down while the cost of failure goes up, and in our view this is around the right size — we've chosen to build at about this size for a long time, and that's why we don't build bigger.

How about the networking capacity provisioned to a data center? This is not the capacity inside the data center — that's wildly higher — this is the capacity coming into a data center: 102 terabits per second, a fairly substantial number. Again, you want control of networking cost so you can afford to keep doing something like this.

Let's dive inside the data center: swoop down to a rack, grab a server from the rack, and grab a NIC from the server. This one's important. As I said earlier with the financial model, I like to look at where the problem lies. If you look at the latency between two servers — you want to send a message from this server to that one — where does the time get spent? It turns out the software stack, flowing from your app down through the guest OS, through the hypervisor, down to the network interface card, costs milliseconds of latency. Passing through the NIC is microseconds. Spanning the wire, crossing the fiber, is nanoseconds. Which is to say, the only thing that matters is the software latency at either end — in any of the facility sizes I'm showing you, that's where all the cost is; that's where the problem lives.

OK, we know how to solve that. There's a wonderful technology called single-root I/O virtualization (SR-IOV). It allows a network interface card to virtualize in hardware — to present virtual NICs — so each guest gets its own network card, all by itself. Phenomenal, and not that hard to do. So you look at that and ask: why didn't we do this years ago? Why are we only deploying it now, on our newest instance types (it will be everywhere eventually)? The SR-IOV part isn't hard; anyone can do it. What's hard is: wait a second — we need isolation in our network, we've got to virtualize the network, we've got to meter it, we have to keep track of who's using what, we have to have DDoS protection, and we have to enforce capacity limits so you can't use more than you've purchased and impact another customer negatively. If the guest goes straight to the NIC, you need somewhere to do all that very important stuff. That's the magic, and that's the reason it's taken us a long time to get to this model. It's hard to do, is the short answer.
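On the customer side, this SR-IOV path is what AWS exposes as enhanced networking. A minimal sketch of checking whether it is enabled on an instance — the instance ID is a placeholder, and the attribute shown applies to the SR-IOV-based enhanced networking of that instance generation:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instance_attribute(
    InstanceId="i-0123456789abcdef0",     # placeholder instance ID
    Attribute="sriovNetSupport",
)
print(resp.get("SriovNetSupport"))        # e.g. {'Value': 'simple'} when enabled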
Let's look at the value: what do we get from going to this model? Technology people should know better than to demo anything on a logarithmic scale, because it makes everything look boring; however, some of these gains are big enough that I need one. First, look at network latency in the p50 case — the median. The average latency improvement is about 2x. That's pretty relevant; it's not earth-changing, but a factor of two is a good thing and we're excited by it. What we're really excited by, though, is what happens out toward the limit: in the rare case, what's the maximum latency? Look at the 99.9th percentile. It's those outliers that are super annoying for apps, those outliers that give customers a bad experience, those outliers that find bugs in applications that mostly don't show themselves. The outliers are down to a tenth of what they were — and this is measured on our gear, in our facilities, previous generation versus current generation. So I think it's a pretty big impact, and I think it's pretty important.

Another thing you probably know: the server and storage ecosystem is way healthier than the networking world, but it's still not that efficient, and it's nobody's fault. Server OEMs sell to tens of thousands to hundreds of thousands of customers with very diverse requirements all over the world. That's complicated: you need a big distribution channel to reach all of those customers, meet with them, demonstrate your servers, sell them, and get support data back. That's a big, expensive system, and that distribution channel costs about 30 percent. There's not a lot you can do about it — you kind of need to talk to that number of customers. But if you're selling to a cloud provider, or if you are a cloud provider, there's one customer. You just don't need a lot of discussion: I don't need to get taken to the Super Bowl this year, I don't need to play golf with anyone — we'll just buy our servers. That's a cheaper system that works better, and as a consequence those channel costs are gone. It's not magic; they're just not there anymore. That's a good thing.

More important than that, the servers are actually designed for what we'll do with them. Normally servers are designed for hundreds of thousands of users with all these different requirements. We know exactly what we're going to use the server for, we know precisely the environmental conditions it will run under, we know exactly how we're going to use it and which application stacks will run on it, and if anything ever goes wrong, we know exactly when that happened.
Because we've got all that data, we can actually design with less engineering headroom. If a server ODM ships a server and there's something wrong with it that requires the box to be opened, think of the cost of opening that box at tens of thousands of customer sites, individually, spread all over the world. If that ever happens, for sure someone's losing their job — and it does happen occasionally, and it is really expensive. And what happens when you get burned really badly? You get careful, really careful. And careful means huge engineering headroom that isn't getting used, because anything going wrong is just so risky. Well, we're replacing thousands and thousands of disks every day. We have whole teams that do nothing but open servers, and they're really darn good at it, and we know exactly where the data centers are — we don't have to travel a long way; we're already there. The cost of a failure just isn't that big for us, so we don't need all that wasted engineering headroom. Again, it's not magic; it's just a different world.

Here's another one that's different — wild, from my perspective. We know our environmentals as well. We know how to build servers to a certain specification, we know the mechanical design and can influence it, we can decide exactly what level of cooling we want, and we never put the processors shadowed behind memory in the airflow — we just design good servers. As a consequence, the processors can be pushed harder: through a partnership with Intel, we have processors that actually run faster at a given core count than what's available on the open market. Again, it's not magic; it's just that we're operating in a simpler world, and I think that will keep playing out — we'll keep finding situations where the industry carries a lot of protection because the cost of a negative situation is so high. This is a different world; it's just different.

I'll give you one example — I couldn't resist. This is a storage rack. This is what companies do not buy. Three years ago we did this design, and there's probably not huge demand for racks that weigh over a ton: a 19-inch rack, 2,350 pounds, with 864 disk drives. I suspect the reason this design didn't exist three years ago, other than ours, is that nobody wanted it. Well, it turns out that for some workloads this is a wonderful, game-changing design, and it's helping us get better prices in some areas. So we have that beast, and I love it.

Let's jump into relational databases. Relational databases are hard, hard, hard. They're hard to install, hard to afford, hard to take care of, hard to learn, and very hard to switch once you choose one — there are a million reasons why relational databases are challenging. I can be critical because I've worked on them for so many years. One solution is: just don't do it — go NoSQL — and that's a perfectly fine solution. On that model we've got DynamoDB. I showed you this last year, exactly the same chart: take just one region — not the whole world, just one region — and ask what request rate DynamoDB is servicing. At the time it was a little over two trillion requests per month.
It's now climbed, nice and steady, to over seven trillion per month. Last year I showed a scatter chart of the latency, and it just danced around at three to four milliseconds — closer to three — and stayed there. I was proud, because the service had grown at the rate I showed last year and the latency just stayed boring and flat; it never changed. The charts are the same — have a look at last year's slides — and yet they've grown about 3x in requests and 4x in storage. It's amazing. At the same time, the team is adding features: we've got JSON support, we've got global secondary indexes, and the product keeps getting easier to use. But never, ever is it going to become a relational database — we're not going to sacrifice scale, ever — so this is always a scalable solution with very predictable latency.

OK, but you still need relational databases. Some customers still use them, we still use them, I still use them; there are places where you can apply a relational database productively. If you don't have to, don't — but when you do, there's so much value in a relational database, and it makes an app so much easier. So what can we do to help? Well, what problems did we have? Cost is a problem: we can provide an open-source alternative, so we can drop the cost. OK, that's progress. Another is administrative complexity: we host it in RDS, which takes over a lot of the administration and helps with that one. But we've still got the availability problem, and availability is where the high-end databases live — that's where most enterprise features exist, and that's where the real money comes in. So what can we do there? With RDS we solved a lot of the administrative problem; let's go after the multi-AZ reliability problem.

Years ago, when I entered this industry, EMC came out with a product called SRDF. It was driven by the financial district in New York, which wanted real-time replication between New York and New Jersey. EMC produced this log-shipping solution and printed money with it: people would pay anything for it, and in fact most of them did. Incredibly expensive, and unbelievably important for super-high-availability applications. Oracle has done the same thing — again, very enterprise-level solutions, very good, very reliable, but very expensive. Our approach is to do the same thing, but a lot cheaper. Multi-AZ RDS ships the changes between two Availability Zones; if one fails, we fail over to the other, recover the database, and bring you back up — usually you're back and running in under a minute. A phenomenal capability. And because it's so cheap, rather than serving the 3 percent of the world that absolutely has to have that level of reliability, I can't see why we shouldn't serve the whole world — because any database that doesn't come up, for whatever reason, is going to waste somebody's time; somebody has to go take action. So I love this number: 40 percent of customers are now running Multi-AZ RDS databases, because we're doing such a good job of driving the cost down. It's good technology; it was just too expensive before, and now it's broadly available. My goal — and I'd love to see us all get there — is for this number to be around 70 percent. There are a few applications with truly disposable databases where it's arguably just an administrative tax, but even test and dev databases are worth running in this mode, in my opinion.
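For reference, turning this on is a single flag at instance-creation time. A minimal boto3 sketch — the identifier, instance class, size, and credentials below are placeholders, not recommendations:

import boto3

rds = boto3.client("rds", region_name="us-east-1")
rds.create_db_instance(
    DBInstanceIdentifier="example-mysql",      # placeholder name
    Engine="mysql",
    DBInstanceClass="db.m5.large",             # placeholder instance class
    AllocatedStorage=100,                      # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",            # placeholder; use a secrets manager in practice
    MultiAZ=True,                              # synchronous standby in a second AZ, automatic failover
)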
OK: we solved a bit of the availability problem, a lot of the administration problem, and a lot of the cost problem. One of the problems that's left is performance. A single MySQL database has performance limits, and the commercial systems are good — you get something for spending thousands and thousands of dollars, no question. So what we've done is produce what you heard about this morning, called Aurora. It's a new storage engine for MySQL, and it's drop-in compatible: all MySQL apps just run the same as they ever did. What's unusual about it? Quite a few things. Just doing a current, state-of-the-art storage engine would have been worth doing — take Jim Gray's big black book on transactions, implement it well, and we'd all be happy; it would be a wonderful thing. What the team chose to do is a little unusual: they separated the storage engine from the relational engine, which allows them to fail independently. They put a storage engine in each of three data centers, with two copies in each data center, and they ship deltas down to those storage engines, minimizing the network traffic. And because logs are not the most efficient thing to operate on directly, each of those three boxes independently transforms them into a read-optimized form. It's a cool design, and unbelievably available: six copies. If one data center is completely wiped out and there's a hardware failure somewhere else, it just keeps running — it doesn't care, no problem. If you have one of those events where a plane hits two data centers — not super likely — it won't keep running through that, but it doesn't lose any data; every committed transaction is there. So it's an amazing step forward from an availability perspective.

The thing you should be critical of, the thing that jumps out, is: can it perform? This looks like an expensive model, and it is — but it turns out that, done well, the latencies and the speed of light aren't that frightening, and this can be done with pretty good performance. Look at the numbers: partly because MySQL's existing storage engine is not the most modern technology available right now — that's one advantage we had — Aurora runs at more than 3x the write performance and more than 5x the read performance. So I showed you a system and talked about how available it is and what it can operate through, and rather than saying "I'll take it at the same performance," it's actually solving the performance problem in a pretty big way. This is big.

It's also got enterprise-level features. DB2 on MVS is able to find pages in its log: if you get a torn page in DB2 on MVS, running on zSeries mainframes, it can patch that page while you keep running. Phenomenal. Most databases don't do that — it doesn't happen that often, so they just stop running when there's a torn page. This one does it as well. Kind of impressive.
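A hedged sketch of what standing up the Aurora configuration described here might look like through boto3 — a cluster, whose storage layer holds the six copies across three AZs, plus one MySQL-compatible instance attached to it. All names, classes, and credentials are placeholders:

import boto3

rds = boto3.client("rds", region_name="us-east-1")
rds.create_db_cluster(
    DBClusterIdentifier="example-aurora",
    Engine="aurora-mysql",                     # MySQL-compatible Aurora engine
    MasterUsername="admin",
    MasterUserPassword="change-me",            # placeholder; use a secrets manager in practice
)
rds.create_db_instance(
    DBInstanceIdentifier="example-aurora-node-1",
    DBClusterIdentifier="example-aurora",
    Engine="aurora-mysql",
    DBInstanceClass="db.r5.large",             # placeholder instance class
)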
Redshift. I worked on DB2 years and years ago — 20 years ago we were working really, really hard on making a scalable data warehouse where you could have 128 nodes running a single statement in parallel, actually scaling, with the whole thing looking like a single parallel database system. It's what we've always wanted from high-scale data warehouses, and it is incredibly hard to make work — and whenever I say something is incredibly hard, you know what they charge for it. Hard is expensive; these systems are incredibly expensive. This product is able to run on 128 nodes in parallel, it's running a lot of amazon.com's data warehouse workloads, it's running all of AWS's data warehouse workloads, and it costs $1,000 per terabyte per year. Against the cost of the traditional systems, that's rounding error. Think what's possible if you can put a data warehouse on anything that's interesting to you — for example, what can I learn about disk failure rates across different batches? Just go get the data and try it. It's absolutely phenomenal; again, game-changing.

EBS — I won't spend much time here other than to say: watch us. Whenever you like something and start using it heavily, we learn. You loved SSDs, so we made SSDs available behind EBS, customers just loved it, and we said OK, we'll provide more SSD options. The first thing we did was Provisioned IOPS, which is really ideal for a database workload: we sign up, with three nines of reliability, to deliver absolutely the number of IOPS you bought — it's just like buying a storage system. Then General Purpose SSD says: I'll give you burst capability. That's not good for a constant, never-changing load like a database, but for almost every other workload it's an ideal choice and gives you much better value. What we've done recently is support 20,000 IOPS against a single volume — a fairly big number.
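A minimal sketch of asking for that guaranteed rate through the API — a Provisioned IOPS volume at the 20,000 IOPS figure quoted; the Availability Zone and size below are placeholders (Provisioned IOPS volumes need a minimum size for a given IOPS setting):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",   # placeholder AZ
    VolumeType="io1",                # Provisioned IOPS SSD
    Iops=20000,                      # the per-volume rate mentioned in the talk
    Size=500,                        # GiB; placeholder, large enough for this IOPS setting
)
print(volume["VolumeId"])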
Power infrastructure — worth spending a second here. Why would we design and build our own high-voltage power substations? You can save a tiny amount of money doing that, and we've done a couple, but we probably wouldn't do it for that alone. What's useful is that we can build them much more quickly. I showed you the rate we're growing at; that's not really a normal rate for utility companies — it's not the pace they're used to operating at. The reason we did it is that we had to; that's the bottom line. It's cool that we actually can, that we have power engineers, and that this becomes a skill center for us.

Pace of innovation: remember, Andy showed you this morning 442 different services and major features released this year. Andy's number was already stale — as of this morning it's 449. It just keeps cranking. And here's the thing I'm really proud of. Those of us who have been involved with the cloud for a while knew it was going to be important; we knew there was value here, we knew we were going to get big, and we were all scared to death of becoming a big company that gets slow and stops serving customers. So we're proud of two things. We're proud that as we grow we keep delivering at the same pace — in fact a faster pace — and that we're getting more reliable rather than less, because sometimes pace gets negatively translated into quality. We actually have a better quality record than we've ever had, and a higher pace than we've ever had. And the final thing we're proud of is that we're working really hard to make available to you the same kinds of assets that are helping us move fast. That same test system I used to test the network with 8,000 servers — we're making that same capability available to you. We believe we can help every one of your companies have that same chart, and keep it going at any size. That's why I think the cloud is pretty special, and why I'm glad to be able to talk to you a little bit today. Thanks for being here.
Info
Channel: Amazon Web Services
Views: 60,561
Keywords: AWS, Amazon Web Services, Cloud, cloud computing, aws-reinvent, reinvent2014, Spotlight, Advanced, Auto Scaling, Performance, Database, Architect, Technical Decision Maker, Startup, Public Sector, Partner: Technology, Partner: Consulting, IT Executive, Enterprise, Developer, James Hamilton, innovation at scale, cloud computing event
Id: JIQETrFC_SQ
Length: 47min 51sec (2871 seconds)
Published: Mon Nov 17 2014