AWS re:Invent 2017: ElastiCache Deep Dive: Best Practices and Usage Patterns (DAT305)

Captions
Hello everybody, my name is Michael Labib, I'm a Specialist Solutions Architect here at AWS, and I'm super excited to be speaking with you about Amazon ElastiCache for Redis. We recently released a lot of really awesome features that we're going to go through in today's talk. Just to level set, we're going to start off getting everybody on the same page with what Amazon ElastiCache is. Then we're going to dive in: how do you scale your Amazon ElastiCache cluster with online resharding. Then we'll review our security footprint, with respect to how you set up the topology, the networking, and all the infrastructure associated with your cluster, as well as encryption, which is a new feature we just recently announced. Then we'll take a look at some common usage patterns that a lot of our customers are using today, and we'll close off with best practices.

Year after year we're hearing increasingly from our customers about the need for speed, and this is coming from virtually every workload and every vertical you can imagine. What customers are essentially asking for in their data platform is a fast data layer, and this need has developed to the point where we're no longer measuring performance in milliseconds; this is a microsecond, sub-millisecond gauge. This is really where Amazon ElastiCache fits in. Amazon ElastiCache is an in-memory key-value store. We support the two most popular key-value store engines, Redis and Memcached. It's fully managed: zero administration, no racking, stacking, or failover that you have to worry about. And it's hardened by Amazon. What does that mean? It means we have engineers who are involved with the open-source Redis project; they are constantly looking at the engine and its features, we evaluate what our customers are asking us for, and we enhance the areas that we feel better serve our customers.

If you were to take a second to think about your data on a temperature gauge, the hot data would be the data you access most frequently. If you thought about the characteristics of that hot data, what you would want is the ability to support incredibly high request rates at very, very low latency, and that's basically where Amazon ElastiCache fits. As data volume grows and the data gets warmer and colder, you'll see SSDs and other data stores fit in. Now, this is not an either/or decision: if you're building a data platform, you want to think about your data access patterns and utilize the different data stores and databases that make sense for each specific access pattern.

Now a quick Redis overview, just to make sure everybody here is on the same page. Redis today is the most popular key-value store in the market, and the reason is that it has a variety of different features; it's the Swiss Army knife of key-value stores. It has a variety of data structures, and if you're a developer a lot of these will ring a bell: hash maps, lists, sorted sets, sets. It's very intuitive to use; in fact I like to think of the API as genius, because it's very rich in capability but syntactically very predictable and very easy to use.
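To make those data structures concrete, here is a minimal sketch using the redis-py client; the client choice, hostname, and key names are my own illustration, not from the talk:

```python
import redis

# Assumes a reachable Redis endpoint; host/port are placeholders.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# String: simple set/get with a TTL
r.set("session:42", "logged-in", ex=300)

# Hash map: an object-like record
r.hset("user:42", mapping={"name": "Ada", "plan": "gold"})

# List: a simple queue
r.lpush("jobs", "job-1", "job-2")

# Set: unique members
r.sadd("visitors:today", "42", "43")

# Sorted set: members ranked by score
r.zadd("leaderboard", {"ada": 1200, "grace": 1500})
print(r.zrevrange("leaderboard", 0, 1, withscores=True))
```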
Other things that make Redis very interesting: it supports high availability, so you can have a primary, or multiple primaries in a clustered environment, and you can fail over in a failure scenario. What does that give you? The added HA that you need for those critical workloads, the solutions and infrastructure you care most about. Another thing that's great about Redis is that it's open source, and Amazon ElastiCache fully supports the open-source version, so from a syntactic standpoint and from a protocol standpoint it's a drop-in replacement. Another cool feature is transaction support: you can use MULTI/EXEC to run various commands in a transaction block. And just like the commercials, there's always more. Redis has Lua scripting, so you can bake in business logic that you can call and reference within your Redis cluster. There are geospatial capabilities: a special sorted set that supports longitude and latitude. This is great for mobile applications where maybe you're passing location-based information up to the cloud and into Redis, and you're looking for points of interest to serve to your end users. There are pub/sub capabilities, which let you build chat applications and notification systems, and it's just built into Redis: there's nothing you need to turn on, there's no additional cost, it's just there.

In terms of ElastiCache features, ElastiCache offers a variety of ways to deploy and monitor your Redis cluster, and a lot of these will be familiar because we support them in other services as well. There's AWS CloudFormation, the template engine where you can write JSON or YAML templates. This is great for infrastructure as code: you describe what you want your application and Redis cluster to look like, how many shards, what node types, all those details; you can version that template and build it whenever you need to. I use this all the time when I want to do benchmarking or performance testing, and once I'm done I just tear down that stack. There are also the CLI and the SDKs: all the operations associated with ElastiCache are fully under your control through the CLI and SDK. The SDK you'll use in your applications, maybe in Lambda functions, and the CLI through the AWS CLI tool. And if you don't like a lot of excitement in your life, there's always the console, so a couple of clicks and you can build a Redis cluster that way as well.

From a monitoring perspective there's AWS CloudTrail, which gives you a log of essentially every interaction with the service: when it happened, who did it, that sort of thing. There's AWS Config, which is great for building compliance around what you want that cluster to look like. And there's Amazon CloudWatch, which I think pairs very nicely with Redis. What's cool about CloudWatch is that in addition to us funneling all the Redis INFO data up through CloudWatch, you can also be very proactive: set a metric that makes sense to you, issue an alarm, consume that alarm through an SNS notification, and then do something interesting, maybe an operation on your cluster. I'll have an interesting example of this later in the presentation; we'll revisit that architecture.
Now, other things that we've done: I mentioned earlier that ElastiCache is fully open-source compatible and that our engineers are very involved with the open-source project, but we've also built some enhancements into ElastiCache, so I'll review a few of these. The first one is an optimized background save operation. With Redis, especially if you have a lot of writes happening to your primary, if you were to take a snapshot you have the possibility of doubling your memory footprint; this is true for open-source Redis. What that means is that if you wanted to be safe and conservative, you would reserve 50% of memory just for those background operations. We took a look at that and thought we could do better. We wanted to give our customers the ability to use more memory without putting them in a situation that could create a problem, so we changed the algorithm around how much memory is available in your cluster: if we detect that you have too little memory for that fork-based operation, we perform a forkless save, essentially a timer-task job that does the background snapshot in a more memory-efficient way.

The second enhancement: with open-source Redis, if you're experiencing a lot of writes to your primary and the primary is under a lot of pressure, maybe memory is very low, there's the possibility that your replicas fall out of sync, and that's not a good scenario. So to protect your primary against failures, and to reduce the possibility of your replicas falling out of sync, we throttle some of those writes. They still execute successfully, we just slow them down to make sure everything stays consistent, and again, this only happens in a scenario where you're otherwise headed for failure, so we're mitigating that possibility.

The third one we actually contributed back to open source; it's in Redis 4 today, referred to as PSYNC2 in open-source Redis. Assume you have a primary and multiple replicas. If that primary fails, an election takes place in which one of your replicas becomes the new primary. With open-source Redis prior to Redis 4, all the other replicas essentially flush their data and then have to fully resync, rehydrating all the data from the newly elected primary. We wanted to make that whole failover process more effective and more efficient, so we rewrote it in an earlier version of ElastiCache and contributed a lot of that work to open source. As I mentioned, our engineers have fixed bugs and contributed various improvements ever since Redis 2.8, and I wouldn't imagine that trend changing; we support the open-source project and will obviously continue to.

Now, Redis comes in two different flavors; there are two different topologies. The first is a vertically scaled environment, meaning you have one primary, and your entire key space, which is the 16,384 hash slots, lives on that one node. So all your data has to fit on the largest size of that node, which in this case would be an r4.16xlarge, about 407 gigabytes, and your replicas each hold an entire copy of all the data.
What we provide you is a primary endpoint; your applications point to that primary endpoint, and you also have the replica endpoints. If there's a failover, we take that primary endpoint and propagate it onto one of the replicas, so we do the DNS swap for you. Nothing has changed with that topology, and we still support it, but with cluster mode enabled there are a lot of new things, so we're going to focus this talk on cluster mode enabled.

With cluster mode enabled, what you have is basically a horizontally scaled environment: up to 15 shards, where a shard is made up of a primary and 0 to 5 replicas, and each shard owns a portion of the key space, a hash slot range. The default distribution is to divide the hash slots evenly across the number of shards you have. The other difference with cluster mode enabled is that we give you a configuration endpoint. A cluster-aware client communicates through that configuration endpoint, gets a map of the whole topology and the nodes associated with your cluster, and then knows exactly which node and which shard to target for a specific key on a GET- or SET-like operation.

Now, there are other differences, so let's recap a few things between cluster mode enabled and disabled. From a failover standpoint, because there's no DNS involved, cluster mode enabled is much faster. If you care about failover speed, enabled makes a lot of sense, because a failover may take somewhere in the neighborhood of 15, worst case 30, seconds, and remember that's on an individual shard. Imagine you have three shards and one of them fails: only 33.3% of your data is affected, and if you wanted to reduce that blast radius even further, all you would do is add additional shards. So from a failure-risk standpoint, if you want to lower the impact of failures with Redis, you would go with the enabled version. The other difference is performance: because you can have many shards, you can write to many primaries at the same time, and since a shard consists of a primary and its associated replicas, from both a writing and a reading perspective you have more nodes to interact with, so performance is going to be greater. The obvious difference would also be storage: because you have more nodes and your data is spread across many primaries, you can have 6-plus terabytes of data in a cluster mode enabled topology, plus more connections. The one trade-off is cost: it may be more expensive because you have more nodes, although when you're architecting with cluster mode enabled you're selecting smaller node sizes rather than one large node. So that would be your trade-off.

Let's take a deeper look at cluster mode enabled, just to make sure we're all on the same page with how it works under the hood. As I mentioned, Redis has 16,384 hash slots, and these slots are divided across the number of shards you have; that would be the equal distribution, although you have control over changing it. A key essentially finds its slot by running the key through a CRC16 function modulo 16384.
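As a rough illustration of that mapping (my own sketch, not code from the talk): to my understanding, Python's binascii.crc_hqx computes the same XMODEM-style CRC16 variant that Redis Cluster uses, so the slot for a key can be approximated like this. Note that real clients also honor {hash tags}, which are omitted here:

```python
import binascii

def hash_slot(key: str) -> int:
    # Redis Cluster maps each key to one of 16384 slots via CRC16.
    # crc_hqx(data, 0) implements the XMODEM CRC16 that Redis uses.
    return binascii.crc_hqx(key.encode(), 0) % 16384

# Which slot (and therefore which shard's range) a key falls into:
print(hash_slot("user:42"))
```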
Whatever slot the key lands in is where it lives, and the client you're using, whether it's a Java, .NET, or Python client, as long as it's cluster-aware, will ping the cluster, issue a CLUSTER SLOTS command, and get a map of the topology: the nodes, and which node is associated with which hash slot range. So it knows where to direct traffic; it also knows which node is a primary and which nodes are replicas, things of that nature. Now, all clients are different, so you obviously want to look at the characteristics of each client, but that's generally how the smart clients work (there's a minimal client sketch at the end of this section).

Let's visualize this. Imagine this is your cluster, an example of a three-shard cluster, where each shard is a different color and the gray border denotes the primary. Again, a cluster can be made up of up to 15 shards, and a shard can have up to 5 replicas. If you look closely, the replicas have the same hash slot range as their primary, and this is true for all three shards. So what happens in a failure scenario? Imagine one of your primaries failed. The impact is only to writes: you can always read during a failure, assuming you have a replica. For 15 to 30 seconds, what happens is we detect the failure and immediately elect one of your replicas to be the new primary. As soon as that election happens, your client is made aware of it and can start writing to that newly elected primary. And if you wanted to reduce the blast radius, as we mentioned earlier, you just add more shards: if you had 10 shards, only 10% of your data would have been affected in a failure scenario like that.

Now, what happens if you lose a majority? This is another difference we have from open source. With open-source Redis, if you lose the majority of primaries, you have a problem with the election: your cluster is unable to elect a suitable replica to be the new primary. This is something we added enhancements around; we're able to make an intelligent election, and your cluster recovers. Again, there's nothing you need to do, no intervention; this is something we're managing for you.
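For completeness, here's a minimal sketch of a cluster-aware client pointed at a configuration endpoint, using redis-py 4.x's RedisCluster class; the endpoint hostname is a placeholder, and the talk doesn't prescribe any specific client:

```python
from redis.cluster import RedisCluster

# The configuration endpoint of a cluster-mode-enabled replication group
# (placeholder hostname). The client discovers shards and slot ranges itself.
rc = RedisCluster(
    host="mycluster.xxxxxx.clustercfg.use1.cache.amazonaws.com",
    port=6379,
    decode_responses=True,
)

rc.set("user:42:name", "Ada")   # routed to whichever shard owns the slot
print(rc.get("user:42:name"))
```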
So the question is: imagine you have a cluster of three shards and you want to go to five. How do you do this? There's an old way, and I'll just call it the old way for now, and then there's the way I want to introduce shortly. The old way, and this is only for cluster mode enabled, is to leverage backup and restore, and we'll talk through why it's not efficient as we build up this slide. The way you would do it is to take a snapshot of the three-shard cluster, and there's an obvious problem as soon as you do: any new writes happening to your cluster are not captured in that RPO, that snapshot, so you're going to have to mitigate that; we'll come back to it. After you take the snapshot, you copy it into your S3 bucket, then you create a new cluster and pass it the RDB files from the three-shard cluster, hydrating the new cluster with whatever shard count you wanted; we'll assume five here. Each shard, remember, is designated a certain hash slot range, so it will discard whatever keys or slots don't belong to it, and the slots get distributed across the five shards. After you've created that cluster, you point your application at the newly created cluster. There's a new endpoint, there will be some downtime, which is never fun, and then you have this new cluster to work with. The thing you have to mitigate with this solution is what to do with the writes that occurred after you took the snapshot. What some people would do is write them to a queue or Kinesis or something like that, and once the new cluster is up, hydrate those writes into it. Not the best solution, but a solution.

There's a better solution, though, and this is what we're going to review right now: zero-downtime online resharding. We just recently announced this, and we wanted to build it the way we knew our customers wanted; we'll talk about the enhancements we made and how it differs a bit from how open source does it. Let's go back to that three-shard scenario: imagine you have three shards and all of a sudden you want five. It's a very, very simple API; in fact there's a CLI command for it. You pass the replication group ID, which is like your cluster name; you pass an apply-immediately parameter, basically whether you want the operation to start occurring right away; and you pass the node group count for the new shard configuration, in this case five. The same API also works for scaling down; the only additional value you pass is which node groups you want to remove. Everything after that we manage and do for you. There's zero disruption to your application: you're still using Redis just like you were using Redis, while we do the resharding, performing a uniform slot distribution across the new shards and migrating slot by slot in a very reliable and robust way, and you don't have to worry about it.

One of the differences from open source: open source does a key-by-key migration, and we wanted to change that, because when you go key by key there's a possibility that a slot ends up split across multiple shards. That problem limits a few commands; for example, there are some Lua capabilities, MGETs, and things of that nature that you won't be able to use, so there are limitations, and when a failure happens it's harder to recover from that scenario. So we decided to spend a little more time on this problem, and we changed the algorithm to go slot by slot, which prevents the split-slot scenario and doesn't disrupt or change the behavior of your application; we're not limiting the commands you need to run. The only con is a possible performance impact during the migration, but there is no downtime, so that's the one thing I'd like to call out. The same is true for scaling in: you have a cluster and all of a sudden you realize you don't need all these additional shards, your memory consumption is good enough, maybe three shards can support the workload, so you scale in.
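Here's a minimal sketch of that resharding call through the AWS SDK for Python, boto3 (the talk shows the equivalent CLI command; the replication group ID and node group IDs below are placeholders):

```python
import boto3

elasticache = boto3.client("elasticache")

# Scale OUT a cluster-mode-enabled group from 3 to 5 shards, online.
elasticache.modify_replication_group_shard_configuration(
    ReplicationGroupId="my-redis-cluster",   # placeholder name
    NodeGroupCount=5,                        # desired shard count
    ApplyImmediately=True,                   # start resharding right away
)

# Scaling IN additionally names the shards (node groups) to remove, e.g.:
# elasticache.modify_replication_group_shard_configuration(
#     ReplicationGroupId="my-redis-cluster",
#     NodeGroupCount=3,
#     ApplyImmediately=True,
#     NodeGroupsToRemove=["0004", "0005"],
# )
```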
So how do you do that in an automated way? This is where we revisit how CloudWatch pairs nicely with ElastiCache. Let's say you had an alarm set on memory and that alarm went off, so your memory is high. What do you do? You issue an SNS notification, have that notification trigger a Lambda function, and that Lambda function parses the notification and, based on the alarm, does something; in this case we're going to add some shards. You can automate this: add the logic that makes sense to you to scale out and scale in, very similar to how Auto Scaling groups work for EC2, but in this case you're building that workflow using CloudWatch, SNS, and AWS Lambda. The one thing I would advise if you're going to do this is to account for the time the operation takes to complete, so you might want to be a little conservative with the metric threshold, whether it's memory or CPU or whatever you want to react to. As soon as you kick off that function, you're basically telling ElastiCache to scale out, and then everything is back to normal; there's a minimal sketch of such a Lambda handler below.

Now let's look at this from another perspective. Imagine this is your application and everything is healthy. If you're a business owner, something good just happened: you have more customers hitting your site and they want to buy products. If you're in infrastructure or a developer, something bad just happened: now you have heavy pressure and heavy load on your database and your application. So, just like we talked about, you'd trigger an alarm to scale out your EC2 instances, and very similarly you can have a conservative alarm to scale out your cluster. The added benefit is that you're still protecting your back-end database: you're absorbing that pressure in the cache layer and leaving your back end in the state it needs to be in, which is really cost effective, because scaling your databases is costly, or can be, with licenses and things of that nature. And when you do something like this, your cache can support much, much greater operations per second, with latency far lower than a back-end database, so this is ideal.
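Here's what that glue could look like as a hypothetical Lambda handler; the replication group ID, shard ceiling, and scaling step are all my own illustrative assumptions, not prescribed by the talk:

```python
import json
import boto3

elasticache = boto3.client("elasticache")

REPLICATION_GROUP = "my-redis-cluster"   # placeholder
MAX_SHARDS = 15                          # cluster-mode-enabled limit

def handler(event, context):
    # SNS delivers the CloudWatch alarm as a JSON string in the message body.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    if alarm.get("NewStateValue") != "ALARM":
        return

    group = elasticache.describe_replication_groups(
        ReplicationGroupId=REPLICATION_GROUP
    )["ReplicationGroups"][0]
    current = len(group["NodeGroups"])

    if current < MAX_SHARDS:
        # Add one shard; online resharding keeps the cluster serving traffic.
        elasticache.modify_replication_group_shard_configuration(
            ReplicationGroupId=REPLICATION_GROUP,
            NodeGroupCount=current + 1,
            ApplyImmediately=True,
        )
```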
Let's take a second here to look at Amazon ElastiCache security; we'll review the basics and build up from there. Starting from a clean canvas, imagine you have a VPC with a few AZs you're working in. The first thing you do with Amazon ElastiCache is define a cache subnet group, which is basically a collection of private subnets that you place your cluster in, spanning the AZs you want to host your cluster in. The second thing you do is define a security group, which is very similar to a firewall: it specifies the port, the protocol, and the IP range or security group that will have access to your cluster. Then you create your replication group, your cluster, and place it in there, using that security group and that cache subnet group as part of the replication group creation. Now, if you take a snapshot and you want encryption at rest, we have that feature; it's relatively new, so your snapshots will be encrypted. Then, for application access, you'll build your application, there will obviously be security groups protecting it, and you'll enable traffic from your application security groups to your ElastiCache security group; at that point you have access to your ElastiCache cluster. So this is a very secure environment: the only way anything can reach your ElastiCache cluster is if you enable access.

Now, what if you wanted encryption between your application and your cluster, encryption in transit? This is also relatively new: we just announced encryption in transit with Redis 3.2.6, and we also added Redis AUTH, for folks who want token-based authentication against the cluster from their applications. So to recap: encryption is new, both in transit and at rest; there's nothing you need to worry about with respect to keys or the issuance and renewal of those keys, we take care of all that for you; and from a HIPAA standpoint, if that's something you care about, it's included in the AWS BAA, so ElastiCache is HIPAA eligible.
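As a sketch of what those options look like when creating a cluster with boto3; the identifiers, node type, and auth token are placeholder values, and I'm assuming a subnet group and security group that already exist:

```python
import boto3

elasticache = boto3.client("elasticache")

# Encrypted, cluster-mode-enabled replication group (Redis 3.2.6 supports
# in-transit encryption and Redis AUTH). All names here are placeholders.
elasticache.create_replication_group(
    ReplicationGroupId="my-secure-redis",
    ReplicationGroupDescription="Encrypted Redis cluster",
    Engine="redis",
    EngineVersion="3.2.6",
    CacheNodeType="cache.r4.large",
    NumNodeGroups=3,                      # shards
    ReplicasPerNodeGroup=2,               # replicas per shard
    CacheSubnetGroupName="my-cache-subnets",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    TransitEncryptionEnabled=True,        # TLS between app and cluster
    AtRestEncryptionEnabled=True,         # encrypted snapshots and storage
    AuthToken="a-long-random-token",      # Redis AUTH password
)
```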
Now we'll review some common usage patterns with ElastiCache. This first slide really just shows the different types of organizations that use ElastiCache; it covers all sorts of verticals and, as I mentioned earlier, all sorts of workloads, spanning from caching to sentiment analysis to streaming data. Usually what happens is an organization starts with caching, maybe database caching, then moves to object caching, maybe API caching, where they cache the responses to API requests, maybe they start caching Elasticsearch responses, and they grow into other use cases from there. We also see some organizations use Redis as a standalone database, and this is certainly doable, especially if you can recreate the data, because you can take snapshots and you have HA, so a metadata store or something you can recreate is certainly possible.

Let's take a look at caching. Typically the way people think about this is that you place a cache in front of a data store or database and cache the results you would normally get from that database. Where it gets interesting is when you have correlated data, data that spans multiple databases. Let's make up an example: imagine you're capturing clickstream data for a particular customer, maybe in DynamoDB; you have transactional data for that customer, orders and order history, in your relational database; and you also have product metadata in your S3 object store. What you might want to do, in addition to caching that back-end data, is create a cache hub: a central location that really captures the activity of that user. You can easily do this, and some people do, with Redis. It simplifies your data access by putting everything in one spot, and that aggregation is very natural for analytics or various other use cases in your application. The one thing you'll need to do is make sure you keep that cache fresh, whether through AWS Lambda or some other process you define: whenever you update the back-end data store, keep the cache as fresh as possible, so that aggregated view stays as accurate as possible.

Furthermore, with caching we also see organizations essentially caching everything, including other NoSQL databases. To some this sounds at first like an anti-pattern: why am I caching a database like Cassandra? Well, you're doing it for the same reasons: you want to lower the latency and increase the data-retrieval speed from your back-end data store, and you may also want to lower your cost, because Redis supports incredibly high request rates at a low cost; you're not paying per request for that throughput against the cache. There are various techniques for caching the data, whether you serialize the objects you get out of those back-end databases or convert them into a hash map or something that makes sense for your application; it all depends on how your application wants to use that data.

Another interesting pattern we see a lot, because Redis is very fast and very rich in data structures, with aggregate structures like the hash map, the set, the sorted set, and linked lists, is enriching fast-moving data. Going back to sentiment analysis: say you're capturing tweets, data coming in very fast, maybe through Kinesis Streams. You want to interpret that data and enrich it, and see whether that user did something previously on the system. So you peel records off the stream, take the user ID or something else you can correlate with the cache, query the cache, and ask: do I have activity for this user? If you do, you summarize it, decorate the record, and enrich the data, and after creating this aggregate view you store it in a cleansed stream. That cleansed stream now carries richer, processed information that some other process, maybe additional analytics, can consume, or it might hydrate another back-end database. Other things you can do: we talked about pub/sub, so if you see something interesting in that stream, you can publish it to subscribers who do something with the data, maybe a dashboard or something else. You can certainly build that with Redis.
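A minimal sketch of that enrichment step, assuming records arrive as dicts with a user_id field and that recent user activity is cached in a Redis hash; the key names are my own illustration, and the pub/sub channel stands in for the cleansed stream:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def enrich(record: dict) -> dict:
    # Look up prior activity for this user in the cache.
    activity = r.hgetall(f"activity:{record['user_id']}")
    if activity:
        # Decorate the record with what we already know about the user.
        record["prior_activity"] = activity
    return record

def process(record: dict):
    enriched = enrich(record)
    # Hand the enriched record to the "cleansed" stream; here a Redis
    # pub/sub channel stands in for Kinesis or similar.
    r.publish("cleansed-stream", json.dumps(enriched))

process({"user_id": "42", "text": "loving this product"})
```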
As we're talking about fast-moving data and Kinesis Streams: big data architectures are becoming more and more popular with Redis. What we see people do is keep that data in their big data lake as they process it, but also augment the data platform they're building with a fast data layer. This fast data layer is for the transient information, the active data, something they can really pound on at a low cost, because again there's no added cost for that throughput or those requests. They might use a variety of engines or products to process the information, whether it's Kafka, Apache Storm, or Spark, and then use a connector to drop that data into ElastiCache Redis. Once it's there, they use a variety of tools, depending on the workflow, and analyze the data both in the fast data store and in their big data lake, for historical analysis, aggregations, and things of that nature.

IoT: we're seeing a lot of usage of Redis with IoT, and this really depends on the use case, but what often happens is an organization is building something and doesn't know exactly at first what data they want to store for historical reasons. They want to capture all the sensor information in the beginning and tailor it down later to what they really need, and they want to build the solution in the most cost-effective way that doesn't hinder performance. So they capture sensor information, perhaps using the AWS IoT service, which makes it incredibly easy, and once it's in AWS IoT they trigger a rule from the rules engine that invokes a Lambda function, and that Lambda function writes the data to Redis. If this is time-series data, it's very easy to do, because you can capture the data in a sorted set, whose score lets you sort the data, and use a timestamp for that score. As you write the data to the sorted set, you can also keep the properties of that sensor reading in a hash map, and wrapping all that in a transaction makes a lot of sense. If you also want the raw data for historical reasons, write it to S3 through Kinesis streams as well. That's an approach we see some customers taking with ElastiCache Redis.
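A minimal sketch of that time-series write, assuming readings arrive with a device ID and measurements; redis-py pipelines run as a MULTI/EXEC transaction by default, and the key names are illustrative:

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_reading(device_id: str, temp_c: float, humidity: float):
    ts = int(time.time() * 1000)
    reading_key = f"reading:{device_id}:{ts}"

    # transaction=True (the default) wraps both writes in MULTI/EXEC,
    # so the index and the reading stay consistent. In cluster mode you
    # would use {hash tags} so both keys land in the same slot.
    pipe = r.pipeline(transaction=True)
    # Sorted set indexes readings by timestamp (the score).
    pipe.zadd(f"readings:{device_id}", {reading_key: ts})
    # Hash holds the properties of this individual reading.
    pipe.hset(reading_key, mapping={"temp_c": temp_c, "humidity": humidity})
    pipe.execute()

record_reading("sensor-7", 21.5, 40.0)
# Fetch the most recent reading keys for a device:
print(r.zrevrange("readings:sensor-7", 0, 4))
```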
The geospatial capability in Redis really was a game changer. We talk to a lot of organizations building mobile applications, whether it's a ride-sharing organization or one with a recommendation engine, maybe restaurant recommendations or something of that nature. What they do is take the user's longitude and latitude as you're walking around, pass it up, maybe through API Gateway, hit a Lambda function, query Redis with that geo information, and recommend points of interest. This is incredibly easy to do with Redis, and again Redis is very, very fast, which enriches the user experience because your recommendations come back incredibly quickly. The only thing you need to do is define a workflow that constantly keeps those points of interest fresh in Redis, if Redis is not your primary database here. In this case we're using DynamoDB as the primary database with DynamoDB Streams: every time we write, say, a restaurant into DynamoDB, that information goes through DynamoDB Streams, triggers a Lambda function, and gets written into Redis. So in this case Redis is really complementing DynamoDB.

AdTech is another use case we see a lot of organizations using Redis for. To review the workflow: you have ad publishers who are placing ad slots up for bid, sending information, clickstream and user information along with a particular ad location, into an ad network, and then you have bidders who decide whether they want to bid to place their advertisement in that slot. The logic involved needs to happen incredibly fast: the entire workflow needs to complete in under 40 milliseconds. The most critical part is when the bidder receives the information about the ad: they need to execute whatever logic they have incredibly fast, so they need a really fast database, and better still, a database that provides capabilities for really quick operations. Redis has various things, like sets where you can do intersections and unions, along with a variety of other data structures, so bidders take advantage of those capabilities to run very quick operations with Redis and then bid or not bid.

Chat applications: we see a lot of this, especially with gaming companies. If you're a gamer, maybe you're playing a game and you see chat happening in a campaign, a lot of communication within a group of players; that communication may be powered by Redis. Other times I've spoken with customers where you're on a website and a chat window just pops up; that application is powered by Redis using its pub/sub capabilities.

Leaderboards: this is like a go-to solution for gaming companies, and it comes back to the sorted set, which gives you very easy ways to rank information and retrieve it for various slices of users. For example, I can ask for all users at a specific rank, or the reverse rank of the top users, the natural operations you'd perform in a ranking engine, which is what a leaderboard is. You get that capability out of the box with Redis.

Rate limiting: we have organizations that sell API access in different packages, where customers purchase how many times they can call a specific API, maybe a silver, gold, platinum type of deal. In the API the end user consumes, they make a call out to Redis; there are various ways you can implement this, but essentially you use counters, incrementing or decrementing a counter, and just make sure that customer hasn't hit their limit; if they have, you throttle the request. Again, they use Redis here for speed, and also because you don't pay for all those requests, so it makes a lot of sense to do something like that.
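Here's a minimal sketch of those last two patterns, a sorted-set leaderboard and a counter-based rate limiter using a fixed one-minute window; the limit, window size, and key names are my own assumptions:

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# --- Leaderboard: sorted set keyed by score ---
r.zadd("leaderboard", {"player:1": 3200, "player:2": 4100, "player:3": 2950})
top = r.zrevrange("leaderboard", 0, 9, withscores=True)   # top 10
rank = r.zrevrank("leaderboard", "player:3")              # 0-based rank

# --- Rate limiter: fixed one-minute window per API key ---
def allow_request(api_key: str, limit: int = 100) -> bool:
    window = int(time.time() // 60)          # current minute
    key = f"ratelimit:{api_key}:{window}"
    count = r.incr(key)                      # atomic counter
    if count == 1:
        r.expire(key, 120)                   # let old windows expire
    return count <= limit                    # throttle when exceeded

print(top, rank, allow_request("customer-42"))
```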
All right, let's spend some time talking about best practices. The first one we'll review is cluster sizing. When you're building a cluster, there are a few things you really want to consider. The first is storage, the default consideration: look at how much data you actually need, and then add 25%; that's the memory you want to reserve for Redis for the background operations we talked about, and by default we reserve it for you. The next thing, and this is really optional, is to add a little buffer for growth. That would be a healthy way to size your cluster. The second thing you want to do, and you should review this with the developers if you're in operations, or review it yourself if you're a developer, is make sure TTLs are being used properly. The best way to do this is to understand the frequency of change of the underlying data: review how frequently that data changes and make sure the TTLs being placed on it reflect how the underlying data actually changes. We talked about scaling up and out using CloudWatch, so you might have a set of alarms to react proactively to your cluster. If you're sizing for memory, you'll select an R-family instance: R4s are great, they're memory optimized, and in addition to memory they support incredibly high networking, starting at an r4.large, for example, with 10 gigabits of network performance, so they're network and memory optimized.

The second thing is to think about performance. Once you have storage out of the way, consider what types of operations are happening, and have a game plan for when each threshold is met. The first is kind of obvious: if you're spiking on read IOPS and need more read IOPS, add more replicas. If you need more write IOPS, you need more shards, more primaries you can write to. If you need more network I/O, if you're somehow constrained on the network, pick a network-optimized instance. In general, if you're loading data, use pipelining as much as you can for bulk reads and bulk writes (see the sketch after this section). Finally, consider the Big-O time complexity associated with your operations. This is a bigger topic, but the point is to really understand what each operation does, what impact it has on the data structure, how many members that data structure has, and what you can do to reduce the worst-case cost of that operation.
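A quick sketch of pipelining bulk writes with redis-py, batching many commands into one network round trip instead of one per command; the batch size and key names are illustrative:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Load 100,000 keys in batches of 1,000 commands per round trip.
# Without pipelining, each command would pay a full network round trip.
pipe = r.pipeline(transaction=False)   # no MULTI/EXEC needed for a bulk load
for i in range(100_000):
    pipe.set(f"item:{i}", i)
    if i % 1_000 == 999:
        pipe.execute()
pipe.execute()                         # flush any remainder
```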
Finally, think about cluster isolation. When you create a cluster, there are a variety of parameters you can define, one of which is the eviction policy, and what you might select for a caching cluster may be very different from what you'd select for a metadata store. So you might want caching workloads in one particular cluster, maybe a different cluster for your queues, maybe another cluster for something else like your metadata store: group by purpose. The more granular you get, the more the cost goes up, so you need to figure out what makes sense for your organization. The other thing you always want to do, once you have an idea of what makes sense from a sizing perspective, is test it. It's very easy to test, and what I actually do in my CloudFormation templates is also build out an instance with the Redis client tools installed, so it's very easy for me to test performance before I hand the cluster over to somebody or start using it in a POC or a demo. That's another way to confirm it's going to perform the way you want.

We talked a little about eviction policies, so let's go a bit further. When you create that cluster, if you're doing it for your developers, you really want to evaluate what the cluster is being used for, and more specifically how TTLs are being used. Among the different eviction policies that exist, allkeys-lru will evict any least-recently-used key across your entire key space, while volatile-lru will only evict keys that have an expiration, a TTL, set. The difference between the two is big: what if you have data without TTLs alongside data with TTLs, and the data without TTLs is metadata you never want evicted? That would be a reason to use one versus the other, and that's the kind of logic you should apply; the same reasoning holds for volatile-ttl and so on.

From a CloudWatch perspective, you always want to look at CPU utilization, so you're monitoring how much processing is happening on your cluster; if you feel you need more CPU, you might need more shards. Swap usage you want to see low, if not zero, especially if memory pressure is high; you never really want to be swapping in an in-memory system. For cache misses and hits, you always want more hits than misses, so you might target a ratio of 90% and up; that would be healthy. If you've built out everything we talked about, storage, performance, all those things, but your developers aren't using the cluster effectively, it's kind of pointless, so always watch the cache hit-to-miss ratio. Evictions: you never want to see evictions, even though we just talked about which policy to select. An eviction really happens when you're overloading the cluster; the cluster is at least being nice to you and saying, before I get rid of a key, tell me what algorithm you want me to evict with, and that's where you select the maxmemory policy. Ideally you never hit that scenario, unless you have a specific use case, like Russian-doll caching or deliberately defining the cluster as an LRU cache.
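As a sketch, the eviction policy on ElastiCache is set through a cache parameter group; here's what that could look like in boto3, assuming a parameter group you've already created (the group name is a placeholder):

```python
import boto3

elasticache = boto3.client("elasticache")

# Evict only keys that carry a TTL, preserving non-expiring metadata keys.
elasticache.modify_cache_parameter_group(
    CacheParameterGroupName="my-redis-params",   # placeholder
    ParameterNameValues=[
        {"ParameterName": "maxmemory-policy", "ParameterValue": "volatile-lru"},
    ],
)
```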
From a connection standpoint, you always want to see connections stable, and you always want to validate that your developers are using connection pooling, just like they would with a database. And as we mentioned earlier, set alarms wherever you can. Keep in mind, from a connection perspective, you can have up to 65,000 connections per node, whether it's a primary or a replica, and there are parameters for killing idle connections, whether you use a timeout or TCP keepalive, so take a look at those and figure out what values make sense for your organization. For reserved memory, as I mentioned, we reserve 25% by default, so that would be the recommendation there, along with setting the maxmemory policy as we discussed earlier.

Now let's take a quick look at caching tips. The first tip, as we mentioned earlier: understand the frequency of change of the underlying data, and what I always do is try to be conservative with this. The first question to ask yourself is: what's the impact of serving stale data to the end user? If the impact is high, be conservative; if there's very little impact, use your best judgment, and work with your database administrators and your business owners to understand what value makes sense to place. As we discussed: set appropriate TTLs that match that frequency of change, choose the eviction policies that align with your requirements, isolate clusters by purpose, and maintain cache freshness with write-throughs.

There are two main patterns when you're dealing with a cache (there's a small sketch of both below). In the cache-aside pattern, the cache sits on the side of your architecture, and your application does lazy loading: you check the cache for a value; if it's not there, you query your back-end database, grab the value, and hydrate your cache with it, along with a TTL, so that the data is there for the next request that comes in. The other approach is proactive, the write-through: whatever process updates your back-end database also writes that data to the cache, maybe with a conservative TTL, so if the data isn't needed, that's okay, it will expire; but if a request does come in, you've increased the likelihood that the data will be there, and you maintain a better user experience because the data is consistently found in the cache. Ideally, you use both lazy loading and the write-through pattern with your cache.

We talked about how to size your cluster, so always do that. These are the tenets of a successful cluster: monitor the hit and miss ratio, and use the failover API. I highly, highly recommend this: we expose an API that allows you to kill primaries, so do it; make sure your applications are built in a way that can withstand failures, and the way you do that is by testing with the failover API. That's all we have today, so thank you for your time; I hope you learned something new about ElastiCache. [Applause]
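A minimal sketch of both caching patterns, assuming a hypothetical load_from_database function and illustrative key names and TTLs:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 300  # tune to the data's frequency of change

def load_from_database(key: str) -> dict:
    # Placeholder for the real back-end query (RDBMS, DynamoDB, etc.).
    return {"id": key, "value": "from-db"}

def get_item(key: str) -> dict:
    # Lazy loading (cache-aside): check the cache first...
    cached = r.get(f"item:{key}")
    if cached is not None:
        return json.loads(cached)
    # ...on a miss, read the database and hydrate the cache with a TTL.
    item = load_from_database(key)
    r.set(f"item:{key}", json.dumps(item), ex=TTL_SECONDS)
    return item

def put_item(key: str, item: dict):
    # Write-through: the process that updates the database also
    # refreshes the cache (database write omitted here).
    r.set(f"item:{key}", json.dumps(item), ex=TTL_SECONDS)
```

For the failover testing the talk recommends, the ElastiCache API exposes a test_failover operation (in boto3: elasticache.test_failover(ReplicationGroupId=..., NodeGroupId=...)), which forces a failover on a single shard so you can verify your application rides through the election.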
Info
Channel: Amazon Web Services
Views: 23,339
Rating: 4.911602 out of 5
Keywords: AWS re:Invent 2017, Amazon, Databases, DAT305, ElastiCache
Id: _YYBdsuUq2M
Length: 56min 49sec (3409 seconds)
Published: Wed Nov 29 2017