AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns (DAT306)

Captions
Hello and welcome to Amazon ElastiCache Deep Dive. My name is Michael Labib, I'm a Specialist Solutions Architect here at AWS, and I'm delighted to share the stage today with Brian Kaiser, CTO of Hudl, who will be presenting afterward. We have a lot of content to present today, so please save your questions for the end of the session and we'll stick around and take them. Today we're going to talk about the value of a key-value store, dive into Amazon ElastiCache, look at the various usage patterns you can use Amazon ElastiCache for, talk about how you can scale your data using Redis cluster, and look at best practices, and at that point I'll hand it over to Brian.

We are headed into the midst of a massive shift toward real-time data, and if you think about the need for real-time analytics, data velocity, and data value, it has really created an emerging trend toward fast data. So in our session today we're going to talk about how Amazon ElastiCache can power those various workloads beyond caching. What is Amazon ElastiCache? It's a managed service that supports the two most popular key-value store engines, Redis and Memcached. It's fully managed, which means you don't have to worry about anything besides your data and the size of your cluster, and it's highly available, reliable, and managed and hardened by Amazon; we'll talk about what that means exactly in a later slide. Now, if you were to think of your data as a temperature gauge, you would want that hot data to be readily available, to support extremely high request rates, and to support extremely low latency; that's where Amazon ElastiCache fits. You might also have warm data and cold data, and for those needs you'll have different data stores that can augment your solution. For your cold data you might use Amazon Glacier, where you can archive it, and the same is true for the other data stores in between; it all depends on your use case.

I mentioned there are two popular key-value stores supported by ElastiCache, the first one being Memcached. Memcached has been around since 2003 and has been the gold standard of caching for many years. If you think about the capabilities of Memcached, it's really like a flat cache: it supports one data structure, the string, with values up to one megabyte, and it has no persistence. So if you're adding shards in a Memcached cluster and you lose the data in a particular node, you've lost that data; for a caching use case that's okay, and if it's not okay, that's where Redis fits in, and we'll talk about that. The other things about Memcached: it's very easy to scale, you can add nodes and your key space will distribute across those nodes pretty easily, and it's insanely fast; we're dealing with microsecond performance. Redis, on the other hand, I like to think of as a superset of Memcached. Why do I say that? Well, it supports the string data structure, but instead of storing values up to one megabyte you can store up to 512 megabytes worth of data. There are other data structures, which we're going to dive into. It has persistence, so if you care about that data and you want to support your RTOs and RPOs, you can do that with Redis; we'll talk about the different options you have.
It's highly available: you can have a primary, and if that primary node fails, a read replica will be promoted to be the new primary. It's very powerful: there are over 200 commands in Redis, and with Lua scripting you can build business logic and have that logic execute in memory. It's also simple, and that's important to mention because syntactically it's very easy to use; we'll see some examples later in this presentation.

When we talk about data structures, I like to start with the basics. The basic data structure is a string, which is supported in both Memcached and Redis. In Redis it supports up to 512 megabytes and it's binary safe. What does that mean? It means you can put essentially anything that fits into that space in the value: HTML code, a JSON object, an image. There's also a cool capability where, if you have an integer representation in that string value, you can increment and decrement it and use it as a counter. Where it gets interesting with Redis is these additional data structures. The first one we'll talk about is a set. If you come from a dev background, you'll remember that a set is a collection that only allows unique values or elements. Say, for example, you want to group your customer IDs: your key might be the customer list, and every value in that set might be a customer ID. This is great because you don't want duplicate customer IDs, and it's managed for you in memory; it's lightning fast, microsecond performance, and a great way to group your data together. Now, a sorted set is a set, so it maintains those unique values, but it also has an interesting parameter: the score. A score allows you to sort the data based on a particular value. Take for example a game you're building: your key might be a leaderboard, your users are the values, and you want to sort those users by their score. Again, this happens for you in an in-memory, high-performance engine: you pass in those values, it maintains uniqueness, and it sorts them for you automatically. You can retrieve the data in a number of ways, for example in ascending or reverse order, and you can pull back a range of information. So this is great for deduplicating, grouping, and sorting data, and we'll look at other use cases for a sorted set; there's a quick sketch of the leaderboard pattern below. A list is a collection that captures elements in the order they are inserted, and it's great for pushing and popping elements from either the head or the tail of the list. A common pattern built using a list is a timeline: the timeline might be your key, and all the elements you put in the list could be events on that timeline.
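As an illustration of the leaderboard pattern just described, here's a minimal sketch using the redis-py client; the endpoint, key name, players, and scores are all hypothetical.

```python
import redis

r = redis.Redis(host="my-cluster.xxxxxx.use1.cache.amazonaws.com", port=6379)

# ZADD keeps members unique and ordered by score; redis-py 3+ takes a
# {member: score} mapping.
r.zadd("leaderboard", {"alice": 4200, "bob": 3100, "carol": 5000})

# Bump a player's score atomically.
r.zincrby("leaderboard", 150, "bob")

# Top three players, highest score first, scores included.
top = r.zrevrange("leaderboard", 0, 2, withscores=True)
print(top)  # [(b'carol', 5000.0), (b'alice', 4200.0), (b'bob', 3250.0)]
```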
And hashes: they are my favorite. A hash is a data structure suited for object representation. Take for example a customer record: that customer has attributes, like the customer name and the customer address. If you're using a hash, the key might be your customer ID, and all the fields and values associated with that key are the attributes related to that customer. Now why is a hash cool? You could create a JSON object and store it in a string, but what I like about hashes is that you can do operations on individual fields. You can set the data in a hash, which is memory efficient in Redis, and then query individual elements: maybe within my individual customer I just want to know his address, so I query for just that individual field. It's great; there's a small sketch of these field-level operations below.

One of the questions I get a lot from customers who are new to caching is: which one should I use, Memcached or Redis? One of the things I like to do is think backwards and start with what languages you're using. You might be using a language that has sophisticated support for Memcached, maybe baked into the framework, and if you're just doing caching, maybe Memcached is good enough. However, if you need caching but think there might be other use cases for your data, I'll tell you to use Redis. I think Redis in a lot of ways does what Memcached can do, but it can support additional use cases, and we'll take a look at those. Some of the value propositions you get with ElastiCache: I mentioned the first one, it's fully managed, which means you just have to worry about the data you put in a cluster and the size of your cluster; patching, failover, and all those additional processes we call heavy lifting are removed from your plate. The other thing is that it's open-source compatible, so if you already have code written against Redis on EC2, you can easily port that code over to ElastiCache, and the same is true for Memcached; we support all the open-source protocols. There is also no cross-AZ data transfer charge, and this is sometimes overlooked. Say you're running Redis on EC2 in a Multi-AZ environment: your nodes are communicating with each other, and there's an additional charge for that traffic. If you look at the actual charge for your EC2 instances plus the data out, you're in the same ballpark as using ElastiCache, so you might as well just use ElastiCache, and this is especially true when your node size or your cluster size is large.

The other thing you get is the enhanced Redis engine, and these are really some of the lessons learned that we've heard from our customers and built into the service. The first feature: we've heard from customers that memory management can be challenging with Redis. One example is when you're running Redis with background processes like snapshots and syncing; especially with snapshots, depending on how many writes are occurring on the primary, Redis may use up to double your memory footprint on that instance, and if you don't have enough memory for those background processes, you'll go into swap. You don't want to be in swap when you're running an in-memory database system. So we have enhancements that detect that situation: we look at how much memory you have available on your instance and put you into a forkless backup scenario.
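Picking up the customer-record hash described above, here's a minimal redis-py sketch of those field-level operations; the key and field names are hypothetical, and the `mapping` argument assumes redis-py 3.5 or later.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Store a customer record as a hash: one key, many fields.
r.hset("customer:1001", mapping={
    "name": "Jane Doe",
    "address": "123 Main St",
    "tier": "gold",
})

# Read back a single field without fetching the whole object.
print(r.hget("customer:1001", "address"))  # b'123 Main St'

# Or pull the whole record at once.
print(r.hgetall("customer:1001"))
```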
The next one is write throttling. If you have a lot of writes hitting your primary, we've heard from customers that this can be challenging because your read replicas may fall out of sync, so we have controls that detect that scenario and throttle some of those writes to make sure your cluster stays in sync. The last one I'll talk about is a smoother failover process. Say you have a primary and a couple of replicas, and the primary fails: one of your replicas will be elected to be the new primary, but with open-source Redis on EC2, the data in any other replicas you have will be flushed. We've enhanced that process so the whole failover runs smoother: we don't flush the data in the other replicas, we just make sure they're in sync with the newly elected primary.

Let's talk about some usage patterns. The most popular one is caching, and there are a couple of drivers for caching. The first is that you want to alleviate pressure on your database: maybe your database can't scale, and it doesn't matter what your database is; it could be Cassandra, DynamoDB, an RDBMS-based database, anything. The other driver is that the performance you're getting out of your database isn't good enough and you want to lower that latency. With a few lines of code you can augment your architecture and add a caching layer, and that caching layer automatically gives you a higher level of throughput: up to 20 million reads per second and up to 4.5 million writes per second, which is crazy. The second thing is that it's cost effective. Why do I say that? Because the cost of scaling your back-end database, which will never give you the low latency of a caching system, is a lot higher than adding a caching layer. And the third is better performance: you're getting microsecond speed, and in today's day and age you want fast response times to power your applications. Now, if you're using DynamoDB, what's cool is that you can have a trigger that automatically populates the data into ElastiCache: an update hitting your DynamoDB table goes into a DynamoDB stream, a Lambda function is triggered off that stream, and it populates your data into ElastiCache. Outside of caching, this is great for decorating your data, because you might not want to put the data into ElastiCache in the same shape it's stored in DynamoDB; you might want to augment, enhance, or enrich that data, and you can just code that into your function. Once you set up this function, it's just going to work for you. This is a variant of the write-through pattern; a rough sketch follows below.
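This isn't AWS sample code, but a rough sketch of what that stream-triggered function could look like: a Python Lambda handler reading DynamoDB stream records and writing them into Redis. The table shape, key prefix, TTL, and endpoint are all hypothetical.

```python
import json
import redis

# Connection created outside the handler so Lambda can reuse it
# across invocations.
r = redis.Redis(host="my-cache.xxxxxx.use1.cache.amazonaws.com", port=6379)

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            item = record["dynamodb"]["NewImage"]
            # "Decorate" or reshape the item here if you don't want it
            # cached exactly as it's stored in DynamoDB.
            key = "product:" + item["id"]["S"]  # hypothetical attribute
            r.set(key, json.dumps(item), ex=3600)  # illustrative 1-hour TTL
    return {"processed": len(event["Records"])}
```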
Now, I said earlier that in a few lines of code you can augment your solution by adding a cache; let's see if I'm lying. For write-through, there are two lines highlighted on this slide. Essentially, the way write-through works is that you write to your system of record, your database, and immediately after you write to your database you write that same data to your cache. What's great about a write-through pattern is that you are proactively filling, hydrating, your cache with data you think is usable. The con is that you have the potential of putting data into the cache and using more memory than you probably need. On the other hand, another common pattern is lazy loading. The way lazy loading works is that you check your cache to see if a value is there; if it is not, you retrieve it from your system of record, your database, and then you set that data into your cache. The value is that you only set data your application actually needs; the con is you have a higher chance of getting a miss, where the data is not in your cache, which might not be best for your performance. In practice, people typically use both of these patterns and augment them with a TTL, an expire parameter, based on how frequently their data changes in the database. So you have to understand how your data changes in your system of record and then apply a TTL that corresponds to that data; both patterns are sketched below.
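A minimal sketch of both patterns in Python, assuming a redis-py connection and a stand-in `db` object for the system of record; the key names and TTL value are illustrative.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 300  # tune to how often the data changes in your database

def save_customer(customer_id, data, db):
    """Write-through: write the database first, then immediately the cache."""
    db.save(customer_id, data)  # system of record
    r.set(f"customer:{customer_id}", json.dumps(data), ex=TTL_SECONDS)

def get_customer(customer_id, db):
    """Lazy loading: check the cache, fall back to the database on a miss."""
    cached = r.get(f"customer:{customer_id}")
    if cached is not None:
        return json.loads(cached)  # cache hit
    data = db.load(customer_id)    # cache miss: fetch from system of record
    r.set(f"customer:{customer_id}", json.dumps(data), ex=TTL_SECONDS)
    return data
```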
Okay, so we're talking about caching; another example I'll quickly go over is session caching. When you're in a distributed environment with web applications, it's important to abstract your sessions and put them in a distributed cache, and this is great especially if you have a fleet of servers that can grow and shrink based on usage. A lot of the frameworks you're already using support this; in this PHP example, I can augment my solution with just a couple of configuration changes. I don't have to create a session manager or do anything else, and now my application is using, in this example, Memcached. The same is true for Redis, and for a variety of programming languages; you can take a look at that GitHub repo for how to actually do this.

IoT is an emerging need we're seeing with customers. Imagine you have a solution with devices capturing, say, sensor information. One of the ways you could handle this, and there are a variety of ways, is to create an AWS IoT rule that triggers a Lambda function, and after that Lambda function is triggered, whatever sensor information is coming in is persisted into your Amazon ElastiCache engine. Why is that good? Because you're not paying for request rates, you're not paying for throughput; you're only paying for the instance type you've selected, and you can support the 20 million reads and 4.5 million writes per second. That's fast; that's going to support all your device data. Now say you want to capture that data in another repository, maybe for longer retention: you can always augment your solution and dump that data into DynamoDB, or create a data lake by putting the data in S3 and running EMR jobs on top of it, or archive it into Amazon Glacier. If we look at this particular example for the IoT rule, you'll see it's pretty simple. It's a Node.js example, and I have the code checked into GitHub as well, so you can play around with it. The ZADD is essentially how you add sensor data to a sorted set: the sorted set is called "sensor data", and the date is my score. We talked earlier about what a score does in a Redis sorted set; what I want to do here is capture time-series data, and it's important for me to have the actual time when each event occurred. When I query this data out of the sorted set, I'll do it in reverse order, so the most recent data is returned back to me first. The HMSET is how you persist data into a hash: the first value is my key, so the device ID is my key, and all the additional attributes are the fields and values associated with my hash. I'm wrapping this in a MULTI command because I just want to queue up these commands and execute them as one transaction with Redis; a Python sketch of the same pattern follows.
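Here's a rough Python equivalent of the Node.js example described above: ZADD into a sorted set with the timestamp as the score, HSET into a per-device hash, both queued and executed as a MULTI/EXEC transaction via a redis-py pipeline. Key names and fields are hypothetical.

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def record_reading(device_id, temperature, humidity):
    ts = int(time.time())
    pipe = r.pipeline(transaction=True)  # queues commands, runs MULTI/EXEC
    # Time-series index: the score is the timestamp, the member is the event.
    pipe.zadd("sensor-data", {f"{device_id}:{ts}": ts})
    # Latest attributes for the device, stored as a hash.
    pipe.hset(f"device:{device_id}", mapping={
        "temperature": temperature,
        "humidity": humidity,
        "last_seen": ts,
    })
    pipe.execute()

# Most recent events first, matching the reverse-order query described above.
latest = r.zrevrange("sensor-data", 0, 9, withscores=True)
```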
Another popular use case is streaming data. With Amazon Kinesis Streams, you can have an AWS Lambda function trigger as soon as records land on the stream, and as I mentioned earlier, as that data comes out of the stream you can decorate it, do something with it, and then persist it into Amazon ElastiCache. You can always have an EC2 instance on the right-hand side that queries ElastiCache, maybe just to watch that data move or to do something with it, and as before, you can always augment the solution and store that data in another system of record. Streaming data enrichment is another interesting pattern we're seeing customers use. Imagine data coming into a stream from various sources; that's a raw stream, not a cleansed stream, so you want to do something with the data before you start using it. You collect the data from the stream, have an AWS Lambda function trigger when records arrive, and persist that data into Redis. Say you want to dedupe the data: throw it into a set. Say you want to decorate the data: take data already persisted in Redis and, based on the records coming in, enrich it. Then, once your data is cleansed, throw it into a cleansed stream, and you can run any type of operation on that stream; you can even run SQL against it using Kinesis Analytics. Spark Streaming with Redis is an interesting use case too. Typically when you're doing data analytics, you might have data coming in from a Kinesis stream, a Spark Streaming job pulling data out of that stream, summarizing it, augmenting it, and dumping it into S3; then, once it's in S3, your data lake, you might have Redshift or another EMR job pick up that data and do something with it. The interesting trend here is that if you augment your Spark job with Redis, you will speed up that performance by orders of magnitude. Why? Number one, pulling data out of an in-memory system is faster than pulling data from a file-based or SSD-based system. The second reason is that if you think about the kind of code you typically write in Spark, you're usually sorting data, aggregating data, doing some sort of function that a lot of these advanced data structures in Redis can do for you, reducing your code complexity. So from both angles it's an interesting project to take a look at.

Redis in a Multi-AZ environment: let's take a look at that. The first thing I'll call out is that this is a non-clustered environment; we'll talk about clustered environments a little later in this presentation. Typically you have writes issued to your primary, and your primary is asynchronously replicating to your read replicas. There's an option for Multi-AZ, and when you enable it, we put your read replicas in different AZs and enable failover. When a failover happens, we take the DNS name from your primary and propagate it to one of your read replicas; we select the read replica that has the lowest replication lag in your cluster. One thing we always recommend as well: when you take snapshots, do them on a read replica so you don't interrupt your primary. Let's visualize it. You have your applications talking to your primary, which has a border around it. Something happens to that primary: DNS propagation happens, a read replica becomes the new primary, and then we replace that replica. In this particular example you'll see there's another data store, in this case DynamoDB; it's showing that your application can talk to a variety of databases, but the databases themselves don't talk to each other. Now, one question I get a lot is: how do I know what my read replicas are, especially in an environment where the replicas can change? We provide an API that allows you to query your replication group: you pass in a replication group ID and you can easily get all the attributes associated with your cluster, and once you have your replicas, you can issue reads against them and take advantage of them. An example of how to do that is checked into that GitHub repo, and there's a small sketch below.
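A minimal sketch of that lookup with boto3's ElastiCache client; the replication group ID is hypothetical, and the ReadEndpoint and CurrentRole fields shown apply to cluster-mode-disabled groups.

```python
import boto3

ec = boto3.client("elasticache")
resp = ec.describe_replication_groups(ReplicationGroupId="my-redis-group")

for group in resp["ReplicationGroups"]:
    # A non-clustered group has a single node group (shard).
    for member in group["NodeGroups"][0]["NodeGroupMembers"]:
        endpoint = member["ReadEndpoint"]
        print(member["CurrentRole"], endpoint["Address"], endpoint["Port"])
```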
Alright, so what's new? Roughly two months ago we announced a new feature: Redis cluster. We support up to 3.5 terabytes and, as I mentioned earlier, up to 20 million reads per second and up to 4.5 million writes per second. All the enhancements we talked about earlier are rolled into this version. It's up to four times faster than version 2.8, and that's because failover is not based on DNS; we'll talk about how that actually works. There's cluster-level backup, so you don't have to back up individual nodes, and we support up to 15 shards within your cluster. The other thing I'll mention is that it's fully supported by AWS CloudFormation; if you're not familiar with that, it's a template engine that allows you to build up environments, and it's supported in all AWS regions. There are also two additional data types supported with version 3.2: one is the BITFIELD command, and the other is geospatial. I love geospatial; I think it's awesome. It's an awesome way to build geo-aware code. Say for example you have a mobile application, and in that application you want to advertise particular points of interest to your customers. Those points of interest could be anything: restaurants, anything you really want to advertise. In that mobile application you can pass up the longitude and latitude of the customer's position, and based on GEOADD, which is how you add those points of interest to the cluster, you can find all the points of interest within a particular radius. So I can say: this customer is in this particular location, let me find everything within, say, a mile of this customer, and let me send those recommendations up to the mobile app. Again, as I mentioned, Redis is an in-memory system, and we're dealing with microsecond performance; that's why this is awesome, because realistically, with an application like that, you want to serve those recommendations as quickly as possible. In addition to finding all the points of interest within a particular radius, you can also do other things, like computing the distance between two points; a small sketch follows.
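A small sketch of those geospatial commands in redis-py; the key, places, and coordinates are made up, and note that the geoadd argument style varies between redis-py versions (version 4 and later take a sequence, as shown).

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Register points of interest: longitude first, then latitude.
r.geoadd("points-of-interest", (-73.9857, 40.7484, "esb-restaurant"))
r.geoadd("points-of-interest", (-73.9772, 40.7527, "grand-cafe"))

# Everything within a mile of the customer's reported position,
# with distances included.
nearby = r.georadius("points-of-interest", -73.98, 40.75, 1,
                     unit="mi", withdist=True)
print(nearby)

# Distance between two stored points.
print(r.geodist("points-of-interest", "esb-restaurant", "grand-cafe",
                unit="mi"))
```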
Scaling with Redis cluster: how do you tell the engine that you want to horizontally scale? The first thing you do is check cluster mode; otherwise, we'll assume you want the primary-and-read-replica, vertically scaled architecture we looked at earlier. Before we dive into what sharding looks like, let's talk a little bit about the client. Essentially, the way sharding works is that there are 16384 total hash slots, and those hash slots are distributed across your shards. By default that distribution is equal, so if you create five shards, the hash ranges will be distributed equally across those five shards. Your client, meaning the driver or client code you're using to connect to Redis, has a map of every shard with its hash ranges, as well as any read replicas for that individual shard. What's also great is that a lot of clients load-balance reads for you. The other thing I'll mention about the client is that, because all that information is in a map within the client, you don't need DNS propagation: all the IP addresses and cluster information are built into the map, so the client itself knows where to route the traffic. This, of course, is much better than something like a proxy. Why? Because the map is right there with your code, you're talking directly to the cluster, and you eliminate any extra network hop between your code and the actual Redis engine.

Alright, let's visualize this. The outside blue border is your Redis cluster; again, that cluster can have up to 15 shards, and in this example we have three. The nodes with the gray border are your primaries, and any other node with the same slot range is, in this example, a read replica. Our first shard has a slot range of 0 to 5454, and you can see which nodes are its read replicas; in this example they're in different AZs. I have two other shards with different hash ranges associated with them. With ElastiCache you can have up to five read replicas for each shard and, as I mentioned, up to 15 total shards; all of these nodes together are your total cluster, and in this example we have nine. How do you set that up? In the console, the first thing you do is give your cluster a name, in this example "my Redis cluster". The second thing you do is select the engine version, 3.2.4. Then you select a node size; this is the node size for each shard, so in this case it's 13.5 gigabytes times three shards for the total memory space of my cluster, and I'm saying I want two replicas per shard, so nine nodes in total. I mentioned earlier that by default we assume you want equal distribution: we take the total hash slot range and divide it across each individual shard. But you may have a hot key, say a key you're always reading from, so we give you the capability to change the slot and key-space orientation of your cluster. It will also, by default, spread your nodes across AZs, unless you have a particular use case where you want your nodes in a particular AZ. From the client side, you just point a cluster-aware client at the cluster; there's a quick sketch below.
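For illustration, connecting a cluster-aware client might look like the following sketch; it assumes redis-py 4.1 or later, and the configuration endpoint shown is hypothetical.

```python
from redis.cluster import RedisCluster

rc = RedisCluster(
    host="my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com",
    port=6379,
)

# The client fetches the slot map once, then routes each key directly to
# the shard that owns its hash slot; no DNS propagation, no proxy hop.
rc.set("user:42:name", "alice")
print(rc.get("user:42:name"))
```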
Let's get to failure scenarios. Assume your primary fails; this is an easy scenario. What we do is promote one of your read replicas to be the new primary; your previous primary becomes a replica and we repair it. Your client is aware of this, and it typically happens within 30 seconds. As far as your reads, there's no interruption: assuming you have read replicas, your application can continue reading from the cluster. You might have some write interruption while that read replica becomes the new primary, but that's up to 30 seconds. Another example is when two or more primaries fail within your cluster; in this example we have three total shards, so that's two primaries failing. Why this is challenging is that if you're running this on EC2, and we've heard customers tell us this, a majority of your primaries failing causes a problem, because you need a majority of primaries to be available in order to elect new primaries from read replicas. We have controls around this: essentially we look at your entire cluster health, and if you're in a scenario where you don't have the majority needed to elect new primaries, we have controls built into the engine that resolve the problem. You don't have to do anything; we do this for you. How do you get from a non-clustered environment to a clustered environment? It's pretty easy: you take a snapshot of your cluster and then restore that snapshot onto each one of the individual shards in your new cluster. The way cluster mode works, each shard owns a particular hash range, as we mentioned earlier, so we'll discard any keys that aren't applicable to that shard's hash range. If you want to migrate from a non-clustered version to the non-clustered 3.2 version, that's seamless: a couple of clicks and it just works. The other thing I'll call out is that it's important to make sure your client supports Redis cluster, so look up the client you're using and make sure it supports Redis 3. CloudFormation is fully supported out of the box: building your cluster, how many read replicas you want, all of that is supported, so if you want to automate your environments, maybe build up an environment for test and dev and then terminate it, you can do all of that through CloudFormation.

Okay, best practices; I'll go through a few of these so we can leave time at the end of the presentation. These are just a few that we snuck in. The first one is to avoid really short key names. I know a lot of people try to be extremely memory efficient and want the smallest abbreviated key name possible, so they pick a key name that doesn't make sense to an application developer; instead, pick something with a logical schema name that's easy to code against. The second thing is to use hashes, lists, and sets when possible; these are memory-efficient collections. Think about it this way: if you compare five individual keys with one hash holding five values, the hash has the smaller memory footprint. And the last thing: I see some people with the KEYS command coded into their application. Don't do that; it's a blocking command. Instead use SCAN and iterate through the results you get back, like this:
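A quick sketch of that advice in redis-py; the key pattern is hypothetical.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Don't: r.keys("session:*") blocks the single Redis thread while it
# walks the entire keyspace.

# Do: iterate incrementally with SCAN; scan_iter manages the cursor for you.
for key in r.scan_iter(match="session:*", count=100):
    print(key)
```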
Alright, a few more things from a Redis cluster standpoint; I'll just mention a few of them. Have an odd number of shards. I talked earlier about how, even in a situation where the majority of your primaries fail, we will fix that scenario, we have controls around it, but it's still good to have an odd number because it speeds up the overall failover process. The other thing is that for critical workloads you want a few read replicas associated with each shard. For swap usage, you never want to be in swap memory, so you always want to see that at zero or very low. For reserved memory: as I mentioned, background operations can double your total memory footprint, so with ElastiCache we recommend 25 to 30 percent reserved memory, just to make sure there's additional headroom for those background operations in Redis. A few things I'll call out here are CloudWatch metrics; for every CloudWatch metric you can set up an alarm. First one, CPU: typically don't go past 90 percent, and remember Redis is single threaded, so you want to divide that figure by the number of cores you have. Swap usage: keep it low; again, this is an in-memory system and you never want to be in swap. Cache misses versus hits: you want more hits, because if you're getting value out of your cache, you should be finding data in the cache. Evictions: this is when Redis or Memcached jumps in and starts evicting keys because of poor memory management. You never really want to run into evictions unless it's intentional, maybe because you're following a particular algorithm, like Russian doll caching, where you want this behavior; otherwise I wouldn't recommend it. There are eviction policies you can look at, and if you are going to evict, select the one that makes sense for your application. The other thing I'll mention is max connections: clients can open up to 65,000 connections per node, but it's good to set parameters around timeouts and TCP keepalive, and make sure you're killing dead or idle connections; just get rid of those. And that's pretty much all I'm going to cover in this session. Just to recap: Amazon ElastiCache supports a variety of use cases. We're not talking about just caching, although that's what the name kind of sounds like; we're talking about a lot of use cases that support fast, fast-moving data. Second, as you saw, with a few lines of code it's very easy to augment your solution, whether you use Memcached or Redis, with lazy loading or write-through; a lot of frameworks already support these caching solutions, so in some cases it's just configuration changes. And lastly, you can support terabytes worth of data with millions of IOPS to really power your architectures; it's an interesting, high-ROI move to make ElastiCache part of your solution. Thank you guys, that's all I have for this presentation.

Alright, thanks Michael. Hey everyone, my name is Brian Kaiser, I'm the CTO of Hudl. Today I'm going to talk about what Hudl is, how we got into basic caching, our journey from Memcached to Redis and ElastiCache, and some best practices we learned along the way. Hudl is a sports platform that allows coaches, analysts, and athletes to win with video and analytics. We were founded by myself and two partners about ten years ago, really focusing on football at the professional level; nowadays we work broadly across sports, from soccer, basketball, and football, and from peewee and youth teams all the way up to the NBA, the NFL, and the English Premier League. In fact, over 98 percent of American football teams use our product, and the entire English Premier League is a customer now, so it really is broadly applicable both domestically and internationally. This is just a cool example of some of the more advanced analysis we're seeing the English Premier League do around player-tracking data, really the cutting edge in the space of sports analytics, as you've probably read about online. Some quick fun facts on our platform: we have over 130,000 teams internationally using the product and over four and a half million active users. We store and serve over two billion videos on S3, so needless to say we have a lot of video on S3, and S3 really likes our usage. We ingest and encode over 35 hours of HD video per minute during our primary sports season, get that encoded and served back out, and we're servicing over 15,000 API requests per second during that same time span; every one of those API requests is actually multiple cache hits, as we'll talk about. We've been on Amazon pretty much since the start of Hudl; in fact, if you look at the dates, the start of AWS is right about the time Hudl started, and it really made sense for us. We needed the ability to scale very quickly, we handle very seasonal traffic workloads, which is a great fit for Amazon, and we need the ability to deliver high performance to teams no matter what region of the world they're in; Amazon checked all those boxes for us. We run a fairly standard microservices architecture at Hudl: we use an ELB as our primary entry point, which spreads out to our routing layer, nginx boxes in each availability zone, that talk to Eureka. Eureka is a service-discovery system written by Netflix, a wonderful piece of open-source software. We use Eureka to see what services are online, what servers are available, and what routes we need, and then we route down to the appropriate squad's microservice cluster in each applicable availability zone.
The entry point after that is, of course, our web tier; pretty standard, we run IIS as our primary web server, and it's just an Auto Scaling group across the three AZs. Each squad is able to determine the supporting services needed to meet their needs, whether it's ElastiCache, DynamoDB, SQS, whatever it may be; each squad uses those Amazon services directly, and MongoDB is our primary data store at the bottom layer. Now, we've had caching for a long, long time; we got started with Couchbase on top of Memcached, and honestly it was a pretty logical fit for us. Hopefully it's obvious from this presentation, from what Michael talked about, that caching is easy to implement and very high impact, and we recognized that early on. Honestly, for us it's not just about performance; it's also about some things that are a little less obvious. We noticed that it helped us smooth out volatility in things like EBS latency that might spike, or network blips in our internal infrastructure, and we also found it a very effective way, as a smaller startup, to scale cost-effectively without having to increase our database usage significantly. We started making the transition to Redis a couple of years ago, and it's been quite impactful for us; I like to think of it now as kind of the Swiss Army knife of our infrastructure. Not only does it give you that very basic key-value store, it also has so many advanced capabilities and data structures; we use it for queuing and pub/sub, and we now serve over 80,000 requests per second through our Redis clusters. In preparing for this presentation I did some calculations and found that our average latency is less than one millisecond, which is pretty astounding; even to me, that's one to two orders of magnitude quicker than what we see from our traditional database fetches.

So now I'm going to walk through some basic use cases of how we use Redis at Hudl, starting with the most simple one: lazy-loading basic data caching. We put this data caching at the lowest level possible, right in front of the database calls, in the database functions, and we do that to allow for very easy cache invalidation; we found that when our caching code was spread out broadly, or up at the service layer, invalidation became very difficult, so in our opinion you want it as low as possible for those invalidations. Here's a very basic get utility function that we wrote; this is .NET code. One of the things I think is somewhat unique is the first chunk, where it checks _redisEnabled.Value: Hudl has a system of feature toggles where we can turn different pieces of our system and architecture on and off dynamically, on a per-cluster basis. For example, with Redis, if we need to do maintenance on a certain cluster, or we're having any kind of connectivity issue, we can hit one toggle, and we actually use SNS for the propagation, and turn off Redis access across that entire cluster. Now, we have some pieces of our system that are extremely high volume, and just turning Redis off in a binary way would potentially cause a thundering-herd effect and a massive load on our database, so in those cases it's much more of a broad-range toggle that might go between 0 and 100, and we can ramp our usage of these features up and down depending on maintenance, the time of day, things like that. The next line you see in bold is just where we actually get a value out of Redis; it comes back as a byte array, very, very simple, and then we deserialize it in the bottom line, in this case using protobuf, taking that byte array and deserializing it into an object. Again, this is incredibly simple, but I want to show how simple it actually is, and this is the code we use for it.
On the flip side, here's our put method. You can see again the toggle at the top that allows us to control access to Redis; we serialize the object into a byte array, and then we pop it into the cache with a TTL on it. How long? It really depends on the frequency of access of our objects: anywhere from five minutes to an hour on the low end, up to one, seven, or thirty days, for the amount of time we keep objects in the cache. A slightly more complicated example is a utility function we wrote to enforce this lazy-loading pattern; this is our get-or-put. Again we have the toggle at the top; we attempt to get the key out of the cache; if it's a cache hit, we return that value right away; if it's a miss, we're able to pass in an accessor function, which may be a DB access or maybe a call to another microservice, and then we put that cache key in. Again, this is ultra simple; we write these functions to help enforce a good pattern and make sure we don't have any problems or bugs down the road. So what are some examples of ways we use this? The most common one for us is the auth token: every call that comes into Hudl gets authenticated, that auth token is in the cache, and we check it every single time, so these keys alone account for over 15,000 requests per second, just for our auth tokens. The next big one is user information. This is a perfect example, as Michael mentioned, of where a hash is a wonderful fit in Redis: we can store the relevant information for a user, like their jersey number, their email address, and what teams they're a part of, in a simple hash in Redis, and either get that whole object out, which is actually the most common case for us, or individual pieces as we need them. Likewise, we store information on our teams, the sport they're part of, the level, whether it's a high-school, collegiate, or pro team, in a very simple hash in Redis; it just makes a whole lot of sense.

Let's move on to a slightly more advanced example: our news feed. You can think of the news feed kind of like a Facebook news feed; it's an essential source of information for our users and teams. A good example would be a coach posting a new highlight reel or playlist or video for a team to watch, and you can imagine it populating into the news feed. This is a function that makes incredibly heavy use of Redis for us; in fact, it's such a heavy user of Redis that if Redis has a problem or goes down for any reason, we have to turn off this piece of functionality in our system, because it would quickly overload our database. I think that will make sense in a second when I talk about the access patterns. At the top of the page here we use a hash: we're storing things like the background image for the team, the location of the team, the view count for the page or for highlight reels, and the number of followers of the team. Again, this is a wonderful use of a hash: on this page load we retrieve the entire thing, and we can atomically increment things like view counts and followers. It's a very simple but highly effective and highly efficient way of storing this data; a rough sketch of those atomic field increments follows.
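A rough redis-py sketch of that page-metadata hash with atomic increments; the key and field names are invented, not Hudl's actual schema.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Page metadata for a team feed, kept as one hash.
r.hset("team:512:page", mapping={
    "background_image": "s3://bucket/bg.jpg",
    "view_count": 0,
    "follower_count": 0,
})

# Atomic increments on individual fields; no read-modify-write race.
r.hincrby("team:512:page", "view_count", 1)
r.hincrby("team:512:page", "follower_count", 1)

# One round trip pulls the whole page header on load.
print(r.hgetall("team:512:page"))
```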
The bulk of the page here is the actual feed itself, and I think this is a somewhat novel approach that one of our developers came up with. Essentially, this is a single Redis list of feed IDs with a little bit of metadata around them. Let's say a coach, as in the previous example, is publishing a video: we go through every single user on that team and push the ID and information onto the Redis list for that user, which is a very quick operation, and then we trim the list to keep it at a constant length; in our case we keep the list at right around 500 items. When the user is scrolling through this feed, we know the offset, so we can grab a range off that Redis list of IDs and information, and then hydrate that with the full items out of the cache in parallel with a multi-get. I know that's a lot, but effectively what it means is that we're able to generate this news feed in literally a couple of milliseconds on page load. It's a very efficient operation for something that would otherwise be very complicated, and it's a wonderful example of using lists coupled with caching and a multi-get; a sketch of this fan-out approach follows below.

Let's talk about another example: distributed locking. This is kind of a dirty word; often I don't encourage you to use distributed locking, but when you need it, it's really nice to have a reliable and consistent way to do it atomically, and Redis provides that. In this example, we have a lot of coaches at their games filming with iPads, and fragments of this video are flowing directly to S3, little snippets of the game live while it's happening. We're then making calls into a queue, which feeds the very large worker farm we have that does that video encoding I talked about, those 35 hours of video per minute. As these jobs come in, we're grabbing those chunks of video, encoding them into the necessary qualities, sanitizing them, and getting them ready for streaming. The challenge is that this all happens in parallel at a very broad scale, so as these jobs come back, not only are we writing them to MongoDB out of this queue, we also have to use a distributed lock to coordinate the delivery and finalization of the video. It's a small piece and it's very quick, but it's incredibly important to get right, so we know when a video is ready for the user, and that's what we actually use ElastiCache for. This is a thing that ElastiCache does extremely well in our experience, and if you have to do distributed locking, I'd recommend you check it out.

So let's answer the question: why do we use ElastiCache versus just running Redis ourselves? Michael gave some great examples of the benefits, and I can tell you from personal experience that we've seen them; for us it's also an operational question, with an operational answer. We are organized as a company into product teams like Spotify: we pretty blatantly ripped it off, and I appreciate all the documentation they've done on this; in fact, these are screenshots from a slide deck they've posted online. What this model is about is tribes and squads, and pushing decision-making and autonomy down as low as possible, and that includes not just requirements gathering and working with customers but also the operational side: what services you want to run, how you want to run the servers, monitoring, scaling, and all of that. So having a managed service like ElastiCache is really critical to allowing our squads to operate autonomously.
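Hudl's actual implementation is .NET, but here's a rough Python sketch of the fan-out-on-write approach Brian describes; the key names, item-cache layout, and helper signatures are all hypothetical.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)
FEED_LENGTH = 500  # Hudl keeps roughly this many items per user

def publish_to_feeds(item_id, metadata, team_user_ids):
    """Fan out one published item to every team member's feed list."""
    entry = json.dumps({"id": item_id, **metadata})
    pipe = r.pipeline()
    for user_id in team_user_ids:
        pipe.lpush(f"feed:{user_id}", entry)               # newest at the head
        pipe.ltrim(f"feed:{user_id}", 0, FEED_LENGTH - 1)  # cap the length
    pipe.execute()

def read_feed_page(user_id, offset, page_size=20):
    """Grab one page of IDs, then hydrate the full items with one MGET."""
    entries = [json.loads(e) for e in
               r.lrange(f"feed:{user_id}", offset, offset + page_size - 1)]
    return r.mget([f"item:{e['id']}" for e in entries])
```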
It's a perfect fit: each squad can determine how they want to integrate it into their cluster. Do they want to use Redis cluster? Is it a single node? What size do they want it to be? What eviction policy makes sense for them, and how do they want to manage their serialization? ElastiCache takes care of all the operational complexity, and it just works; they can focus on how they actually want to use the service on top of it. It really just makes sense. We had an analogy when we were prepping for this, where someone said it's kind of like a slam dunk, isn't it? That's the cheesy sports analogy, but I think it really resonates here. I'd be remiss if I didn't talk about Redis cluster briefly. This is a really new addition, and it's been incredibly impactful for us. I'll tell you that we don't need to use it broadly; it's kind of like a hammer, and often it's overkill for us, but where we do need it, it's incredibly important; it's the last piece of the puzzle for us to migrate all of our caching onto ElastiCache. I put a screenshot up here of that feed I talked about earlier; it's running on Redis cluster in production today, and the performance on it is really incredible. This is one screenshot I took just prepping for this presentation, of one single shard out of that feed cluster: you can see it's running right around 1.2 percent CPU while doing a hundred thousand operations per minute (not per second, sorry, per minute), but on this single shard at 1.2 percent CPU, that's really, really incredible; we're not even pushing this shard's capacity.

Alright, it's time for me to wrap up. I want to talk about some best practices we've learned over the past couple of years with ElastiCache. First and foremost, and I cannot stress this enough: if you're in production, use a Multi-AZ replica. Honestly, you just have to. At the end of the day, you're running on infrastructure where nodes need maintenance, nodes will fail, and there will be network problems, and in our experience running Multi-AZ has been a very efficient way to provide uptime. Failover is often on the order of seconds; I know they talk about 30 seconds, but for us that's been a very high bar, and it's often much quicker than that. Multi-AZ is critical. Obviously, in dev and test we run single nodes, but production is always Multi-AZ. Michael had a slide up there that he went through kind of quickly about setting up alerting; that's been critical for us, and early on we didn't do a very good job with it and we paid the price. Some of the alerts we've found very valuable: swap usage, because we've had cases where we didn't manage our memory properly, swap usage got out of hand, and all of a sudden you see horrible performance, so it's a very critical alert to set up; CPU and evictions are the two others I'd recommend alerts on, from our point of view. Very important. I also think it's worth taking the time to understand the available eviction policies in Redis; depending on your use case, different policies make sense for you. There are six of them out there and there's really good documentation online, so I won't waste your time on all of them, but as an example, there's one called allkeys-lru, and if you have a cluster that only does caching, it will take care of eviction for you based on least recently used, automatically; you basically don't even need TTLs, and it's a very easy way to phase in.
Whereas we use volatile-ttl for clusters where we have mixed things: some distributed locks, some caching, some more persistent objects. If we used the wrong policy, we might get the wrong objects evicted from memory, or the wrong access patterns might be penalized; it's very important to get that right. My final one: it's worth the time to learn and understand Redis's advanced data structures. I talked a little bit about our news feed. When we first got into caching, we were just doing basic key-value object caching, and that is impactful and makes a huge performance difference, but what really happened in our company was almost a cultural shift: taking the time to learn and understand sorted sets, lists, HyperLogLogs, and how those could improve our ability to provide access to our data for our customers, in ways that just aren't possible with our traditional database. It really opened our eyes to what we could do. I also think it's important to understand the big-O complexity of the operations on those data structures. When we were designing the news feed, we had to decide whether to use a sorted set based on date or a more traditional list, and we looked at the big-O complexity of the operations we'd need to run on that object to understand the performance implications. Again, this is all documented on the Redis site very clearly, with wonderful examples; I encourage you to check it out. So with that, I really appreciate you taking the time. I'll be available afterwards to answer questions, and I encourage you all to fill out the evaluation forms. Thank you, and have a good one.
Info
Channel: Amazon Web Services
Views: 13,052
Rating: 4.6923075 out of 5
Keywords: AWS, Amazon Web Services, Cloud, cloud computing, AWS Cloud, AWS re:Invent 2016, aws reinvent, reinvent2016, aws, cloud, amazon web services, aws cloud, re:Invent, DAT306, Michael Labib, Advanced (300 Level), Databases
Id: e9sN15a7utI
Length: 55min 39sec (3339 seconds)
Published: Thu Dec 01 2016