RailsConf 2019 - Cache is King by Molly Struve

Captions
(slow techno music) - Welcome, everyone, to Cache is King. My name is Molly Struve, and I am the lead Site Reliability Engineer at Kenna Security. If you came to this talk hoping to see some sexy performance graphs such as this one, then I'm happy to say you will not be disappointed. (audience laughing) If you're the kind of person who enjoys an entertaining GIF or two in your technical talks, then you're in the right place. If you came to this talk to find out what the heck a site reliability engineer does, also a bingo. And, finally, if you wanna learn how you can use Ruby and Rails to kick your performance up to the next level, then this is the talk for you. As I mentioned, I'm a Site Reliability Engineer. Most people, when they hear Site Reliability Engineer, don't necessarily think of Ruby or Rails. Instead, they think of someone who's working with MySQL, Redis, AWS, Elasticsearch, Postgres, or some other third-party service to ensure that it's up and running smoothly. For the most part, this is a lot of what we do. Kenna's Site Reliability Team is just over a year old. But when we first formed, you can bet we did exactly what you'd expect. We went and we found those long-running, horrible MySQL queries. And we optimized them by adding indexes where they were missing, using select statements to avoid N+1 queries, and by processing things in small batches to ensure we were never pulling too many records out of the database at once. We also had Elasticsearch searches that were constantly timing out. So we rewrote them to ensure that they could finish successfully. We even overhauled how our background processing framework, Resque, was talking to Redis. All of these changes led to some big improvements in performance and stability. But even with all of these things cleaned up, we were still seeing high loads across all of our datastores. And that's when we realized something else was going on. Now, rather than tell you what was happening, I wanna demonstrate it. So for this, I need a volunteer. I promise it's super simple. - [Shane] All right. - Get on up here. Awesome. Thank you, thank you. Okay, come right up here. What's your name? - Shane Smith. - Okay, I think I'm gonna remember this. Oh, I already forgot it, what's your name? - Shane Smith. - Maybe one more time, what's your name? - Shane Smith. - Okay, the last time, what's your name? - Shane Smith. - Okay, about how annoyed are you right now? - Getting there. - Getting there, okay. (audience laughing) How annoyed would you be if I asked you, "What's your name?" a million times? - Seems like the, "Orange you glad I didn't say banana," joke, yeah. - Yes, yes, there you go, yeah, really, really annoyed. What is one easy thing I could have this person do so that I don't have to keep asking them their name? And I'll give you guys a hint. It involves a pen and a piece of paper. Shout it out. - [Audience] Name tag. - Even simpler, just writing it down. So if you could write your name down there, you can just do your first name, it's fine. - Terrible handwriting. - It's fine, I can read it. Now that I have Shane's name written on this piece of paper, I no longer have to keep asking Shane for it. I can simply read the piece of paper. This is exactly what it's like for your Rails applications. Imagine I'm your Rails application and Shane's your datastore. If your Rails application has to ask your datastore millions and millions of times for information, eventually it's gonna get pissed off.
And it's gonna take your Rails application a long time to do it. If instead your Rails application makes use of a local cache, which is essentially what this piece of paper is doing, it can get the information it needs a whole lot faster, and it can save your datastore a whole lot of headache. Okay, that's all I needed you for, thank you. (audience clapping) The moment at Kenna when we realized it was the quantity of our datastore hits that was wreaking havoc on our datastores, that was a big a-ha moment for us. And we immediately started trying to figure out all the ways we could decrease the number of datastore hits we were making. Now, before I get into all the awesome ways we use Ruby and Rails to do this, I first wanna give you a little bit of background on Kenna so you have some context around the stories I'm gonna share. Kenna Security helps Fortune 500 companies manage their cybersecurity risk. The average company has 60,000 assets. You can think of an asset as basically anything with an IP address. In addition, the average company has 24 million vulnerabilities. A vulnerability is basically any way you can hack an asset. Now with all of this data, it can be extremely difficult for companies to know what they need to focus on and fix first. And that's where Kenna comes in. At Kenna, we take all this data and we run it through our proprietary algorithms. And then those tell our clients what vulnerabilities pose the biggest risk to their infrastructure, so they know what they need to focus on and fix first. When we initially get all of this asset and vulnerability data, the first thing we do is we put it into MySQL. MySQL is our source of truth. From there, we then index it into Elasticsearch. Elasticsearch is what allows our clients to really slice and dice their data any way they need to. In order to index assets and vulnerabilities into Elasticsearch, we have to serialize them, and that is what I wanna cover in my first story. Serialization. Particularly, I wanna focus on the serialization of vulnerabilities, because that is what we do the most of at Kenna. When we initially started serializing vulnerabilities for Elasticsearch, we were using ActiveModelSerializers to do it. ActiveModelSerializers hook right into your ActiveRecord objects. So all you have to do is define the fields you wanna serialize, and it takes care of the rest. It's super simple, which is why it was naturally our first choice for a solution. However, it became a less great solution when we started serializing over 200 million vulnerabilities a day. As the number of vulnerabilities we were serializing increased, the rate at which we could serialize them dropped dramatically, and our database began to max out on CPU. The caption for this screenshot in Slack was "11 hours and counting." Our database was literally on fire all the time. Now some people might look at this graph, and their first inclination would be to say, "Why not just beef up your hardware?" Unfortunately, at this point, we were already running on the largest RDS instance AWS had to offer. So beefing up our hardware was not an option. My team and I, when we looked at this graph, we thought, "Okay, there's gotta be some horrible, long-running MySQL query in there that we missed." So off we went, hunting for that elusive, horrible MySQL query. Much like Liam Neeson in "Taken," we were bound and determined to find the root cause of our MySQL woes. But we never found those long-running, horrible MySQL queries, because they didn't exist.
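For reference, the ActiveModelSerializers setup described above is roughly this small; the field names and the custom_fields method below are hypothetical stand-ins, not Kenna's actual schema:

```ruby
# Hypothetical sketch of a per-record ActiveModelSerializers serializer.
class VulnerabilitySerializer < ActiveModel::Serializer
  attributes :id, :title, :status, :custom_fields

  # Computed attributes like this are where the one-query-per-record
  # database calls tend to sneak in.
  def custom_fields
    object.custom_fields.map(&:name)
  end
end
```

Run 200 million records through something like this one at a time, and every one of those small lookups executes once per record.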
Instead, what we found were a lot of fast, millisecond queries that were happening again and again and again. All these queries were lightning fast, but we were making so many of them at a time that our database was being overloaded. We immediately started trying to figure out how we could serialize all this data and make fewer database calls when we were doing it. What if, instead of making individual calls to MySQL to get the data for each individual vulnerability, we grouped all the vulnerabilities together and made one call to MySQL to get their data at one time? From this idea came the concept of Bulk Serialization. To implement this, we started with a cache class. This cache class was responsible for taking a set of vulnerabilities and a client and then running all the MySQL lookups for them at once. We then took this cache class and we passed it to our vulnerability serializer, which still held all the logic needed to serialize each individual field, except now, instead of talking to the database, it would simply read from our cache class. So let's look at an example of this. In our application, vulnerabilities have a related model called custom fields. They basically allow us to add any special attribute we want to a vulnerability. Prior to this change, when we would serialize custom fields, we would have to talk to the database. Now, we could simply talk to our cache class. The payoff of this change was big. For starters, the time it took to serialize vulnerabilities dropped dramatically. Here is a console shot showing how long it takes to serialize 300 vulnerabilities individually. Takes us just over six seconds. And that's probably a pretty generous benchmark, considering it would take even longer when our database was under load. If instead we serialized those exact same 300 vulnerabilities in bulk? Boom. Less than a second to do it. These speedups are a direct result of the decrease in the number of database hits we have to make to serialize these vulnerabilities. To serialize those 300 vulnerabilities individually, we have to make 2,100 calls to the database. 2,100. To serialize those same 300 vulnerabilities in bulk, we now only have to make seven. Boom, again. As you can glean from the math here, it's seven calls per individual vulnerability, or seven calls for however many vulnerabilities you can group together at once. In our case, when we're serializing vulnerabilities, we're doing it in batches of a thousand. So we took the number of database requests we were making for each batch from 7,000 down to seven. This large drop in database requests is plainly apparent on this MySQL queries graph, which shows the number of requests we were making before, and then after, we deployed the bulk serialization change. With this large drop in requests came a large drop in database load, which you can see on the RDS CPU Utilization graph. Prior to the change, we were maxing out our database at 100%. Afterwards, we're sitting pretty chilly around 25%. And it's been like this ever since. The moral of the story here is, when you find yourself processing a large amount of data, try to find ways that you can use Ruby to help you process that data in bulk. We did this for serialization, but it can be applied any time you find yourself processing data in a one-by-one manner. Take a step back and ask yourself, "Is there a way I could process this data together, in bulk?" Because one call for 1,000 IDs is always gonna be faster than a thousand individual database calls.
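A minimal sketch of that bulk-serialization idea might look like the following; BulkVulnerabilityCache, CustomField, and the other names are illustrative, not the production code:

```ruby
# One cache object per batch: run the lookups for the whole batch up front.
class BulkVulnerabilityCache
  attr_reader :custom_fields

  def initialize(vulnerabilities, client)
    @client = client
    # One query for the entire batch instead of one query per vulnerability.
    @custom_fields = CustomField.where(vulnerability_id: vulnerabilities.map(&:id))
                                .group_by(&:vulnerability_id)
  end
end

# The serializer keeps the per-field logic but reads from the cache,
# never from the database.
class VulnerabilitySerializer
  def initialize(vulnerability, cache)
    @vulnerability = vulnerability
    @cache = cache
  end

  def custom_fields
    @cache.custom_fields[@vulnerability.id] || []
  end
end

# Usage: one shared cache per batch of, say, 1,000 vulnerabilities.
cache = BulkVulnerabilityCache.new(batch, client)
serializers = batch.map { |vuln| VulnerabilitySerializer.new(vuln, cache) }
```

With the seven lookups the talk describes, that works out to seven queries per batch instead of seven per record.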
Now, unfortunately, the serialization saga doesn't end here. Once we got MySQL all happy and sorted out, then suddenly Redis became sad, and this, folks, is the life of a Site Reliability Engineer. A lot of days we feel like this. You put one fire out, you start one somewhere else. You speed one thing up and the load transfers to another. In this case, we had transferred the load from MySQL to Redis, and here's why. When we index vulnerabilities into Elasticsearch, we not only have to make requests to MySQL to get all their data, we also have to make calls to Redis in order to know where to put them in Elasticsearch. In Elasticsearch, vulnerabilities are organized by client. So to know where a vulnerability belongs, we have to make a GET request to Redis to get the index name for that vulnerability. When preparing vulnerabilities for indexing, we gather up all their serialized vulnerability hashes, and one of the last things we do before sending them to Elasticsearch is make that Redis GET request to get the index name for each vulnerability based on its client. These serialized vulnerability hashes are grouped by client, so this Redis GET request is often returning the same information over and over again. Now keep in mind, all these Redis GET requests are super simple and very fast. As you can see, they take a millisecond to execute. But as I stated before, it doesn't matter how fast your requests are; if you're making a ton of them, it's going to take you a long time. We were making so many of these simple GET requests that they were accounting for roughly 65% of our job run time, which you can see in the table and is represented by the brown in that graph. The solution to eliminating a lot of these requests, once again, was Ruby. In this case, we ended up using a Ruby hash to cache the Elasticsearch index name for each client. Then, when looping through those serialized vulnerability hashes, rather than hitting Redis for every single vulnerability, we could simply reference our client indexes hash. This meant we only had to hit Redis once per client instead of once per vulnerability. So let's take a look at how this paid off. Given these three example batches of vulnerabilities, no matter how many vulnerabilities are in each batch, we only ever have to hit Redis three times to get all the information we need to know where they belong. As I mentioned before, these batches usually contain a thousand vulnerabilities apiece. So we roughly decreased the number of hits we were making to Redis a thousand times, which in turn led to a 65% increase in our job speed. Even though Redis is fast, a local cache is faster. To put it into perspective for you, to get a piece of information from a local cache is like driving from Downtown Minneapolis to the Minneapolis-Saint Paul Airport. It's about a 20- to 25-minute drive. To get the same piece of information from Redis is like driving from Downtown to the Airport and then flying all the way to Chicago to get it. Redis is so fast that it can be easy to forget you're actually making an external request when you're talking to it. And those external requests can add up and have an impact on the performance of your application. So with these maps in mind, remember that Redis is fast, but a local cache, such as a hash cache, is always going to be faster. So we've just seen two ways that we can use simple Ruby to replace our datastore hits. Next, I wanna talk about how you can use your ActiveRecord framework to replace your datastore hits.
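The client-index hash cache described above can be as simple as a lazily filled Ruby hash; the Redis key format and variable names here are assumptions for illustration, and `redis` stands in for a connected Redis client:

```ruby
# The Redis GET runs once per client on the first miss;
# every later lookup for that client is a local hash read.
client_indexes = Hash.new do |hash, client_id|
  hash[client_id] = redis.get("elasticsearch_index_client_#{client_id}")
end

serialized_vulnerabilities.each do |vuln|
  index_name = client_indexes[vuln[:client_id]]
  # ... add the document to the bulk indexing payload for index_name ...
end
```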
This past year at Kenna, we sharded our main MySQL database. And when we did, we chose to do it by client. So each client's data lives on its own sharded database. To help us accomplish this, we chose to use the Octopus Sharding Gem. This gem gives us this handy-dandy using method, which, when passed a database name, has all the logic it needs to know how to talk to that database. Because our information is divided by client, we created a sharding configuration hash, which tells us what client belongs on what sharded database. Then each time we make a MySQL request, we take the client ID and we pass it to that sharding configuration hash in order to get the database name that we need to talk to. Given that we have to access the sharding configuration hash every single time we make a MySQL request, our first thought was, "Why not just store it in Redis?" Because Redis is fast, and the configuration hash we wanna store is relatively small. It was small at first, but eventually that configuration hash grew and grew as we added more and more clients. Now, 13 kilobytes might not seem like a lot of data, but if you're asking for 13 kilobytes of data millions of times, it can add up. In addition to our growing configuration hash, we were also continually increasing the number of background workers that we had working, so that we could increase our data throughput. Until eventually we had 285 workers chugging along at once. Now remember, every single time one of these workers makes a MySQL request, it first has to go to Redis to get that 13-kilobyte configuration hash. It all quickly added up, until we were reading 7.8 megabytes per second from Redis, which we knew was not gonna be sustainable as we continued to grow and add clients. When we started trying to figure out how we were gonna solve this problem, one of the first things we decided to do was take a look at ActiveRecord's connection object. ActiveRecord's connection object is where ActiveRecord stores all the information it needs to know how to talk to your database. So naturally, we thought it might be a good place to find somewhere to store our configuration hash. So we jumped into a console to check it out, and when we did, what we found was not an ActiveRecord connection object at all. Instead, it was this Octopus Proxy object that our Octopus Sharding Gem had created. This was a complete surprise to us, and we immediately started digging into our gem's source code, trying to figure out where the heck this Octopus Proxy object had come from. And when we finally found that Octopus Proxy object, much to our delight, it already had all these great helper methods that we could use to access our sharding configuration. Boom. Problem solved. Rather than having to hit Redis every single time we made a MySQL request, all we simply had to do was talk to our local ActiveRecord connection object. One of the big things we learned from this whole experience was how important it is to know your gems. It is crazy easy to include a gem in your Gemfile, but when you do, make sure you have a general understanding of how it works. I'm not saying you need to go and read the source code for every one of your gems, because that would be insane and it would take you forever. But consider this. The next time you add a gem to your Gemfile, maybe set it up manually the first time in a console so you can see what is happening and how it's being configured.
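In that spirit, here's roughly what poking at an Octopus setup in a console looks like. The `shards` helper below is an assumption standing in for whichever method your Octopus version exposes for the shard configuration, and `client` is a hypothetical record:

```ruby
# With ar-octopus loaded, the "connection" is actually the gem's proxy object.
ActiveRecord::Base.connection.class
# => Octopus::Proxy

# The proxy already knows the configured shards, so the client-to-shard lookup
# can stay local instead of going out to Redis. (`shards` is an assumption.)
proxy = ActiveRecord::Base.connection
shard_name = proxy.shards.keys.detect { |name| name == "shard_#{client.id}" }

# Octopus's using method then routes the query to that shard.
Vulnerability.using(shard_name).where(client_id: client.id).count
```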
If we'd had a better understanding of how our Octopus Sharding Gem was configured, we could have avoided this entire Redis headache. However, regardless of where the solution came from, yet again, caching locally, in this case using our ActiveRecord framework as a cache, is always gonna be faster and easier than making an external request. These are three great strategies that you can use to help replace your datastore hits. Now I wanna shift gears and talk about how you can use Ruby and Rails to avoid making datastore hits you don't need. I'm sure some of you are looking at this thinking, "Pfft, duh, I already know how to do that." But let's hold up for a minute, because this might not be as obvious as you think. For example, how many of you have written code like this? Come on, I know you're out there. 'Cause I know I've written code like this. There you go, awesome. This code looks pretty good, right? If there's no user IDs, then we're gonna skip all of this user processing. So it's fine, right? Unfortunately, that assumption is false; it's not fine. Let me explain why. Turns out, if you execute this where clause with an empty array, you're actually going to be hitting the database when you do. Notice this WHERE 1=0 statement. This is what ActiveRecord uses to ensure no records are returned. Sure, it's a fast, one-millisecond query, but if you're executing this query millions of times, it can easily overwhelm your database and slow you down. So how do we update this chunk of code to make our Site Reliability Engineers love us? You have two options. The first is by not running that MySQL lookup unless you absolutely have to. And you can do that by doing an easy-peasy array check using Ruby. By doing this, you can save yourself from making a worthless datastore hit and ensure that your database is not gonna be overwhelmed with useless calls. In addition to not overwhelming your database, this is also going to speed up your code. Say you're running this chunk of code 10,000 times. It's gonna take you over half a second to make that useless MySQL lookup 10,000 times. If instead you add that simple line of Ruby to avoid making that MySQL request, and you run a similar block of code 10,000 times? Less than a hundredth of a second to do it. As you can see, there is a significant difference between hitting MySQL unnecessarily 10,000 times and running plain old Ruby 10,000 times. And that difference can add up and have an impact on the performance of your application. A lot of people will look at that top chunk of code, and their first inclination is like, "Pfft, what are you gonna do, Ruby's slow." But that couldn't be further from the truth. Because as we just saw, this simple line of Ruby is hundreds of times faster. In this case, Ruby is not slow. Hitting the database is what's slow. Keep an eye out for situations like these in your code where it might be making a database request you don't expect. And I'm sure some of you Rails folks are probably thinking, "I'm not exactly writing code like this. Actually, I chain a bunch of scopes onto my where clause, so..." (audience laughing) "I have to pass that empty array, otherwise my scope chain breaks." Thankfully, even though ActiveRecord doesn't handle empty arrays well, it does give you an option for handling empty scopes. And that is the none scope. None is an ActiveRecord query method that returns a chainable relation with zero records, but more importantly, it does it without querying the database. So let's see this in action.
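A sketch of the two options side by side; `process`, and the `active` and `recent` scopes chained on below, are hypothetical stand-ins for your own methods and scopes:

```ruby
# Without a guard, an empty array still hits the database:
User.where(id: []).to_a
# SELECT `users`.* FROM `users` WHERE 1=0

# Option one: a plain Ruby guard, so the lookup never runs when there's nothing to fetch.
if user_ids.any?
  User.where(id: user_ids).each { |user| process(user) }
end

# Option two: when scopes need to keep chaining, swap the empty-array where clause
# for the none scope -- it chains like any relation but never queries the database.
scope = user_ids.any? ? User.where(id: user_ids) : User.none
scope.active.recent.each { |user| process(user) }
```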
We know from before that if we execute that where clause with our empty array, we're going to hit the database when we do. And we're gonna do it with all of our scopes attached. If instead we replace that where clause with the none scope, boom, we're no longer hitting the database, and all of our scopes still chain together successfully. Be on the lookout for tools like these in your gems and frameworks that will allow you to work smarter with empty data sets. And even more importantly, never, ever assume your library, gem, or framework is not making a database request when asked to process an empty data set. 'Cause you know what they say about assuming. Ruby has so many easily accessible libraries and gems, but their ease of use can lull you into a sense of complacency. Once again, when you're working with a library or a gem or a framework, make sure you have a general understanding of how it works under the hood. One of the easiest ways to gain a better understanding is through logging. Set your logging to debug for your framework, your gems, and every one of your related services. When you're done, load some application pages, run some background workers, even jump in a console and run some commands. Afterwards, look at the logs that are produced. Those logs are going to tell you a lot about how your code is interacting with your datastores. And some of it might not be interacting how you would think. I cannot stress enough how valuable something as simple as reading logs can be when it comes to making optimizations in an application and finding useless datastore hits. Now, this concept of preventing useless datastore hits doesn't just apply to MySQL. It can apply to any datastore you're working with. At Kenna, we ended up using Ruby to prevent datastore hits to MySQL, Redis, and Elasticsearch, and here's how we did that. Every night at Kenna, we build these beautiful, intricate reports for our clients from all their asset and vulnerability data. These reports start with a reporting object, which holds all the logic needed to know what assets and vulnerabilities belong to a report. Every night, to build that beautiful reporting page, we have to make over 20 calls to Elasticsearch and multiple calls to Redis and MySQL. My team and I did a lot of work to ensure all these calls were very fast. But it was still taking us hours every night to build the reports. Until eventually we had so many reports in our system that we couldn't finish them all overnight. Clients were literally getting up in the morning and the reports weren't ready, which was a problem. My team and I, when we started trying to figure out how we were gonna solve this issue, the first thing we did was decide to take a look at what data our existing reports contained. First thing we decided to look at: how many reports are in our system? Over 25,000. That was a pretty healthy number for us, considering only a few months earlier, we had only had 10,000. The next thing we decided to look at was how big are these reports? A report's size directly depends on the number of assets a report contains. The more assets in a report, the longer it's gonna take us to build that report. We thought maybe we could split these reports up by size somehow, to speed up processing. So we looked at the average asset count per report. Just over 1,600. Now, if you remember back to the beginning of the presentation, I mentioned that our average client has 60,000 assets. So when we saw this 1,600 number, we thought that seemed pretty low.
The next thing we decided to look at was how many of these reports have zero assets? Woo, over 10,000. Over a third of our reports have zero assets. And if they have zero assets, that means they contain no data. And if they contain no data, then what is the point of making all these Elasticsearch, MySQL, and Redis calls when we know they're gonna return nothing? Light bulb. Don't hit the datastores if the report is empty. By adding a simple line of Ruby to skip the reports that had no data, we took our processing time from over 10 hours down to three. That simple line of Ruby was able to prevent a bunch of worthless datastore hits, which in turn sped up our processing tremendously. This strategy of using Ruby to prevent useless datastore hits, I like to refer to it as using database guards. In practice, it's super simple. But I think it's one of the easiest things to overlook when you're writing code. We're almost there. This last story I have for you actually happened pretty recently. So you remember those Resque workers I talked about at the beginning of the presentation? As I mentioned, they run with the help of Redis. One of the main things we used Redis for is to throttle these Resque workers. Given our sharded database setup, we only ever want a set number of workers to work on a database at any given time. Because what we've found in the past is that too many workers working on a database would overwhelm it and slow it down. So to start, we pointed 45 workers at each database. After making all of these improvements that I just mentioned, our databases were pretty happy, so we decided, why not bump up the number of workers in order to increase our data throughput? So we increased the number of workers to 70 on each database, and of course, we kept a close eye on MySQL. But it looked like all our hard work had paid off. MySQL was still happy as a clam. My team and I, at this point, we were pretty darn proud of ourselves, so we celebrated for the rest of the day. But it didn't last for long, 'cause as we learned earlier, often when you put one fire out, you start one somewhere else. MySQL was happy, but then overnight we got a Redis high-traffic alert. And when we looked at our Redis traffic graphs, we saw that at times we were reading over 50 megabytes per second from Redis. So that 7.8 from earlier? That's not lookin' so bad now. This load was being caused by the hundreds of thousands of requests we were making trying to throttle these workers, which you can see on this Redis request graph. Basically, before any worker can pick up a job, it first has to talk to Redis to figure out how many workers are already working on that database. If 70 workers are already working on the database, the worker will not pick up the job. If it's less than 70, then it knows it can pick up the job. All of these calls to Redis were overwhelming it, and it ended up causing a lot of errors in our application, like this Redis connection error. Our application and Redis were literally dropping important requests because Redis was so overwhelmed with all of these throttling requests that we were making to it. Now, given what we had previously learned from all of our experiences, our first thought was, "Okay, how do we use Ruby or Rails to solve this issue? Could we cache the worker state in Resque somehow? Could we maybe cache it in ActiveRecord?" Unfortunately, after pondering this problem for a few days, no one on the team came up with any great suggestions. So we did the easiest thing we could think of.
We removed the throttling completely. And when we did, the result was dramatic. There was an immediate drop in Redis requests being issued, which was a huge win. But more importantly, those Redis network traffic spikes that we had been seeing overnight? They were completely gone. Following the removal of all the requests, all those application errors that we had been seeing resolved themselves. Following the throttling removal, we, of course, kept a close eye on MySQL, but it was still happy as a clam. So the moral of the story here is, sometimes you need to use Ruby or Rails to replace your datastore hits. Sometimes you need to use them to prevent your datastore hits. Other times, you might just need to straight up remove the datastore hits you no longer need. This is especially important for those of you who have fast-growing and evolving applications. Make sure you're periodically taking inventory of all the tools you're using to ensure that they're still needed. And anything you don't need, get rid of it. 'Cause it might save your datastore a whole lotta headache. As you all are building and scaling your applications, remember these five tips. And more importantly, that every datastore hit counts. It doesn't matter how fast it is. If you multiply it by a million, it's gonna suck for your datastores. You wouldn't just throw dollar bills in the air, would you? Because a single dollar bill is cheap. Don't throw your datastore hits around, no matter how fast they are. Make sure every external request your application is making is absolutely necessary, and I guarantee your Site Reliability Engineers will love you for it. And with that, my job here is done. Thank you all so much for your time and attention. Does anyone have any questions? (audience clapping) All right, five minutes. (audience clapping) So we've got five minutes, so if anyone has any general questions. I'll also be available afterwards if anyone wants to chat. Excellent. So the question was, were we able to downgrade our RDS instance after we decreased the number of hits we were making to MySQL? And the answer is we did not. We left it at the size it was, 'cause I'm sure at some point we're gonna hit another bottleneck and then we'll have to do this whole exercise again. But we've kept it at the same size that it was at the beginning. Thanks. Great question. So the question is, are there any cache-busting lessons that we learned from the Redis example? We learned a bunch just from all of these examples, and one of which, I think the most important, is to set a default cache expiration. Rails gives you the ability to do this in your configuration files. We did not do this from the beginning, and so at one point we had keys that had been lying around for five-plus years. Set a default so that everything, at some point, will expire. And then two, finding that ideal expiration, how long a cache should live, takes some tweaking. Take your best guess and set it, and then observe what your load looks like, how clients are reacting to data. Do they think it's stale, do they not? And then from there, tweak it. That's what we have found: every time we set a cache expiration, we always go back and tweak it afterwards. Okay, so the question is, did we have any issues taking what we were storing in Redis and then now storing it in our local memory cache? And the answer is no. In this case, the cache is so small, it's literally just this quantity matches to this name, and the majority of the time, the size of that hash was only five, 10 keys.
And so it was a very small hash and obviously the payoff was super big. So in that particular case, we have not run into issues with that. Anyone else? Thanks guys. (audience clapping) (upbeat music)
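On the default-expiration point from the Q&A, a minimal sketch of what that can look like in a Rails configuration; the store choice, environment variable, key, and TTL values are illustrative, not Kenna's settings:

```ruby
# config/environments/production.rb
Rails.application.configure do
  # Give the cache store a default TTL so nothing lives forever; individual
  # writes can still override it with their own :expires_in.
  config.cache_store = :redis_cache_store, {
    url: ENV["REDIS_CACHE_URL"], # assumed environment variable
    expires_in: 1.day
  }
end

# Elsewhere, a per-write override while you tune the ideal lifetime:
Rails.cache.write("client_index_names", index_names, expires_in: 2.hours)
```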
Info
Channel: Confreaks
Views: 7,226
Id: yN1rGZbwn9k
Length: 35min 52sec (2152 seconds)
Published: Wed May 22 2019