(slow techno music) - Welcome, everyone, to Cache is King. My name is Molly Struve, and I am the lead Site
Reliability Engineer at Kenna Security. If you came to this talk hoping to see some sexy performance
graphs such as this one, then I'm happy to say you
will not be disappointed. (audience laughing) If you're the kind of person who enjoys an entertaining GIF or two in your technical talks, then
you're in the right place. If you came to this talk to find out what the heck a site reliability engineer does, also bingo. And, finally, if you wanna learn how you can use Ruby and Rails to kick your performance up to the next level, then this is the talk for you. As I mentioned, I'm a
Site Reliability Engineer. Most people, when they hear
Site Reliability Engineer, don't necessarily think of Ruby or Rails. Instead, they think of someone
who's working with MySQL, Redis, AWS, Elasticsearch, Postgres, or some other third-party
service to ensure that it's up and running smoothly. For the most part, this
is a lot of what we do. Kenna's Site Reliability
Team is just over a year old. But when we first formed, you can bet we did
exactly what you'd expect. We went, and we found those long-running, horrible, MySQL queries. And we optimized them by adding indexes where they were missing, using select statements
to avoid N+1 queries, and by processing things in small batches to ensure we were never pulling too many records out of the database at once. We also had Elasticsearch searches that were constantly timing out. So we rewrote them to ensure that they could
finish successfully. We even overhauled how our
background processing framework, Resque, was talking to Redis. All of these changes led
to some big improvements in performance and stability. But even with all of
these things cleaned up, we were still seeing high loads across all of our datastores. And that's when we realized
something else was going on. Now, rather than tell
you what was happening, I wanna demonstrate it. So for this, I need a volunteer. I promise it's super simple. - [Shane] All right. - Get on up here.
Awesome. Thank you, thank you. Okay, come right up here. What's your name? - Shane Smith. - Okay, think I'm gonna remember this. Oh, I already forgot it, what's your name? - Shane Smith. - Maybe one more time, what's your name? - Shane Smith. - Okay, the last time, what's your name? - Shane Smith. - Okay, about how annoyed
are you right now? - Getting there.
- Getting there, okay. (audience laughing) How annoyed would you be if I
asked you, "What's your name?" a million times. - Seems like the, "Orange you glad I didn't
say banana," joke, yeah. - Yes, yes, there you go,
yeah, really, really annoyed. What is one easy thing I
could have this person do so that I don't have to
keep asking them their name? And I'll give you guys a hint. It involves a pen and a piece of paper. Shout it out. - [Audience] Name tag. - Even simpler, just writing it down. So if you could write
your name down there, you can just do your
first name, it's fine. - Terrible handwriting. - It's fine, I can read it. Now that I have Shane's name
written on this piece of paper, I no longer have to keep
asking Shane for it. I can simply read the piece of paper. This is exactly what it's like
for your Rails applications. Imagine, I'm your Rails application, Shane's your datastore. If your Rails application
has to ask your datastore millions and millions of
times for information, eventually it's gonna get pissed off. And it's gonna take your Rails application a long time to do it. If instead, your Rails
application makes use of a local cache, which
is essentially what this piece of paper is doing. It can get the information
it needs a whole lot faster and it can save your datastore
a whole lot of headache. Okay, that's all I needed
you for, thank you. (audience clapping) The moment at Kenna when we realized it was the quantity of our datastore hits that was wreaking havoc on our datastores, that was a big a-ha moment for us. And we immediately started
trying to figure out all the ways we could decrease the number of datastore hits we were making. Now, before I get into
all the awesome ways we use Ruby and Rails to do this, I first wanna give you a little
bit of a background on Kenna so you have some context around
the stories I'm gonna share. Kenna Security helps Fortune 500 companies manage their cyber security risk. The average company has 60,000 assets. You can think of an asset
as basically anything with an IP address. In addition, the average company has 24 million vulnerabilities. A vulnerability is basically
any way you can hack an asset. Now with all of this data, it can be extremely difficult
for companies to know what they need to focus on and fix first. And that's where Kenna comes in. At Kenna, we take all this data and we run it through our
proprietary algorithms. And then those tell our
clients what vulnerabilities pose the biggest risk
to their infrastructure, so they know what they need
to focus on and fix first. When we initially get all of this asset and vulnerability data, the first thing we do
is we put it into MySQL. MySQL is our source of truth. From there, we then index
it into Elasticsearch. Elasticsearch is what allows our clients to really slice and dice their
data any way they need to. In order to index assets
and vulnerabilities into Elasticsearch, we
have to serialize them, and that is what I wanna
cover in my first story. Serialization. Particularly I wanna focus on the serialization of vulnerabilities because that is what we
do the most of at Kenna. When we initially started serializing vulnerabilities for Elasticsearch, we were using
ActiveModelSerializers to do it. ActiveModelSerializers hooks right into your ActiveRecord objects, so all you have to do is define the fields you wanna serialize and it takes care of the rest.
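Just to give a feel for it, a serializer like that is only a few lines of code. This is a generic, made-up sketch rather than Kenna's actual serializer, and the attribute names are invented:

```ruby
# Illustrative only -- the model and attribute names are made up.
class VulnerabilitySerializer < ActiveModel::Serializer
  attributes :id, :cve_id, :risk_score
end
```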
It's super simple, which is why it was naturally our first choice for a solution. However, it became a less great solution when we started serializing
over 200 million vulnerabilities a day. As the number of vulnerabilities we were serializing increased, the rate at which we could serialize them dropped dramatically, and our database began to max out on CPU. The caption for this screenshot in Slack was "11 hours and counting." Our database was literally
on fire all the time. Now some people might look at this graph, and their first inclination
would be to say, "Why not just beef up your hardware?" Unfortunately at this point, we were already running on
the largest RDS instance AWS had to offer. So beefing up our hardware
was not an option. My team and I, when we
looked at this graph, we thought, "Okay, there's
gotta be some horrible, "long-running MySQL query
in there that we missed." So off we went, hunting for that elusive, horrible MySQL query. Much like Liam Neeson in,
"Taken," we were bound and determined to finding the
root cause of our MySQL woes. But we never found those long-running, horrible MySQL queries because they didn't exist. Instead what we found, were a lot of fast millisecond queries that were happening over
again and again and again. All these queries were lightning fast, but we were making so
many of them at a time, that our database was being overloaded. We immediately started
trying to figure out how we could serialize all this data and make fewer database
calls when we were doing it. What if instead of making
individual calls to MySQL to get the data for each
individual vulnerability, we group all the vulnerabilities together and make one call to MySQL to get their data at one time? From this idea came the
concept of Bulk Serialization. To implement this we
started with a cache class. This cache class was responsible for taking a set of vulnerabilities and a client and then running all the MySQL
lookups for them at once. We then took this cache class and we passed it to our
vulnerabilities serializer, which still held all the logic needed to serialize each individual field, except now, instead of
talking to the database, it would simply read from our cache class. So let's look at an example of this. In our application, vulnerabilities have a related model called custom fields. They basically allow us to add any special attribute we
want to a vulnerability. Prior to this change, when we
would serialize custom fields, we would have to talk to the database. Now, we could simply
talk to our cache class.
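To make that concrete, here is a rough, hypothetical sketch of the shape of this pattern. The class and field names are invented, not Kenna's actual code; the point is that the cache runs one query for the whole batch, and the serializer only ever reads from memory:

```ruby
# Hypothetical sketch -- names are illustrative, not Kenna's real classes.
class BulkVulnerabilityCache
  attr_reader :custom_fields

  def initialize(vulnerabilities, client)
    @client = client
    # One query for the whole batch instead of one per vulnerability
    @custom_fields = CustomField
      .where(vulnerability_id: vulnerabilities.map(&:id))
      .group_by(&:vulnerability_id)
  end
end

class VulnerabilitySerializer
  def initialize(vulnerability, cache)
    @vulnerability = vulnerability
    @cache = cache
  end

  # Reads from the in-memory cache rather than hitting MySQL
  def custom_fields
    @cache.custom_fields[@vulnerability.id] || []
  end
end
```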
The payoff of this change was big. For starters, the time it took to serialize vulnerabilities dropped dramatically. Here is the console shot
showing how long it takes to serialize 300
vulnerabilities individually. Takes us just over six seconds. And that's probably a
pretty generous benchmark, considering it would take even longer when our database was under load. If instead, we serialized those exact same 300 vulnerabilities in bulk? Boom. Less than a second to do it. These speed ups are a direct result of the decrease in the
number of database hits we have to make to serialize
these vulnerabilities. To serialize those 300
vulnerabilities individually, we have to make 2,100
calls to the database. 2,100. To serialize those same 300
vulnerabilities in bulk, we now only have to make seven. Boom, again. As you can glean from the math here, it's seven calls per
individual vulnerability, or seven calls for however
many vulnerabilities you can group together at once. In our case, when we're
serializing vulnerabilities, we're doing it in batches of a thousand. So we took the number of database requests we were making for each batch
from 7,000 down to seven. This large drop in database requests is plainly apparent on
this MySQL queries graph. Which shows the number of
requests we were making before, and then after we deployed
the bulk serialization change. With this large drop in requests, came a large drop in database load, which you can see on the
RDS CPU Utilization graph. Prior to the change, we were maxing out our database at 100%. Afterwards, we're sitting
pretty chilly around 25%. And it's been like this ever since. The moral of the story here is, when you find yourself processing
a large amount of data, try to find ways that you can use Ruby to help you process that data in bulk. We did this for serialization, but it can be applied any
time you find yourself processing data in a one-by-one manner. Take a step back and ask yourself, "Is there a way I could process this data together, in bulk?" Because one call for 1,000 IDs is always gonna be faster than a thousand
individual database calls.
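As a tiny, hypothetical illustration of that difference:

```ruby
# Same 1,000 users, very different number of round trips.
user_ids = (1..1_000).to_a

users = user_ids.map { |id| User.find(id) } # 1,000 separate queries
users = User.where(id: user_ids).to_a       # one query for all 1,000 IDs
```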
Now unfortunately the serialization saga doesn't end here. Once we got MySQL all
happy and sorted out, then suddenly Redis became sad, and this, folks, is the life of
a Site Reliability Engineer. A lot of days we feel like this. You put one fire out, you start one somewhere else. You speed one thing up and
the load transfers to another. In this case, we had transferred the load from MySQL to Redis, and here's why. When we index vulnerabilities
into Elasticsearch, we not only have to make requests to MySQL to get all their data, we also
have to make calls to Redis, in order to know where to
put them in Elasticsearch. In Elasticsearch, vulnerabilities
are organized by client. So to know where a vulnerability belongs, we have to make a get request to Redis to get the index name
for that vulnerability. When preparing
vulnerabilities for indexing, we gather up all their
serialized vulnerability hashes and one of the last things we do, before sending them to Elasticsearch is we make that Redis get request to get the index name
for each vulnerability based on its client. These serialized vulnerability hashes are grouped by client,
so this Redis get request is often returning the same
information over and over again. Now keep in mind, all
these Redis get requests are super simple and very fast. As you can see, they take
a millisecond to execute. But as I stated before, it doesn't matter how fast your requests are, if you're making a ton of them, it's going to take you a long time. We were making so many of
these simple get requests that they were accounting for roughly 65% of our job run time, which
you can see in the table and is represented by
the brown in that graph. The solution to eliminating
a lot of these requests, once again was Ruby. In this case, we ended
up using a Ruby hash to cache the Elasticsearch
index name for each client. Then, when looping through those serialized vulnerability hashes, rather than hitting Redis for
every single vulnerability, we could simply reference
our client indexes hash. This meant we only had to
hit Redis once per client instead of once per vulnerability.
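A rough sketch of that pattern might look like this. The key format and variable names are made up, and `redis` is assumed to be an already-configured Redis client:

```ruby
# Hypothetical sketch -- key and variable names are illustrative.
client_indexes = Hash.new do |hash, client_id|
  hash[client_id] = redis.get("elasticsearch_index_client_#{client_id}")
end

serialized_vulnerabilities.each do |vuln|
  index_name = client_indexes[vuln[:client_id]] # Redis is hit once per client
  # ...add vuln to the bulk indexing payload for index_name...
end
```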
batches of vulnerabilities, no matter how many
vulnerabilities are in each batch, we only ever have to hit Redis three times to get all the information we need to know where they belong. As I mentioned before, these batches usually contain a thousand vulnerabilities a piece. So we roughly decreased the number of hits we were making to Redis a thousand times. Which in turn, led to a 65%
increase in our job speed. Even though Redis is fast,
a local cache is faster. To put it into perspective for you, to get a piece of information
from a local cache is like driving from Downtown Minneapolis to the Minneapolis Saint Paul Airport. It's about a 20, 25 minute drive. To get the same piece of
information from Redis is like driving from
Downtown to the Airport and then flying all the
way to Chicago to get it. Redis is so fast that
it can be easy to forget you're actually making an external request when you're talking to it. And those external requests can add up and have an impact on the
performance of your application. So with these maps in mind, remember that Redis is fast but a local cache, such as a hash cache, is always going to be faster. So we've just seen two ways
that we can use simple Ruby to replace our datastore hits. Next I wanna talk about how you can use your ActiveRecord framework to replace your datastore hits. This past year at Kenna, we sharded our main MySQL database. And when we did, we
chose to do it by client. So each client's data lives
on its own shard database. To help us accomplish this, we chose to use the Octopus Sharding Gem. This gem gives us this
handy dandy using method, which when passed a database name, has all the logic it needs to know how to talk to that database. Because our information
is divided by client, we created the sharding
configuration hash. Which tells us what client
belongs on what sharded database. Then each time we make a MySQL request, we take the client ID, we pass it to that sharding configuration hash, in order to get the database
name that we need to talk to.
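Sketched out with made-up shard names and models, that lookup goes roughly like this:

```ruby
# Hypothetical sketch -- the hash contents and model names are illustrative.
SHARDING_CONFIGURATION = {
  1 => :shard_one, # client_id => sharded database name
  2 => :shard_two
}

def vulnerabilities_for(client)
  database_name = SHARDING_CONFIGURATION[client.id]

  # Octopus's `using` method points the query at that client's database
  Octopus.using(database_name) do
    Vulnerability.where(client_id: client.id).to_a
  end
end
```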
Given that we have to access the sharding configuration hash every single time we make a MySQL request, our first thought was, "Why
not just store it in Redis?" Because Redis is fast and
the configuration hash we wanna store is relatively small. It was small at first, but eventually that configuration
hash grew and grew, as we added more and more clients. Now 13 kilobytes might not
seem like a lot of data, but if you're asking
for 13 kilobytes of data millions of times, it can add up. In addition to your
growing configuration hash, we were also continually
increasing the number of background workers that we had working, so that we could increase
our data throughput. Until eventually we had 285 workers chugging along at once. Now remember, every single
time one of these workers makes a MySQL request, it
first has to go to Redis to get that 13 kilobyte
configuration hash. It all quickly added up
until we were reading 7.8 megabytes per second from Redis, which we knew was not gonna be sustainable as we continued to grow and add clients. When we started trying to figure out how we were gonna solve this problem, one of the first things we decided to do was take a look at
ActiveRecord's connection object. ActiveRecord's connection object is where ActiveRecord stores all
the information it needs to know how to talk to your database. So naturally, we thought
it might be a good place to find somewhere to store
our configuration hash. So we jumped into a
console to check it out and when we did, what we found was not an ActiveRecord connection object at all. Instead it was this Octopus Proxy Object that our Octopus Sharding Gem had created. This was a complete surprise to us and we immediately started digging into our gem's source code,
trying to figure out where the heck this Octopus
Proxy Object had come from. And when we finally found
that Octopus Proxy Object, much to our delight, it
already had all these great helper methods that we could use to access our sharding configuration. Boom. Problem solved.
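In a console, that discovery looks something like this (treat it as a sketch, since the exact helper method names depend on the gem version):

```ruby
# The connection isn't a plain ActiveRecord connection at all:
ActiveRecord::Base.connection.class
# => Octopus::Proxy

# And the proxy already knows about the shards it was configured with,
# via helpers along the lines of:
ActiveRecord::Base.connection.shard_names
```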
Rather than having to hit Redis every single time we made a MySQL request, all we simply had to do was talk to our local ActiveRecord Connection Object. One of the big things we learned
from this whole experience was how important it is to know your gems. It is crazy easy to include
a gem in your gem file, but when you do, make sure you have a general understanding of how it works. I'm not saying you need to
go and read the source code for every one of your gems, because that would be insane
and it would take you forever. But consider this. The next time you add
a gem to your gem file, maybe set it up manually
the first time in a console so you can see what is happening and how it's being configured. If we'd had a better understanding of how our Octopus Sharding
Gem was configured, we could have avoided this
entire Redis headache. However, regardless of where the solution came from, yet again, caching locally, in this case using our
ActiveRecord Framework as a cache is always gonna be faster and easier than making an external request. These are three great
strategies that you can use to help replace your datastore hits. Now I wanna shift gears and talk about how you can use Ruby and
Rails to avoid making datastore hits you don't need. I'm sure some of you are
looking at this thinking, "Pfft, duh, I already
know how to do that." But let's hold up for a minute, because this might not be
as obvious as you think. For example, how many of you
have written code like this? Come on, I know you're out there. 'Cause I know I've written code like this. There you go, awesome. This code looks pretty good, right? If there's no user IDs
then we're gonna skip all of this user processing. So it's fine, right? Fortunately that assumption is false, it's not fine. Let me explain why. Turns out, if you
execute this where clause with an empty array,
you're actually going to be hitting the database when you do. Notice this where 1=0 statement. This is what ActiveRecord uses to ensure no records are returned. Sure, it's a fast one millisecond query, but if you're executing this
query millions of times, it can easily overwhelm your
database and slow you down. So how do we update this chunk of code to make our Site Reliability
Engineers love us? You have two options. The first is by not
running that MySQL lookup unless you absolutely have to. And you can do that by doing an easy, peasy array check using Ruby. By doing this, you can save yourself from making a worthless datastore hit and ensure that your database is not gonna be overwhelmed with useless calls. In addition to not
overwhelming your database, this is also going to speed up your code. Say you're running this
chunk of code 10,000 times. It's gonna take you over half a second to make that useless
MySQL lookup 10,000 times. If instead you add that
simple line of Ruby to avoid making that MySQL request, and you run a similar
block of code 10,000 times. Less than a hundredth
of a second to do it. As you can see, there is
a significant difference between hitting MySQL
unnecessarily 10,000 times and running plain old Ruby 10,000 times. And that difference can
add up and have an impact on the performance of your application. A lot of people will look
at that top chunk of code and their first inclination is like, "Pfft, what are you
gonna do, Ruby's slow?" But that couldn't be
further from the truth. Because as we just saw,
this simple line of Ruby is hundreds of times faster. In this case, Ruby is not slow. Hitting the database is what's slow. Keep an eye out for situations
like these in your code where it might be making
a database request you don't expect. And I'm sure some of you Rails
folks are probably thinking, "Not exactly writing code like this." Actually I chained a bunch of
scopes to my work clause so... (audience laughing) I have to pass that empty array otherwise my scope chain breaks. Thankfully, even though
ActiveRecord doesn't handle empty arrays well, it
does give you an option for handling empty scopes. And that is the none scope. None is an ActiveRecord query method that returns a chainable
relation with zero records, but more importantly it does it without querying the database. So let's see this in action. We know from before if we execute that where clause with our empty array, we're going to hit the
database when we do. And we're gonna do it with
all of our scopes attached. If instead, we replace that where clause with the none scope, boom, we're no longer hitting the
database and all of our scopes still chain together successfully.
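Putting both options side by side in a hypothetical sketch, where `active` and `recent` are made-up scopes:

```ruby
# Option 1: a plain Ruby guard, so the query never runs for an empty array.
users = user_ids.any? ? User.where(id: user_ids) : []

# Option 2: User.none keeps the chain intact without touching the database.
users = user_ids.any? ? User.where(id: user_ids) : User.none
users.active.recent # scopes still chain off none
```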
Be on the lookout for tools like these in your gems and frameworks that will allow you to work smarter with empty data sets. And even more importantly, never, ever assume your
library, gem or framework is not making a database request when asked to process an empty data set. 'Cause you know what
they say about assuming. Ruby has so many easily
accessible libraries and gems, but their ease of use can lull you into a sense of complacency. Once again, when you're
working with a library or a gem or a framework, make sure you have a general understanding of how it works under the hood. One of the easiest ways to
gain a better understanding is through logging. Set your logging to debug for your framework, your gem, and every one of your related services.
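In a Rails app, that could look something like this (illustrative snippets, not a prescription):

```ruby
# config/environments/development.rb
Rails.application.configure do
  config.log_level = :debug
end

# Or straight from a console session, send SQL logging to stdout:
ActiveRecord::Base.logger = Logger.new($stdout)
```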
When you're done, load some application pages, run some background workers, even jump in a console
and run some commands. Afterwards look at the
logs that are produced. Those logs are going to tell you a lot about how your code is
interacting with your datastores. And some of it might not be
interacting how you would think. I cannot stress enough
how valuable something as simple as reading logs can be when it comes to making
optimizations in an application and finding useless datastore hits. Now this concept of preventing
useless datastore hits doesn't just apply to MySQL. It can apply to any datastore
you're working with. At Kenna, we end up
using Ruby to prevent datastore hits to
MySQL, Redis and Elasticsearch and here's how we did that. Every night at Kenna, we build these beautiful, intricate
reports for our clients from all their asset
and vulnerability data. These reports start
with a reporting object, which holds all the logic needed to know what assets and vulnerabilities
belong to a report. Every night, to build that
beautiful reporting page, we have to make over 20
calls to Elasticsearch and multiple calls to Redis and MySQL. My team and I did a lot of work to ensure all these calls were very fast. But it was still taking
us hours every night to build the reports. Till eventually we had so
many reports in our system that we couldn't finish
them all overnight. Clients were literally
getting up in the morning and the reports weren't
ready, which was a problem. My team and I, when we
started trying to figure out how we were gonna solve this issue, the first thing we did
was decide to take a look at what data our existing
reports contained. First thing we decided to look at, how many reports are in our system? Over 25,000. That was a pretty healthy number for us, considering only a few months earlier, we had only had 10,000. The next thing we decided to look at was how big are these reports? A report's size directly
depends on the number of assets a report contains. The more assets in a report, the longer it's gonna take
us to build that report. We thought maybe we could
split these reports up by size somehow, to speed up processing. So we looked at the average
asset count per report. Just over 1,600. Now if you remember back to the beginning of the presentation, I mentioned that our average
client has 60,000 assets. So when we saw this 1,600 number we thought that seemed pretty low. The next thing we decided to look at was how many of these
reports have zero assets? Woo, over 10,000. Over a third of our
reports have zero assets. And if they have zero assets, that means they contain no data. And if they contain no data, then what is the point of making all these Elasticsearch, MySQL and Redis calls when we know they're gonna return nothing? Light bulb. Don't hit the datastores
if the report is empty.
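In spirit, the guard is just something like this hypothetical sketch (the method names are illustrative):

```ruby
def build_report(report)
  return if report.asset_count.zero? # the database guard

  # ...20+ Elasticsearch calls, plus the Redis and MySQL lookups...
end
```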
By adding a simple line of Ruby to skip the reports that had no data, we took our processing time from over 10 hours down to three. That simple line of
Ruby was able to prevent a bunch of worthless datastore hits, which in turn sped up our
processing tremendously. This strategy of using Ruby to prevent useless datastore hits, I like to refer to it as
using database guards. In practice, it's super simple. But I think it's one of the easiest things to overlook when you're writing code. We're almost there. This last story I have for you actually happened pretty recently. So you remember those Resque workers I talked about at the
beginning of the presentation? As I mentioned, they run
with the help of Redis. One of the main things we used Redis for is to throttle these Resque workers. Given our sharded database set up, we only ever want a set number of workers to work on a database at any given time. Because what we've found in the past is that too many workers
working on a database would overwhelm it and slow it down. So to start we pointed 45
workers at each database. After making all of these improvements that I just mentioned, our
databases were pretty happy, so we decided why not bump
up the number of workers in order to increase our data throughput? So we increased the number of
workers to 70 on each database and of course, we kept
a close eye on MySQL. But it looked like all our
hard work had paid off. MySQL was still happy as a clam. My team and I, at this point, we were pretty darn proud of ourselves so we celebrated for the rest of the day. But it didn't last for long, 'cause as we learned earlier, often when you put one fire out, you start one somewhere else. MySQL was happy, but then overnight we got a Redis high traffic alert. And when we looked at
our Redis traffic graphs, we saw at times, we were reading over 50 megabytes per second from Redis. So that 7.8 from earlier? That's not lookin' so bad now. This load was being caused
by the hundreds of thousands of requests we were making trying to throttle these workers, which you can see on
this RDS request graph. Basically before any
worker can pick up a job, it first has to talk to Redis to figure out how many workers are already working on that database. If 70 workers are already
working on the database, the worker will not pick up the job. If it's less than 70, then it
knows it can pick up the job. All of these calls to
Redis were overwhelming it, and it ended up causing a lot
of errors in our application like this Redis connection error. Our application and Redis were literally dropping important requests because Redis was so overwhelmed with all of these throttling requests that we were making to it. Now given what we had previously learned from all of our experiences, our first thought was, "Okay, how do we use Ruby or
Rails to solve this issue? "Could we cache the worker
state in Resque somehow? "Could we maybe cache it in ActiveRecord?" Unfortunately, after
pondering this problem for a few days, no one on the team came up with any great suggestions. So we did the easiest
thing we could think of. We removed the throttling completely. And when we did the result was dramatic. There was an immediate
drop in Redis requests being issued, which was a huge win. But more importantly, those
Redis Network traffic spikes that we had been seeing overnight. They were completely gone. Following the removal of all the requests, all those application errors
that we had been seeing, resolved themselves. Following the throttling removal, we of course, kept a close eye on MySQL but it was still happy as a clam. So the moral of the story here is sometimes you need to use Ruby or Rails to replace your datastore hits. Sometimes you need to use them to prevent your datastore hits. Other times, you might
just need to straight up remove the datastore
hits you no longer need. This is especially
important for those of you who have fast growing and
evolving applications. Make sure you're
periodically taking inventory of all the tools you're using to ensure that they're still needed. And anything you don't
need, get rid of it. 'Cause it might save your
datastore a whole lotta headache. As you all are building and
scaling your applications, remember these five tips. And more importantly that
every datastore hit counts. It doesn't matter how fast it is. If you multiply it by a million, it's gonna suck for your datastores. You wouldn't just throw dollar
bills in the air, would you? Because a single dollar bill is cheap. Don't throw your datastore hits around, no matter how fast they are. Make sure every external request
your application is making is absolutely necessary, and I guarantee your Site
Reliability Engineers will love you for it. And with that, my job here is done. Thank you all so much for
your time and attention. Does anyone have any questions? (audience clapping) All right, five minutes. (audience clapping) So we've got five minutes, so if anyone has any general questions. I'll also be available afterwards
if anyone wants to chat. Excellent. So the question was, were
we able to downgrade our RDS instance after we
decreased the number of hits we were making to MySQL? And the answer is we did not. We left it at the size it was, 'cause I'm sure at some point we're gonna hit another bottleneck and then we'll have to do
this whole exercise again. But we've kept it at the same size that it was at the beginning. Thanks. Great question. So the question is are
there any cache busting lessons that we learned
from the RDS example? We learned a bunch just
from all of these examples and one of which I think
is the most important: set a default cache expiration. Rails gives you the ability to do this in your configuration files.
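For example, something along these lines (the store choice and the number are illustrative, not what Kenna uses):

```ruby
# config/environments/production.rb
config.cache_store = :redis_cache_store, { expires_in: 30.days }
```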
We did not do this from the beginning, and so at one point we had keys that had been laying around for five plus years. Set a default so that everything, at some point, will expire. And then two, finding
that ideal expiration, how long a cache should live, it takes some tweaking. Take your best guess and set it and then observe what
your load looks like, how clients are reacting to data. Do they think it's stale, do they not? And then from there, tweak it. That's what we have found, is every time we set a cache expiration, we always go back and tweak it afterwards. Okay so the question is, did we have any issues taking
what we were storing in Redis and then now storing it
in our local memory cache? And the answer is no. In this case, the cache is so small it's literally just this
client matches to this name and the majority of the time,
the size of that hash, was only five, 10 keys. And so it was a very small hash and obviously the payoff was super big. So in that particular case, we have not run into issues with that. Anyone else? Thanks guys. (audience clapping) (upbeat music)