(instrumental music) - Okay, so we're gonna
go ahead and get started, thank you everyone for coming, my name is Bryana Knight, and
I'm an engineer at GitHub. I work on the platform data team, which among some other projects is responsible for reviewing
every single migration PR that comes through github/github, so we usually see at least a couple a day, and we're also responsible for fixing any performance problems that we have within our application. So what's this talk about? Today, we're gonna go
over some best practices that aren't always best
practices after all. So a quick overview, first and foremost, we're gonna talk about database indexes, so what are they, and how do they work? Database indexing is
fundamental to performance. Then we'll dive into some
common rules of indexing, querying, and data modeling,
talk about what those are, when and why do they break down, what are the tools you can use to understand what's going on, and then how do you move forward when you can't rely on
those rules anymore? So what even is an index? Well, an index is a copy
of selected columns of data from a table that can be searched
quickly and efficiently. And the reason why they can
be searched more quickly is because an index is sorted. An index has as many rows as the table, so maybe that's something
you didn't realize, that there is one row in the index for every row in your table. So you can really think
of indexes and tables as the same thing, but you
can get to the data you need in an index faster, and then look up whatever else you need in the main table, knowing exactly where to look. So, I think really to understand indexes, we need to take a step
back and think of this in terms of a real-life example. So, hypothetically, I'm a cat owner, and I have this great big book called The Ultimate Cat Book, which contains a ton of great information about cats. But one day, my cat gets sick, and I'm pretty sure my cat is having some sort of an allergic reaction. So not knowing where to go in this book to find information about cat allergies, I might have to flip through
every single page in this book, checking whether or not that page has the information I
need about allergies. But luckily, this book has an index, and this index is alphabetically sorted, so I can go to the A section and see that the cat allergies
chapter starts on page 88. And with that information, I
can flip directly to that page in The Ultimate Cat Book,
read up on cat allergies, and save my dying cat. (chuckling) So on a very basic level,
this is the same way that database indexes work. So indexes are used to filter data, for most queries, we just want a subset of the data in our table,
so we often can use an index to get that information. Indexes can be used to sort data. So as I mentioned, indexes are sorted. Indexes can also be used
to retrieve records, so sometimes, the index will
have all of the information you need, and you actually
don't have to reference back to the main table to get information. And at the end of the day,
what indexes provide us is a way to get to the smallest set of data we need as soon as possible. So let's take a database
example this time. We have a repositories
table here that has an ID, which is the primary key,
the name of the repository, and the ID of the owner
of that repository. And we have a sample query where we want all of the repositories where the owner ID is greater than 500. Without any index, we'd have to do a full table scan. What that looks like is scanning through every record in the table, asking whether or not the owner ID is greater than 500. That would take a very long time, and it's exactly what we're trying to avoid with our usage of indexes.
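As a minimal sketch, here's roughly what that table and query might look like in MySQL (the schema is illustrative, not GitHub's actual one):

    CREATE TABLE repositories (
      id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name     VARCHAR(255)    NOT NULL,
      owner_id BIGINT UNSIGNED NOT NULL
    );

    -- With no index on owner_id, MySQL has to check every row:
    SELECT * FROM repositories WHERE owner_id > 500;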
Looking at another example, we have the same repositories table and the same query, but now we have an index over owner ID. As you'll notice, the index is sorted based on owner ID, which is the left-most column in the index (in this case, the only column in the index). And you also might notice
that the ID is appended onto the right of the index. So with this index, we can go directly to the part of the index where owner ID becomes larger than 500, and then we can look back in the main table to get the rest of the information, knowing exactly where to look. This is what makes an index faster. Essentially, our repositories table becomes our Ultimate Cat Book, and our index over owner ID is just like the index of that book.
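Creating that index might look like this (the index name is just an assumed convention; with InnoDB, the primary key is appended to every secondary index entry, which is the ID you see on the right of each index row):

    ALTER TABLE repositories ADD INDEX index_repositories_on_owner_id (owner_id);

    -- Now this seeks straight to the first entry with owner_id > 500:
    SELECT * FROM repositories WHERE owner_id > 500;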
Looking at one more example: in this case, we just want the name of the repositories where owner ID is greater than 500, and now we have an index over both owner ID and name. You'll notice that the
index is still sorted based on owner ID, but it
also includes the name field, as well as the ID appended on to the end. In the same way, we can go directly into the portion of the index that
has the information we want, but this time, since the
index includes the name field, we have everything we need, and we don't have to reference
back to the main table. So this is a situation in which an index alone can be used to retrieve data, and if you have a situation like this, it will be super efficient.
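A sketch of that covering setup, under the same assumed schema:

    ALTER TABLE repositories
      ADD INDEX index_repositories_on_owner_id_and_name (owner_id, name);

    -- The index holds owner_id, name, and the appended id, so this query can be
    -- answered from the index alone (EXPLAIN shows 'Using index'):
    SELECT name FROM repositories WHERE owner_id > 500;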
A few more things about indexes. Index contents are used left to right: as I mentioned, indexes are sorted based on the left-most column. However, if you have duplicate values in that column, it will then sort those records based on the next column in the index. So here we have two repositories with an owner ID of 745; those are then ordered based on the name, since name is also included in our index. And if we had an example here where we wanted the repositories where the name is 'rails,' well, name is not the left-most column in our index, so the index is really not sorted based upon the name, and we actually couldn't use the index to resolve that query. So now that we have a better understanding of how indexes work, we're gonna move on to some common rules of indexing, querying, and data modeling, talk about what they are, when and why they break down, what are some tools you can use to understand what's going on, and then how do you move forward? So our first rule is: any
columns involved in queries should be covered by an index. So let's take a look
at a sample query here, we want all the labels where
the label's repository ID is either four, five, or six. So looking at this query,
your first thought might be to add an index over repository ID, and that way we can get
to the records faster, since that index will be
sorted based on repository ID. And that's a great instinct,
so we're gonna go ahead and add an index over repository ID. A bit later on, we might add to this query and say we want the labels
with a repository ID of four, five, or six, and
with the name of 'feature.' So your first instinct might be now to add an index over
repository ID and name, and once again, that's a great thought, but now our index over
repository ID is redundant. So what is a redundant index? It's basically an index
that's already contained within another index. As you can see here, both of these indexes are sorted based on repository ID, as that's the left-most column in each of them. So any query that's only concerned with the repository ID can actually use the index over repository ID and name to satisfy that query, and we no longer need the index that's over repository ID only.
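Sketched out (the labels schema and index names are assumptions on my part):

    SELECT * FROM labels WHERE repository_id IN (4, 5, 6);
    SELECT * FROM labels WHERE repository_id IN (4, 5, 6) AND name = 'feature';

    -- The composite index makes the single-column one redundant, because
    -- repository_id is its left-most column:
    ALTER TABLE labels ADD INDEX index_labels_on_repository_id_and_name (repository_id, name);
    ALTER TABLE labels DROP INDEX index_labels_on_repository_id;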
So when are we breaking this rule? Again, our rule was: any columns involved in queries should be covered by an index. We're not really breaking it, we're just adding a caveat: unless there's already a covering index. So why does this matter? Our selects are still gonna be super fast, the query's just gonna use one index or the other, so what's the harm? Well, indexes take up space, and that's something we
always wanna be aware of. And most importantly, adding an index is going to slow down
UPDATEs, INSERTs, and DELETEs, because if you think about this logically, since you're storing that information in more than one place, if
that information changes, it's gonna need to be updated in all of those places as well. So even though we get a
performance improvement on selects, over-indexing can lead
to performance problems for other operations. We care so much about this at GitHub that we have a ChatOps
command built in to our Slack, so that command is .mysql dead-indexes, and it just takes the name of a table, and it will tell us if there are any redundant or unused
indexes for that table. So our rule was, any
columns involved in queries should be covered by an index. Just make sure you're avoiding
unused or redundant indexes. So we're gonna add on to that rule, and talk about index prefixes. So sometimes it's actually
better or necessary to use an index prefix
as opposed to a full index. So what's an index prefix? Here we have an index over repository name, and now we have an index prefix over repository name, and you might notice that that index only includes the first five characters of the name of the repository. So an index prefix is exactly that: it indexes just a subset of the data within a column.
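In MySQL, that's just a length on the indexed column (index names assumed):

    -- A full index over name, versus a prefix index over its first five characters:
    ALTER TABLE repositories ADD INDEX index_repositories_on_name (name);
    ALTER TABLE repositories ADD INDEX index_repositories_on_name_prefix (name(5));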
So let's take a look at another sample query. This one is a bit more complex; you don't need to worry about understanding it, it's just asking for the recent code reviewers for a certain code path. Looking at this query, the
columns that stand out to us are repository_id, path, and created_at. So, once again, our first
instinct would probably be to add an index over all of those columns. So before doing that,
it's important to analyze the table that you're
adding an index over, so once again, a bit
of a plug for ChatOps: we have a built-in command, .mysql table, and it will give us some great information about a table that we have in our database, including its size, the columns included, and any indexes we already have. Looking at the columns in our table, we can see that the path column is of the datatype varbinary(1024), so that's like, really scary. It's a really large column, and we're actually not gonna be able to add an index over it, 'cause it's gonna violate the limit on the size of an index. So, when are we gonna break this rule? Our rule is, any columns involved in queries should be covered by an index, but we're actually not gonna cover them with a regular index,
we're gonna cover them with an index prefix. So in this situation, we
have a longer datatype. Another situation in
which you might wanna use an index prefix is when querying the data works well with a prefix. So maybe you have a query, give me all the usernames
that start with A. Well, we really don't care
about the rest of the username, we just care about the first
character of the username, so we could put an index
over that first character, reference back to the main table
and get everything we need. So how do we know how
long to make the prefix? It's a great question. Essentially, you just
wanna make it long enough to differentiate values within this field. So if we think back to our sample query, the column in question was path, and that's just basically a file path. If every single file in your repository starts with 'lib/', and your index prefix is only over the first four characters of that column, that index is not gonna do us any good. So we need to make it long enough to differentiate the
values within a field, and it's a good idea to base it off of real data, if possible. So for the query in question, a PR was opened to add an index over that column, and the developer said that he chose 255 bytes as the prefix because the longest path in github/github is 199 bytes. So he took a pretty large repository, evaluated that data, and came up with the length of 255; then he was able to put that prefix right into the migration that added the index, and we got the coverage that we needed.
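The resulting index would look something like this (the table name is hypothetical; the columns and the 255-byte prefix come from the talk):

    ALTER TABLE reviews
      ADD INDEX index_reviews_on_repository_id_path_created_at
      (repository_id, path(255), created_at);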
So what do we gain from using an index prefix? Well, first off, once again, space: less space is required for the index. We gain the ability to index larger datatypes that perhaps we couldn't index before. And we can see a performance improvement comparable to a full index, if the length is calculated thoughtfully and we're smart about it. So our rule was: any
columns involved in queries should be covered by an index. Just make sure you're
looking out for redundant or unused indexes, and
using an index prefix when it's appropriate or necessary. So our second rule is, use
an OR to return records satisfying one or more
of several conditions. I can guess that probably a lot of you have some queries with OR clauses in them in your application. So we'll take a look at
another sample query: we want the max ID for users that are either spammy or suspended. As we can see, this query
takes about a minute to run, so that makes us really sad, we're definitely gonna
need to speed that up. So generally, the first thing
I do when I see a slow query is I run an EXPLAIN on that query. And what that will do is it'll give me some great information
about how this query is being executed. The columns that I'm paying
attention to are possible_keys, which is just possible indexes; this will include any indexes that involve columns
that are in your query. And then the key column itself
is telling you which index was actually chosen to be
used for the query itself. Rows is the number of rows
that needed to be scanned, and Extra will give you some extra information, like 'Using where', 'Using index', or 'Using filesort', things like that.
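As a sketch, running that EXPLAIN might look like this (the spammy and suspended_at columns are from the talk; the exact predicates are assumptions):

    EXPLAIN SELECT MAX(id) FROM users
    WHERE spammy = 1 OR suspended_at IS NOT NULL;
    -- key: NULL  (no index chosen), rows: roughly the whole table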
Another plug for ChatOps: we have this built in at GitHub, so any developer, or anyone, really, can run an EXPLAIN on a query in Slack and get the EXPLAIN output, and this really encourages all developers that aren't necessarily
on the database team to really understand what
their queries are doing, so they don't just come
to us first for help, they can look into it themselves. So as we can see here, the key is null, so it's not using any index
for this query whatsoever. So let's take a look at what
were the possible indexes that might have been used,
so we have index over spammy, and we have an index over suspended_at. So those are the exact columns
in question of our query. However, the use of the OR is limited, because MySQL can only
use one index per table during a query, so if we were
gonna use one or the other, we would be excluding rows
that should be returned to resolve this query. So when do we break this
rule about using an OR? Well, first off, the table
being queried isn't small, so a full table scan isn't performant. So maybe you have a query like this, and it runs fast enough, so
this isn't an issue for you. But I would guess that's not the case for most of the tables in your databases. And in this situation, the use of the OR is preventing any index from being used. So instead of adding an index,
we're gonna tune our query. We're gonna use a UNION, which basically splits the query up into two queries that can each exploit the right index, then UNIONs those results, and grabs the max ID from there.
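A sketch of that rewrite (column names are assumptions):

    SELECT MAX(id) FROM (
      SELECT id FROM users WHERE spammy = 1
      UNION
      SELECT id FROM users WHERE suspended_at IS NOT NULL
    ) AS candidates;
    -- Each branch can now use its own index (spammy or suspended_at).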
So this query runs in 11 milliseconds, it's way better, and we're a lot happier. One thing to note: Postgres has an index merge feature (it can combine multiple indexes using bitmap scans), so your query planner might recognize that it has two indexes it can use, merge them, and then use that, so maybe this is fast already. One thing to consider, though, is that this still requires more work in the process of resolving the query, so it actually might be more efficient to use the UNION anyway. So our rule was: use
an OR to return records satisfying one or more
of several conditions. Except when the OR is preventing
an index from being used and you have to do a full table scan, it might make sense to tune
your query to use a UNION. So our third rule is, if there's an index over all the fields in your
queries, you're all set, have a great day, there's
nothing else you can do. (chuckles) So let's take a look at
another fun sample query here: we want all the commit comments for this repository, ordered by ID. Once again, we can see this query is running pretty slowly,
it's taking about 40 seconds. So we'll run an EXPLAIN
and examine what's going on under the hood when this query is run, and we can see that we
have some possible indexes that we're gonna potentially use, but the query planner is
actually choosing to use the primary key index. So why isn't it using this index over repository ID and user ID? And if you think about this query, all we really care about
is the repository ID, so you would think it would go to that index, which is sorted based on repository ID as the left-most column, and then, after it gets those records, do a sort on them afterwards. It's not doing that, and so
this makes us really mad. What's happening is that since
we have an order by ID clause at the end of our query, MySQL thinks it's better to scan the primary key index just so that the rows
are generated in order and it doesn't have to do a sort. This is a valid thought,
it's just the wrong decision. So we go from being mad to sad, because we did everything
right, our query is sound, we can't really change
it, we have a good index, it's just still slow. So the situation is such
that MySQL is choosing to use an index that isn't performant, or no index at all when one is available. So we gotta help MySQL out! It's done a lot for us, and the only thing we really can do now is
just provide a helping hand. So we're gonna use an index hint. In this case, we're using FORCE INDEX, which basically tells MySQL exactly which index to use to resolve this query, and now it runs in a millisecond. It's way faster, so we're happy, and MySQL is happy.
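Roughly what that hint looks like (the table, index name, and repository ID are stand-ins):

    SELECT * FROM commit_comments
    FORCE INDEX (index_commit_comments_on_repository_id_and_user_id)
    WHERE repository_id = 1234
    ORDER BY id;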
You have some other options here. You could use USE INDEX, which is more of a suggestion, or IGNORE INDEX, which explicitly tells MySQL which index not to use, but then lets it make the decision afterwards about what's the next best thing. There are some gotchas here; future-proofing is a concern. Say you have a query forcing a specific index, and then later on that index is deleted: you definitely have to be aware of that. And Postgres doesn't have any index hints, so, sorry. So our rule was: if there's
an index over all the fields in your query, you're all set, except when the query planner
doesn't know to use the index, in which case, help it out. So our fourth and final
rule, and this is a big one, is to avoid redundant data across tables. This is something that we're taught in terms of data modeling and data storage: it's a good idea to have all your tables follow the single-responsibility principle and only contain information in regards to the subject matter of that table. So we have a sample database here; it's a small one, with just a pull requests table, a repositories table, and a gists table. We also have a users
table, which only contains user information, and you'll notice that there's a column here called spammy, and that just denotes whether or not a user is a spammy user. So in order to satisfy some
really fundamental requests to GitHub, we have to join
across multiple tables. For example, if we wanted a group of pull requests, we'd wanna make sure we're excluding any pull
requests opened by spammy users, so we have to do a join
across these two tables, we're basically reading
information from two places. What if we need a group of repositories, except those owned by a spammy user? Once again, that information
is in two places, we have to do a complex
join with those two tables. Or if we just needed the gists for a user, but we needed to make sure that user isn't a spammy user, once again, we need to do a join across these tables.
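One of those reads, sketched as a minimal example (the schema is illustrative):

    SELECT pull_requests.*
    FROM pull_requests
    JOIN users ON users.id = pull_requests.user_id
    WHERE pull_requests.repository_id = 1234
      AND users.spammy = 0;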
So this is really adding up performance-wise. So when are we gonna break this rule about not having redundant data across tables? Well, first, when additional reads, or JOINs, are causing noticeable performance degradation. This was really adding up, and we were seeing really slow queries just to satisfy these really fundamental questions. We also have a high
ratio of reads to writes, and this is a really important point. So, we basically needed
to check for spamminess on almost every request to
GitHub.com for content like this. So that number is really high. And conversely, the amount of times that we're marking a user
as spammy is much lower in relation to that read number, so we have a high ratio of reads to writes in this situation. So we're gonna decide to have
redundant data across tables: we're gonna denormalize this data. So we're gonna add a column to all of these tables that just denotes that a user is a spammy user (we actually folded suspended into that as well), so now we only have to check one place to get this information, and it's much faster. So a query that used to look like that, pretty complicated, with a LEFT OUTER JOIN and an OR clause in there, is highly simplified, and much, much faster.
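After denormalizing, the read might look something like this (user_hidden is a hypothetical name for the denormalized spammy-or-suspended flag):

    SELECT * FROM pull_requests
    WHERE repository_id = 1234
      AND user_hidden = 0;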
So this is the performance improvement that we saw after denormalizing this data, and I can tell you we did it across a lot of tables, not just the three I mentioned. This is just the rate of GET requests for pull requests and repositories via our API, and as you can see, the response time is way better. So, once again, nothing comes for free; there are some trade-offs
that you have to consider before making a decision to
denormalize data like this. First and foremost, there's gonna be a lot of changes to your
database right away, so we're gonna have migrations to add those columns
for all of those tables, and don't forget to add indexes
for those columns as well. We also had to do a lot
of data transitioning to backfill that data
for existing records. Another thing to think
about is data quality. So now that we're storing that information in multiple places, that information needs to be kept up to
date and consistent. The way that we handled this was with some background jobs, and a nightly cron job to resolve any mismatches, and so forth; but you definitely need to think about the pros and the cons before making a big decision like this. So our rule was: avoid
redundant data across tables. That's generally a good rule, but when you're JOINing on another table for almost every request and it's getting costly, and you have a high ratio of reads to writes for the data in question, it's probably a good idea to consider data denormalization. So, recap. Our first rule was: any columns involved in queries should be covered by an index. Make sure you're avoiding
redundant or unused indexes, and sometimes an index
prefix is either necessary or good enough. Use an OR to return records satisfying one or more of several conditions. A UNION might be necessary
to exploit the indexes. Our third rule was, if there's an index over all the fields in
your query, you're all set. If MySQL is making the wrong choice, help it make the right one. And avoid redundant data across tables, except when reads are slow and the read-to-write ratio is high; then you can think about denormalizing. So, some takeaways. Index, but don't over-index: nothing comes for free,
indexes take space, and they slow down INSERTs,
UPDATEs, and DELETEs, so that's something you wanna think about. Tune your query to exploit the indexes that give you the best results. So sometimes it's not
about adding an index, it's about changing your
query to use the right one. There are tools to help you, I encourage you to run
EXPLAIN on your queries and try to understand how
they're being processed, and make sure you examine existing columns, tables, and indexes before you make any changes. You can do everything right and still have performance problems; that's kind of the point of this talk. You can follow all these rules and still see problems, so you need to really understand what's going on under the hood, and get creative in certain situations. So, we are hiring on
the platform data team, if any of this interests you, or you have some experience with it, I'd love to talk to you. You can use that link
if you wanna check out the job posting. Thank you! (clapping)