(instrumental music) - Okay, so we're gonna
go ahead and get started, thank you everyone for coming, my name is Bryana Knight, and
I'm an engineer at GitHub. I work on the platform data team, which among some other projects is responsible for reviewing
every single migration PR that comes through github/github, so we usually see at least a couple a day, and we're also responsible for fixing any performance problems that we have within our application. So what's this talk about? Today, we're gonna go
over some best practices that aren't always best
practices after all. So a quick overview, first and foremost, we're gonna talk about database indexes, so what are they, and how do they work? Database indexing is
fundamental to performance. Then we'll dive into some
common rules of indexing, querying, and data modeling,
talk about what those are, when and why do they break down, what are the tools you can use to understand what's going on, and then how do you move forward when you can't rely on
those rules anymore? So what even is an index? Well, an index is a copy
of selected columns of data from a table that can be searched
quickly and efficiently. And the reason why they can
be searched more quickly is because an index is sorted. An index has as many rows as the table, so maybe that's something
you didn't realize, that there is one row in the index for every row in your table. So you can really think
of indexes and tables as the same thing, but you
can get to the data you need in an index faster, and then look up whatever else you need in the main table, knowing exactly where to look. So, I think really to understand indexes, we need to take a step
back and think of this in terms of a real-life example. So, hypothetically, I'm a cat owner, and I have this great big book called The Ultimate Cat Book, which contains a ton of great information about cats. But one day, my cat gets sick, and I'm pretty sure my cat is having some sort of an allergic reaction. So not knowing where to go in this book to find information about cat allergies, I might have to flip through
every single page in this book, checking whether or not that page has the information I
need about allergies. But luckily, this book has an index, and this index is alphabetically sorted, so I can go to the A section and see that the cat allergies
chapter starts on page 88. And with that information, I
can flip directly to that page in The Ultimate Cat Book,
read up on cat allergies, and save my dying cat. (chuckling) So on a very basic level,
this is the same way that database indexes work. So indexes are used to filter data, for most queries, we just want a subset of the data in our table,
so we often can use an index to get that information. Indexes can be used to sort data. So as I mentioned, indexes are sorted. Indexes can also be used
to retrieve records, so sometimes, the index will
have all of the information you need, and you actually
don't have to reference back to the main table to get information. And at the end of the day,
what indexes provide us is a way to get to the smallest set of data we need as soon as possible. So let's take a database
example this time. We have a repositories
table here that has an ID, which is the primary key,
the name of the repository, and the ID of the owner
of that repository. And we have a sample query where we want all of the repositories where the owner ID is greater than 500. Without any index, we'd have to do a full table scan. What that looks like is scanning through every record in the table, asking whether or not the owner ID is greater than 500. That would take a very long time, and it's exactly what we're trying to avoid with our usage of indexes.
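As a minimal sketch, here's roughly what that table and query might look like in MySQL (the schema is illustrative, not GitHub's actual one):

    CREATE TABLE repositories (
      id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name     VARCHAR(255)    NOT NULL,
      owner_id BIGINT UNSIGNED NOT NULL
    );

    -- With no index on owner_id, MySQL has to check every row:
    SELECT * FROM repositories WHERE owner_id > 500;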
Looking at another example, we have the same repositories table and the same query, but now we have an index over owner ID. As you'll notice, the index is sorted based on owner ID, which is the left-most column in the index (in this case, the only column in the index). And you also might notice
that the ID is appended onto the right of the index. So with this index, we can go directly to the part of the index where owner ID becomes larger than 500, and then we can look back in the main table to get the rest of the information, knowing exactly where to look. This is what makes an index faster. Essentially, our repositories table becomes our Ultimate Cat Book, and our index over owner ID is just like the index of that book.
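Creating that index might look like this (the index name is just an assumed convention; with InnoDB, the primary key is appended to every secondary index entry, which is the ID you see on the right of each index row):

    ALTER TABLE repositories ADD INDEX index_repositories_on_owner_id (owner_id);

    -- Now this seeks straight to the first entry with owner_id > 500:
    SELECT * FROM repositories WHERE owner_id > 500;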
Looking at one more example: in this case, we just want the name of the repositories where owner ID is greater than 500, and now we have an index over both owner ID and name. You'll notice that the
index is still sorted based on owner ID, but it
also includes the name field, as well as the ID appended on to the end. In the same way, we can go directly into the portion of the index that
has the information we want, but this time, since the
index includes the name field, we have everything we need, and we don't have to reference
back to the main table. So this is a situation in which an index alone can be used to retrieve data, and if you have a situation like this, it will be super efficient.
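A sketch of that covering setup, under the same assumed schema:

    ALTER TABLE repositories
      ADD INDEX index_repositories_on_owner_id_and_name (owner_id, name);

    -- The index holds owner_id, name, and the appended id, so this query can be
    -- answered from the index alone (EXPLAIN shows 'Using index'):
    SELECT name FROM repositories WHERE owner_id > 500;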
A few more things about indexes. Index contents are used left to right: as I mentioned, indexes are sorted based on the left-most column. However, if you have duplicate values in that column, it will then sort those records based on the next column in the index. So here we have two repositories with an owner ID of 745; those are then ordered based on the name, since name is also included in our index. And if we had an example here where we wanted the repositories where the name is 'rails,' well, name is not the left-most column in our index, so the index is really not sorted based upon the name, and we actually couldn't use the index to resolve that query. So now that we have a better understanding of how indexes work, we're gonna move on to some common rules of indexing, querying, and data modeling, talk about what they are, when and why they break down, what are some tools you can use to understand what's going on, and then how do you move forward? So our first rule is: any
columns involved in queries should be covered by an index. So let's take a look
at a sample query here, we want all the labels where
the label's repository ID is either four, five, or six. So looking at this query,
your first thought might be to add an index over repository ID, and that way we can get
to the records faster, since that index will be
sorted based on repository ID. And that's a great instinct,
so we're gonna go ahead and add an index over repository ID. A bit later on, we might add to this query and say we want the labels
with a repository ID of four, five, or six, and
with the name of 'feature.' So your first instinct might be now to add an index over
repository ID and name, and once again, that's a great thought, but now our index over
repository ID is redundant. So what is a redundant index? It's basically an index
that's already contained within another index. As you can see here, both of these indexes are sorted based on repository ID, as that's the left-most column in each of them. So any query that's only concerned with the repository ID can actually use the index over repository ID and name to satisfy that query, and we no longer need the index that's over repository ID only.
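Sketched out (the labels schema and index names are assumptions on my part):

    SELECT * FROM labels WHERE repository_id IN (4, 5, 6);
    SELECT * FROM labels WHERE repository_id IN (4, 5, 6) AND name = 'feature';

    -- The composite index makes the single-column one redundant, because
    -- repository_id is its left-most column:
    ALTER TABLE labels ADD INDEX index_labels_on_repository_id_and_name (repository_id, name);
    ALTER TABLE labels DROP INDEX index_labels_on_repository_id;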
So when are we breaking this rule? Again, our rule was: any columns involved in queries should be covered by an index. We're not really breaking it, we're just adding a caveat: unless there's already a covering index. So why does this matter? Our selects are still gonna be super fast, the query's just gonna use one index or the other, so what's the harm? Well, indexes take up space, and that's something we
always wanna be aware of. And most importantly, adding an index is going to slow down
UPDATEs, INSERTs, and DELETEs, because if you think about this logically, since you're storing that information in more than one place, if
that information changes, it's gonna need to be updated in all of those places as well. So even though we get a
performance improvement on selects, over-indexing can lead
to performance problems for other operations. We care so much about this at GitHub that we have a ChatOps
command built in to our Slack, so that command is .mysql dead-indexes, and it just takes the name of a table, and it will tell us if there are any redundant or unused
indexes for that table. So our rule was, any
columns involved in queries should be covered by an index. Just make sure you're avoiding
unused or redundant indexes. So we're gonna add on to that rule, and talk about index prefixes. So sometimes it's actually
better or necessary to use an index prefix
as opposed to a full index. So what's an index prefix? Here we have an index over repository name, and now we have an index prefix over repository name, and you might notice that that index only includes the first five characters of the name of the repository. So an index prefix is exactly that: it indexes just a subset of the data within a column.
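In MySQL, that's just a length on the indexed column (index names assumed):

    -- A full index over name, versus a prefix index over its first five characters:
    ALTER TABLE repositories ADD INDEX index_repositories_on_name (name);
    ALTER TABLE repositories ADD INDEX index_repositories_on_name_prefix (name(5));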
So let's take a look at another sample query. This one is a bit more complex; you don't need to worry about understanding it, it's just asking for the recent code reviewers for a certain code path. Looking at this query, the
columns that stand out to us are repository_id, path, and created_at. So, once again, our first
instinct would probably be to add an index over all of those columns. So before doing that,
it's important to analyze the table that you're
adding an index over, so once again, a bit
of a plug for ChatOps: we have a built-in command, .mysql table, and it will give us some great information about a table that we have in our database, including its size, the columns included, and any indexes we already have. Looking at the columns in our table, we can see that the path column is of the datatype varbinary(1024), so that's like, really scary. It's a really large column, and we're actually not gonna be able to add an index over it, 'cause it's gonna violate the limit on the size of an index. So, when are we gonna break this rule? Our rule is, any columns involved in queries should be covered by an index, but we're actually not gonna cover them with a regular index,
we're gonna cover them with an index prefix. So in this situation, we
have a longer datatype. Another situation in
which you might wanna use an index prefix is when querying the data works well with a prefix. So maybe you have a query, give me all the usernames
that start with A. Well, we really don't care
about the rest of the username, we just care about the first
character of the username, so we could put an index
over that first character, reference back to the main table
and get everything we need. So how do we know how
long to make the prefix? It's a great question. Essentially, you just
wanna make it long enough to differentiate values within this field. So if we think back to our sample query, the column in question was path, and that's just basically a file path. If every single file in your repository starts with 'lib/', and your index prefix is only over the first four characters of that column, that index is not gonna do us any good. So we need to make it long enough to differentiate the
values within a field, and it's a good idea to base it off of real data, if possible. So for the query in question, a PR was opened to add an index over that column, and the developer said that he chose 255 bytes as the prefix because the longest path in github/github is 199 bytes. So he took a pretty large repository, evaluated that data, and came up with the length of 255; then he was able to put that prefix right into the migration that added the index, and we got the coverage that we needed.
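The resulting index would look something like this (the table name is hypothetical; the columns and the 255-byte prefix come from the talk):

    ALTER TABLE reviews
      ADD INDEX index_reviews_on_repository_id_path_created_at
      (repository_id, path(255), created_at);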
So what do we gain from using an index prefix? Well, first off, once again, space: less space is required for the index. We gain the ability to index larger datatypes that perhaps we couldn't index before. And we can see a performance improvement comparable to a full index, if the length is calculated thoughtfully and we're smart about it. So our rule was: any
columns involved in queries should be covered by an index. Just make sure you're
looking out for redundant or unused indexes, and
using an index prefix when it's appropriate or necessary. So our second rule is, use
an OR to return records satisfying one or more
of several conditions. I can guess that probably a lot of you have some queries with OR clauses in them in your application. So we'll take a look at
another sample query: we want the max ID for users that are either spammy or suspended. As we can see, this query
takes about a minute to run, so that makes us really sad, we're definitely gonna
need to speed that up. So generally, the first thing
I do when I see a slow query is I run an EXPLAIN on that query. And what that will do is it'll give me some great information
about how this query is being executed. The columns that I'm paying
attention to are possible_keys, which is just possible indexes; this will include any indexes that involve columns
that are in your query. And then the key column itself
is telling you which index was actually chosen to be
used for the query itself. Rows is the number of rows
that needed to be scanned, and Extra will give you some extra information, like 'Using where', 'Using index', or 'Using filesort', things like that.
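As a sketch, running that EXPLAIN might look like this (the spammy and suspended_at columns are from the talk; the exact predicates are assumptions):

    EXPLAIN SELECT MAX(id) FROM users
    WHERE spammy = 1 OR suspended_at IS NOT NULL;
    -- key: NULL  (no index chosen), rows: roughly the whole table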
Another plug for ChatOps: we have this built in at GitHub, so any developer, or anyone, really, can run an EXPLAIN on a query in Slack and get the EXPLAIN output, and this really encourages all developers that aren't necessarily
on the database team to really understand what
their queries are doing, so they don't just come
to us first for help, they can look into it themselves. So as we can see here, the key is null, so it's not using any index
for this query whatsoever. So let's take a look at what
were the possible indexes that might have been used,
so we have index over spammy, and we have an index over suspended_at. So those are the exact columns
in question of our query. However, the use of the OR is limited, because MySQL can only
use one index per table during a query, so if we were
gonna use one or the other, we would be excluding rows
that should be returned to resolve this query. So when do we break this
rule about using an OR? Well, first off, the table
being queried isn't small, so a full table scan isn't performant. So maybe you have a query like this, and it runs fast enough, so
this isn't an issue for you. But I would guess that's not the case for most of the tables in your databases. And in this situation, the use of the OR is preventing any index from being used. So instead of adding an index,
we're gonna tune our query. We're gonna use a UNION, which basically splits the query up into two queries that can each exploit the right index, then UNIONs those results, and grabs the max ID from there.
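A sketch of that rewrite (column names are assumptions):

    SELECT MAX(id) FROM (
      SELECT id FROM users WHERE spammy = 1
      UNION
      SELECT id FROM users WHERE suspended_at IS NOT NULL
    ) AS candidates;
    -- Each branch can now use its own index (spammy or suspended_at).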
So this query runs in 11 milliseconds, it's way better, and we're a lot happier. One thing to note: Postgres has an index merge feature (it can combine multiple indexes using bitmap scans), so your query planner might recognize that it has two indexes it can use, merge them, and then use that, so maybe this is fast already. One thing to consider, though, is that this still requires more work in the process of resolving the query, so it actually might be more efficient to use the UNION anyway. So our rule was: use
an OR to return records satisfying one or more
of several conditions. Except when the OR is preventing
an index from being used and you have to do a full table scan, it might make sense to tune
your query to use a UNION. So our third rule is, if there's an index over all the fields in your
queries, you're all set, have a great day, there's
nothing else you can do. (chuckles) So let's take a look at
another fun sample query here: we want all the commit comments for this repository, ordered by ID. Once again, we can see this query is running pretty slowly,
it's taking about 40 seconds. So we'll run an EXPLAIN
and examine what's going on under the hood when this query is run, and we can see that we
have some possible indexes that we're gonna potentially use, but the query planner is
actually choosing to use the primary key index. So why isn't it using this index over repository ID and user ID? And if you think about this query, all we really care about
is the repository ID, so you would think it would go to that index, which is sorted based on repository ID as the left-most column, and then, after it gets those records, do a sort on them afterwards. It's not doing that, and so
this makes us really mad. What's happening is that since
we have an order by ID clause at the end of our query, MySQL thinks it's better to scan the primary key index just so that the rows
are generated in order and it doesn't have to do a sort. This is a valid thought,
it's just the wrong decision. So we go from being mad to sad, because we did everything
right, our query is sound, we can't really change
it, we have a good index, it's just still slow. So the situation is such
that MySQL is choosing to use an index that isn't performant, or no index at all when one is available. So we gotta help MySQL out! It's done a lot for us, and the only thing we really can do now is
just provide a helping hand. So we're gonna use an index hint. In this case, we're using FORCE INDEX, which basically tells MySQL exactly which index to use to resolve this query, and now it runs in a millisecond. It's way faster, so we're happy, and MySQL is happy.
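Roughly what that hint looks like (the table, index name, and repository ID are stand-ins):

    SELECT * FROM commit_comments
    FORCE INDEX (index_commit_comments_on_repository_id_and_user_id)
    WHERE repository_id = 1234
    ORDER BY id;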
You have some other options here. You could use USE INDEX, which is more of a suggestion, or IGNORE INDEX, which explicitly tells MySQL which index not to use, but then lets it make the decision afterwards about what's the next best thing. There are some gotchas here; future-proofing is a concern. Say you have a query forcing a specific index, and then later on that index is deleted: you definitely have to be aware of that. And Postgres doesn't have any index hints, so, sorry. So our rule was: if there's
an index over all the fields in your query, you're all set, except when the query planner
doesn't know to use the index, in which case, help it out. So our fourth and final
rule, and this is a big one, is to avoid redundant data across tables. This is something that we're taught in terms of data modeling and data storage: it's a good idea to have all your tables follow the single-responsibility principle and only contain information in regards to the subject matter of that table. So we have a sample database here; it's a small one, with just a pull requests table, a repositories table, and a gists table. We also have a users
table, which only contains user information, and you'll notice that there's a column here called spammy, and that just denotes whether or not a user is a spammy user. So in order to satisfy some
really fundamental requests to GitHub, we have to join
across multiple tables. For example, if we wanted a group of pull requests, we'd wanna make sure we're excluding any pull
requests opened by spammy users, so we have to do a join
across these two tables, we're basically reading
information from two places. What if we need a group of repositories, except those owned by a spammy user? Once again, that information
is in two places, we have to do a complex
join with those two tables. Or if we just needed the gists for a user, but we needed to make sure that user isn't a spammy user, once again, we need to do a join across these tables.
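One of those reads, sketched as a minimal example (the schema is illustrative):

    SELECT pull_requests.*
    FROM pull_requests
    JOIN users ON users.id = pull_requests.user_id
    WHERE pull_requests.repository_id = 1234
      AND users.spammy = 0;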
So this is really adding up performance-wise. So when are we gonna break this rule about not having redundant data across tables? Well, first, when additional reads, or JOINs, are causing noticeable performance degradation. This was really adding up, and we were seeing really slow queries just to satisfy these really fundamental questions. We also have a high
ratio of reads to writes, and this is a really important point. So, we basically needed
to check for spamminess on almost every request to
GitHub.com for content like this. So that number is really high. And conversely, the amount of times that we're marking a user
as spammy is much lower in relation to that read number, so we have a high ratio of reads to writes in this situation. So we're gonna decide to have
redundant data across tables: we're gonna denormalize this data. So we're gonna add a column to all of these tables that just denotes that a user is a spammy user (we actually folded suspended into that as well), so now we only have to check one place to get this information, and it's much faster. So a query that used to look like that, pretty complicated, with a LEFT OUTER JOIN and an OR clause in there, is highly simplified, and much, much faster.
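After denormalizing, the read might look something like this (user_hidden is a hypothetical name for the denormalized spammy-or-suspended flag):

    SELECT * FROM pull_requests
    WHERE repository_id = 1234
      AND user_hidden = 0;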
So this is the performance improvement that we saw after denormalizing this data, and I can tell you we did it across a lot of tables, not just the three I mentioned. This is just the rate of GET requests for pull requests and repositories via our API, and as you can see, the response time is way better. So, once again, nothing comes for free; there are some trade-offs
that you have to consider before making a decision to
denormalize data like this. First and foremost, there's gonna be a lot of changes to your
database right away, so we're gonna have migrations to add those columns
for all of those tables, and don't forget to add indexes
for those columns as well. We also had to do a lot
of data transitioning to backfill that data
for existing records. Another thing to think
about is data quality. So now that we're storing that information in multiple places, that information needs to be kept up to
date and consistent. The way that we handled this was with some background jobs, and a nightly cron job to resolve any mismatches, and so forth; but you definitely need to think about the pros and the cons before making a big decision like this. So our rule was: avoid
redundant data across tables. That's generally a good rule, but when you're JOINing on another table for almost every request and it's getting costly, and you have a high ratio of reads to writes for the data in question, it's probably a good idea to consider data denormalization. So, recap. Our first rule was: any columns involved in queries should be covered by an index. Make sure you're avoiding
redundant or unused indexes, and sometimes an index
prefix is either necessary or good enough. Use an OR to return records satisfying one or more of several conditions. A UNION might be necessary
to exploit the indexes. Our third rule was, if there's an index over all the fields in
your query, you're all set. If MySQL is making the wrong choice, help it make the right one. And avoid redundant data across tables, except when reads are slow and the read-to-write ratio is high; then you can think about denormalizing. So, some takeaways. Index, but don't over-index: nothing comes for free,
indexes take space, and they slow down INSERTs,
UPDATEs, and DELETEs, so that's something you wanna think about. Tune your query to exploit the indexes that give you the best results. So sometimes it's not
about adding an index, it's about changing your
query to use the right one. There are tools to help you, I encourage you to run
EXPLAIN on your queries and try to understand how
they're being processed, and make sure you examine existing columns, tables, and indexes before you make any changes. You can do everything right and still have performance problems; that's kind of the point of this talk. You can follow all these rules and still see problems, so you need to really understand what's going on under the hood, and get creative in certain situations. So, we are hiring on
the platform data team, if any of this interests you, or you have some experience with it, I'd love to talk to you. You can use that link
if you wanna check out the job posting. Thank you! (clapping)