How Postgres could index itself

Captions
Good afternoon, everyone. I really think one of the next big advancements in database technology will be automatic indexing. Imagine if Postgres could tune itself based on your workload. I think it's much closer than you might imagine, so today I want to walk you through two different approaches I've taken to make this happen, both of which are available for you to try on your own systems today.

First, a little bit about me. I work for a grocery delivery startup called Instacart; we do same-day grocery delivery from all your local stores, in as little as an hour. In my free time I do a lot of open source work. You can find me on GitHub under ankane, and one of my more popular projects is PgHero, a performance dashboard for Postgres. You hook it up to your database and it tells you when there are common issues and helps you figure out how to resolve them, so it can be really helpful for types of issues you haven't seen before.

But enough about me; let's talk about indexing. Why do we add indexes? We want to increase read performance, to speed up our queries. That comes at the expense of write performance, since inserts, updates, and deletes get a bit slower, and indexes take up space, both on disk and in memory when they're hot.

Right now we're all doing what I like to call manual indexing, and the reason I call it that is that it reminds me of cars. When you're driving a manual car, you're the one responsible for shifting the gears to make the car go faster. With an automatic transmission, we've taught a computer how to do that: it takes in input from its environment and figures out how to shift based on it. What's the same in both scenarios, whether you have a manual or an automatic transmission, is the underlying mechanism. I think we can apply this to the database. The index is a great underlying mechanism; what we need to change is who's adjusting that mechanism, who's operating it, going from having a human do it to teaching a computer how to do it.

In the first approach I tried, I wanted to mimic what I was doing manually when adding indexes to a database, which looked something like this: first collect the queries, to figure out what queries are running, then analyze them, and from there figure out the best indexes to add.

To collect the queries, there's a great extension I'm sure many of you are familiar with called pg_stat_statements. It keeps track of the most time-consuming queries on your database, and you get something that looks like this: the query, the total time it has been running, and the number of calls, and with a little bit of division you can calculate the average time per call, which is what we're most interested in for this approach. If we look at the first query, its average time is just 0.5 milliseconds; it executes very quickly, so adding an index really isn't going to help, and here we'd probably want to reduce the number of calls instead. The second query has far fewer calls but a much higher average time, so it's likely a very good candidate for an index.
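As a minimal sketch, that check might look like the query below; the column names match Postgres 12 and earlier, which is what the talk assumes (on Postgres 13+ total_time was renamed to total_exec_time).

```sql
-- A minimal sketch: the most time-consuming queries and their average
-- time per call. On Postgres 13+, use total_exec_time instead.
SELECT query,
       calls,
       total_time,
       total_time / calls AS avg_time_ms
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```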
Let's say the candidate query is this one right here, a very simple query: SELECT * FROM products WHERE store_id = 1. If we want to add an index, we need to pull two pieces of information out of the query: the table and the column to index. So how do we teach a computer to do that? There's a great open source project called pg_query that lets you do exactly that. You give it a query and it produces a parse tree, and from the parse tree, with a little bit of work, you can pull out the tables and columns used in that query. The project does this in a really neat way: it statically links against the Postgres parser, and the advantage is that it can parse any query that Postgres can parse. Going back to our query, we can pull out products as the table name and store_id as the column name, and now we have the information we need to add an index.

Now let's make it a little more complicated and add a second condition, brand_id = 2. Now we have a few more choices for an index to add: store_id again, brand_id, or a multicolumn index on a combination of the two. Let's put multicolumn indexes aside for a second and consider store_id versus brand_id. To make a good decision, we first have to understand what the underlying data looks like. Here, stores have many products, so a single store will have a lot of different products in it, while brands have just a few products associated with them.

Now let's get an intuition about which column is better to index by thinking about how the database will actually execute this query. If we add an index on store_id, the database first traverses the index, which is a tree structure, to find the location of all the rows with store_id 1, then fetches those rows (that's what you can see at the top of the slide, the rows in yellow), and then filters on any other conditions, here brand_id = 2, and we end up with the rows in green, the result of the query. If we add an index on brand_id, it's a similar situation: the database traverses the index, finds and fetches all the rows with brand_id 2 (the ones at the bottom in yellow), filters again, and we end up with the same data in green. The results of the query are the same, so which is faster? The answer is pretty straightforward: whichever plan makes the database do less work. And in which case did the database do less work? The second one, where it had to fetch fewer tuples, so adding an index on brand_id is the right decision here.

How do we teach a computer to do this? Luckily, Postgres stores information about the data in your columns and exposes it through the pg_stats view. The two pieces of information we're interested in for each column are the number of distinct values and the fraction of values that are null.
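For reference, a minimal sketch of pulling those two statistics for the example products table might look like this (the table name comes from the talk's running example):

```sql
-- A minimal sketch: the two column statistics this approach relies on.
-- Note: n_distinct is negative when Postgres stores it as a fraction
-- of the row count rather than an absolute count.
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'products';
```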
Let's see how we can apply those. Say our products table has 100,000 rows, every product has a store_id (so none of those values are null), and there are 100 distinct stores. If we query on a given store_id, we should expect to get about 1,000 rows back; this is what's called row estimation. Now look at brand_id. It's on the same table, products, so it has the same number of rows, but this time let's say 10% of the rows don't have a brand for one reason or another, and there are 9,000 distinct brands. Here we'd expect to get just about ten rows back on average. So which is going to be faster, fetching ten rows or a thousand rows? Fetching ten rows.

Now let's say our data set is a little bigger, a thousand times bigger: instead of 100,000 products we have 100 million, and the other statistics stay the same. This time we'd expect to get about a million rows back if we query on a given store_id, and about 10,000 rows for a given brand_id. Ten thousand rows is still a decent number of rows, so can we do better? Can we make the database fetch even fewer rows? This is where multicolumn indexes come in: the database can traverse an index on multiple columns, so it has to fetch less. What we do here is first find that brand_id is the better choice for the first column of the index, then apply the row estimation approach again: we start with the 10,000 rows and apply it to any other columns that are also in the query, this time just store_id, and we end up with only about 100 rows if we have a multicolumn index. Fetching 100 rows is going to be much faster than fetching 10,000, so we'd like a multicolumn index on (brand_id, store_id).

Let's take a step back for a moment and look at some other queries we might want to optimize. Here's another common shape of query: an ORDER BY clause with a LIMIT. Here, adding an index on created_at is going to speed up the query. And of course we can combine the two, so here we have a WHERE clause as well as an ORDER BY clause. What we'll do first is take the approach from before, using row estimation on the WHERE clause to see how many rows come back. If it's a very small number of rows, we can stop there, because the database can sort those rows in memory and return a result pretty quickly. If the number of rows is large, we again want to add a multicolumn index, this time on (store_id, created_at).

So this is the first approach, and it's the approach you actually get if you use PgHero and turn on suggested indexes: there's a page that shows your most time-consuming queries, looks for indexes for those queries, and reports them to you. Here you can see that it found two indexes for the most time-consuming queries.

Now, there are a ton of shortcomings to this approach, and I want to talk about just a few of them. One is that it only works on simple queries against a single table, with a simple WHERE clause and a simple ORDER BY clause. That's far from all the queries you want to optimize, and specifically it doesn't work with joins. We're also duplicating planner logic, and I use the word "duplicating" very loosely: people have spent years and years optimizing the Postgres planner, and this is just a very simple heuristic that tries to get at some of what it does. One reason the planner can make even better decisions is that it uses more of the statistics that are available. In this approach we just used the first two listed, but the pg_stats view provides many other statistics, and I want to spend a little bit of time talking about what those are.
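As a reference point, a minimal sketch of pulling those richer statistics for a couple of the example columns could look like this (the column names follow the talk's running example):

```sql
-- A minimal sketch of the richer statistics discussed next:
-- most common values, their frequencies, and histogram bounds.
SELECT attname, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'products'
  AND attname IN ('store_id', 'quantity');
```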
Postgres stores what it estimates to be the most common values for every column, along with their frequencies. The pg_stats view is updated whenever you run the ANALYZE command, so it's not continuously updated, but ANALYZE typically runs often enough with autovacuum that you don't need to worry about it. Here we can see that the most common values in our store_id column are 2, 5, and 1, with frequencies of 90 percent, 5 percent, and 1 percent. The reason Postgres keeps this is that, depending on the value you query on, the database will change its execution plan. If we query on store_id 1, just one percent of the rows meet that criteria, so the database will want to use an index, whereas if we query on store_id 2, 90 percent of the rows match, so the database will prefer a table scan because that's going to be faster. The database changes its execution plan based on the actual values you're querying on and what the underlying data looks like, which is pretty neat.

Another interesting field is histogram_bounds. The database sorts all of the data in your column, slices it into chunks, and stores the bounds of those chunks. That's really useful for range queries. If we're looking for all the products with quantity less than 5, the ones that are running out of stock, we can see that 5 falls in the first bucket, between 0 and 9, so very few rows will meet that criteria, and again the database will want to do an index scan. Whereas if quantity is greater than 5, that will cover a lot of the table, so the database will do a table scan.

Another shortcoming of this approach is that when we query pg_stat_statements, we often get something back where the value has been replaced with a question mark. It does this because it normalizes queries, grouping together many queries that are similar, but in the process we lose a bit of data, and as you saw before, the values you're querying on really do matter for speeding up queries.

And that brings me to v2. Version two takes a very different approach, and its main goal is to address a lot of the shortcomings we just talked about in version one. Starting with the collection process: instead of using pg_stat_statements, version two uses your log file, and it does this to get the full text of queries. There's a setting called log_min_duration_statement you can turn on to have Postgres log all the slow queries on your database. From that we can parse the logs and aggregate the data, very similar to pg_stat_statements, but now there are two differences: we have the actual text of each query, not a query with a question mark in it, and we have multiple samples, so we can use many different samples to try to optimize our indexes.
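As a minimal sketch, turning on that logging might look like this; the one-second threshold is just an example value, so pick one that fits your workload and log volume.

```sql
-- Log every statement that runs longer than one second.
-- The threshold here is an arbitrary example; tune it per workload.
ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();
```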
Another big shortcoming is that we're duplicating planner logic. If we think about what the planner does at a high level, it takes in a query and a set of indexes and determines the best indexes to use for that query; it's limited to the indexes you've created in the past. What we'd really like is for it to consider all possible indexes, because if we can get it to do that, it can then give us the best possible indexes to use. So how do we make this happen? Luckily, there's a hook in Postgres called get_relation_info_hook. Someone added this hook about ten years ago for a project called Index Advisor, and its description is right in the comment in the source code: it allows a plugin to editorialize on the info we obtained from the catalogs; actions might include altering the assumed relation size, removing an index, or adding a hypothetical index to the indexlist. That last part is what we're most interested in: we want to add hypothetical indexes to make the database think that all indexes are there. And as luck would have it, there's another great open source project called HypoPG that lets you create hypothetical indexes. It gives you commands to create them, and then you can run EXPLAIN on your queries and it will tell you if a hypothetical index would be used. It's not actually creating the indexes, so they don't take up any space. The neat thing about this extension is that you can install it without restarting your database, so you don't have to worry about downtime, and the hypothetical indexes are ephemeral: when you disconnect from the database they're gone, and they don't affect any other sessions, they're limited to the current connection. This was done in a very nice way.

So let's revisit the query from before with this new approach. First we run EXPLAIN on the query, and what we're really interested in is the final cost. You can see here that it's 1,000, and you can also see that we're doing a sequential scan on products, filtering on the two conditions in the query. We take that cost and add it to a table. Now we add some hypothetical indexes: we look up all the columns in the query and add hypothetical indexes on them, in this case store_id and brand_id, and run EXPLAIN again. Again we're interested in the final cost, this time 50, but we also want to see whether one of the hypothetical indexes we created was used, and here you can see that one was; if we look at the index condition, it's the brand_id index. We add that to our table as well. Then we do this one more time, with multicolumn indexes. With multicolumn indexes the order matters, so we add both (store_id, brand_id) and (brand_id, store_id) and run EXPLAIN again; for the sake of this example, let's say we end up with a final cost of 45, with a hypothetical index on (brand_id, store_id) being used. Now that we have all this data, the question is which index to add. We can see that adding a single-column index reduced the cost significantly, while the multicolumn index only reduced it slightly, so in this case we'd just add the single-column index, because of the trade-offs that come with adding more indexes.
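Here is a minimal sketch of that HypoPG workflow, using the table and columns from the running example (HypoPG has to be installed on the server for CREATE EXTENSION to succeed):

```sql
-- Enable the extension in the current database (no restart needed).
CREATE EXTENSION IF NOT EXISTS hypopg;

-- Create a hypothetical index; it takes no disk space and is
-- visible only to the current session.
SELECT * FROM hypopg_create_index('CREATE INDEX ON products (brand_id)');

-- Plain EXPLAIN (without ANALYZE) now considers the hypothetical index;
-- compare the estimated cost and look for the index in the plan.
EXPLAIN SELECT * FROM products WHERE store_id = 1 AND brand_id = 2;

-- Discard all hypothetical indexes in this session when done.
SELECT hypopg_reset();
```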
This approach is also available as an open source project, one specifically for indexing, called Dexter. The way Dexter works is that you give it a log file, and the recommended way is to stream your log file into Dexter so it can give you real-time suggestions for indexes. You also give it a database connection so it can create the hypothetical indexes and run the EXPLAINs.

Here are three useful options for using Dexter. The first is the create flag. By default, Dexter doesn't actually create any indexes; it just prints them out for you to create yourself, so it acts more as an advisor. Once you do have some trust in Dexter, once you've seen that it actually works on your specific workload, you can pass the create flag and it will actually create the indexes. It creates them concurrently, so it won't lock any of your tables while it's doing its work. Another option is exclude: there are certain tables you may want to exclude from automatic indexing. One example might be very large tables, because the indexes might take up a lot of space on disk or in memory, so you don't want to index those automatically. You may also have tables that are very write-heavy, where you don't want to take the performance penalty of adding more indexes, and you can pass those to the exclude option as well. A third option is min time. You may be running some one-off queries on your database, and you probably don't want those queries to be indexed, so you can use the min time option to specify a minimum amount of time for queries to have run. This lets you index only the queries that are taking up a lot of CPU cycles on your database.

Now let's look at some of the shortcomings of this approach. Say here are two queries we've collected and analyzed: SELECT * FROM products WHERE a = 1 AND b = 2, and SELECT * FROM products WHERE b = 2. If we analyze the first query, we might find that a is the best column to index for it, and for the second one we'll find that b is the best column. It might be the case that a is only slightly better than b for the first query, and if that's actually the case, we don't want to add two indexes; we just want one on b, because we'll get most of the benefit of indexing from having that one index. The approach right now is very greedy: it processes queries one at a time, whereas what we're really looking for is more of a global optimum, because we're trying to optimize across all the queries on our database. In that scenario, what we can do is figure out the top N indexes for a given query, say the top five, and once we have that, optimize across all of the different queries on the database, using the cost reduction each index gives as a kind of weight in this optimization problem.

Another limitation is that right now it only works with B-tree indexes; however, support could easily be extended to other index types in the future. It also doesn't work with expression indexes or partial indexes, and partial indexes are actually where I think some of the more interesting work will be in the future. If you're not familiar with partial indexes, they let you index just a subset of your table: you specify a WHERE condition and have an index on just that particular condition, and rows that don't meet the condition won't be in your index. This lets you get very fine-grained with your indexes; you could actually create many very small indexes that are specifically tuned for your workload, for the values you're actually querying on. One example: you're looking for out-of-stock products, products where quantity equals zero. If you're running this query often, you probably don't need to index all of the products on quantity, because you're probably not doing any other queries on quantity, and if you are, a query like quantity greater than zero is going to do a table scan anyway, because it covers a lot of different products. So in this case you could just create a partial index where quantity equals zero and have a very small index containing only the products that meet that criteria, so it can be very specifically tuned to your workload.
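A minimal sketch of such a partial index, using the out-of-stock example from the talk; the choice of store_id as the indexed column is an assumption for illustration, since the key point is the WHERE clause that keeps the index tiny:

```sql
-- A small partial index covering only out-of-stock products.
-- The indexed column (store_id) is an assumption; the predicate is
-- what keeps the index limited to the rows you actually query.
CREATE INDEX CONCURRENTLY products_out_of_stock_idx
    ON products (store_id)
    WHERE quantity = 0;

-- Example query it can serve:
-- SELECT * FROM products WHERE store_id = 1 AND quantity = 0;
```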
Another shortcoming is that we're adding indexes but not removing any; we're not doing any cleanup. One easy thing to do would be to remove unused indexes. Postgres has statistics on index usage, so you can easily get those and figure out which indexes you're not actively using. One caveat with this approach is that you have to consider the replicas as well, because indexes exist on both the primary and all replicas, so you need to gather the usage statistics from every database in your cluster and combine them before you actually remove any indexes.
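A minimal sketch of that check on a single node might look like the query below; run it on the primary and each replica and combine the results, and keep indexes that back unique or primary key constraints even if they show zero scans.

```sql
-- Indexes on this node that have never been used by an index scan,
-- largest first. Check every replica as well before dropping anything,
-- and keep indexes that enforce unique or primary key constraints.
SELECT schemaname,
       relname       AS table_name,
       indexrelname  AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```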
Another limitation right now is extension support. If you run your own self-hosted Postgres, it's very easy to install Dexter and whatever extensions you'd like; however, if you're with a hosted Postgres provider, many of them support only a limited number of extensions, and with some of the biggest Postgres providers, HypoPG is not currently one of those supported extensions. What you can do, and what we do at Instacart, is spin up a self-hosted instance on a regular cadence, say once a month, pull all of your data into that self-hosted instance, download the log files, and run Dexter there for index suggestions. It's far from ideal, you lose out on the real-time aspect of it, but it's one way to see which columns to index. If any of you here work for one of those hosted Postgres providers and get a little bit inspired by this talk, it would be great if you could help make that happen.

So now we have a solid way for the database to index itself, and with a bit more work I think we could actually get an approach like this to ship with Postgres. Before I wrap up, I wanted to say a big thanks to pg_query and HypoPG; these are two really great open source projects, and without them this work would have been impossible. Finally, I wanted to invite everyone here to get involved: please go download Dexter, try it out on your own workload, give feedback, and contribute to the project, and together we can make manual indexing a thing of the past. Thank you. Questions?

There hasn't been any work so far for those particular cases, but I think that's a great place to work on in the future.

The question is what the performance impact is, and whether there are any concerns around the amount of log volume this would produce. Certainly, depending on what your workload is, it's something you have to be aware of. What you can do is try adjusting the value of log_min_duration_statement accordingly.

The question is how fast Dexter can process log files, and I think that's just something you'd have to try out on your own workload. With what we're logging at Instacart, it can get through a couple of days' worth of log files in usually a couple of minutes, but we're certainly not logging that much data. That's another thing that could definitely be tuned over time, even if it's just rewriting some of the code that parses those things out.

How many indexes have we created at Instacart? At Instacart we use it more in an advisory capacity, so the tool itself hasn't actually created indexes, but we've found probably 20 or more indexes, over the couple of times we've run it, that we decided to add.

What it ends up doing is grouping all of the similar queries together based on a hash of their parse tree, and then the way I'll typically use it is to specify the minimum amount of time we want a query group to have run, the minimum amount of CPU time that all of those queries together have taken, and from there it basically tries to optimize all of those.

Sorry, the question was: if you have queries that run relatively infrequently but take up a lot of time, could you be over-indexing for those particular queries? Certainly, and I think that's where it's really up to you to tune the minimum amount of time you specify. You could also exclude specific queries; that could be some future work for the project, say for your analytic queries, where you put a comment or something in those particular queries. That could be another way to approach it in the future; it's not there right now, but it would certainly be easy to add.

The question is whether we've tried to determine the real cost of actually adding an index, and the answer is not right now. That could be another area for future work: really trying to optimize on the total database time you're saving.

Can you say that again? The question is how it handles bind variables in your logs, and the answer is that unfortunately it doesn't right now, because there's no way to get the values for the query, and with this approach you actually need the full query, because you have to run an EXPLAIN on it, so you need queries with values in them.

Right now it doesn't try to combine them, but with more work that could be possible.

In the back: the question is whether there's any thought around using pg_stat_statements to replace the log parser, and the answer is yes, that's actually an option; you can pass an option for it to use pg_stat_statements. For the times when it doesn't return a normalized query, you can get the full text back. I'm not actually sure when you get the normalized query; we grab from pg_stat_statements every five minutes at Instacart, and some of the time the queries get normalized and some of the time they don't, so if you're running it like that, you can usually get a pretty good sample of queries that have not been normalized.

Other questions? I think that work can definitely fit into the global optimization approach, really trying to figure out what the best indexes are overall if you're trying to limit the actual number of indexes you're adding, so I think that makes sense. Anything else? Cool, well, thanks everybody.
Info
Channel: Postgres Open
Views: 3,743
Rating: 4.94 out of 5
Id: Mni_1yTaNbE
Length: 36min 31sec (2191 seconds)
Published: Wed Sep 27 2017