Look It Up: PostgreSQL Indexes

Captions
Thank you very much for coming. First of all, I'd like to thank Salesforce for hosting and providing the pizza and everything. They are hiring Postgres developers, I believe — did I get that right? Yes? Okay, close enough — so talk to the folks over there for details.

Thank you very much for coming to this installment of the San Francisco Postgres Users Group. We always need speakers and we always need hosts; it's an incredibly inexpensive way of doing recruiting for Postgres people, so present it to your employer. It's a great opportunity to speak: you can be a first-time speaker, you can be relatively new to Postgres — we love war stories and case studies — and it's also a great opportunity to try out material before you present it at a conference, or if you're new to conference presenting.

I also wanted to mention that Postgres version 12 is in beta, so please download it and test it. The Postgres community relies heavily on individual users and organizations that use Postgres to test it in real-life deployments. I wouldn't throw it in production yet, but run it in a test environment with a real workload and give your results back to the community. Forthcoming community events: PostgresOpen, an annual community conference for all things Postgres, will be September 11th through 13th in scenic Orlando, Florida.

Okay, so here's what we came for today: we're going to talk about Postgres indexing. My name is Christophe Pettus. I'm the CEO of PostgreSQL Experts; we're a small Postgres consultancy over in Alameda. We've been around ten years now — actually, this is our tenth year, oh my goodness — and I've been doing Postgres since 7.1, so that was a long time ago. There's my email address, my somewhat (sadly) neglected personal tech blog, and my Twitter account.

Okay, so, about indexes. We don't need indexes — thank you all for coming! In fact, by definition in the standard, we don't need indexes, because an index never, ever changes the results of a query. You can implement a 100% SQL-standard-compliant database with no index functionality at all; it's not required by the standard.

So why bother? Well, because O(n) is bad. We don't like O(n) operations — and we like O(n²) operations even less — but at best, without indexes, all queries are sequential scans. The database has to pick up and shake every tuple it's going to consider, decide whether it passes the predicate condition or not, and put it back down. On a 500-row table, this is fine. On a 50-million-row table, this is not fine. The whole point of an index is to change these O(n) operations into O(something better than that): an index reduces the order of the operation. O(log n), or constant order, would be great if we can pull it off.

But there's a catch, and this is something to understand when you're dealing with indexes in general, or almost any algorithm at all. People look at something and say, "oh, it's an O(log n) algorithm, this is great, that's much better than O(n)." The problem is that all of these have a constant associated with them — that pesky constant — and O notation is actually quite abused. Saying "this is an O(log n) algorithm, so it must be much better than O(n)" only means that for a sufficiently large value of n it will beat the O(n) version, and "sufficiently large" could be gigantic — it could be like the number of particles in the universe. So this is just a reminder that indexes are essential for database performance, but they don't result in speed improvements in all cases, and it's important to use indexes correctly and pick the one that will actually help your particular workload.
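(Not from the talk — a minimal sketch of that point, with a hypothetical accounts table: the same query goes from a sequential scan to an index scan once an index exists, but only when the planner expects few enough rows back.)

    -- Hypothetical table: without an index, the only option is to
    -- pick up and shake every tuple.
    CREATE TABLE accounts (id bigint, email text);

    EXPLAIN SELECT * FROM accounts WHERE email = 'x@example.com';
    --  Seq Scan on accounts
    --    Filter: (email = 'x@example.com'::text)

    CREATE INDEX accounts_email_idx ON accounts (email);
    ANALYZE accounts;

    EXPLAIN SELECT * FROM accounts WHERE email = 'x@example.com';
    --  Index Scan using accounts_email_idx on accounts
    --    Index Cond: (email = 'x@example.com'::text)
    -- (Exact plan output depends on table size and statistics.)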
That being said, let's look at Postgres's amazing indexes, because one thing that's cool about Postgres is that you have a lot of index types to choose from. There's B-tree, there's hash, GiST, GIN, SP-GiST — who here has ever used an SP-GiST index? Yeah, that's the typical result, which is a shame, because they're actually kind of cool — BRIN, and bloom indexes; bloom indexes are in contrib. What's so nice about this is that you can match your index type to a variety of workloads and a variety of querying patterns, which is really, really nice and kind of unique to Postgres. The question is: how do you know which one to use? Because there are a lot. Somebody should give a talk about that sometime. Ah — maybe that's this talk.

Okay, so let's first talk about B-tree indexes. This is the most powerful algorithm in computer science that no one knows why it's called that. Is the B for balanced, or broad, or Boeing, or bushy? Boeing is actually a candidate, because the original paper was written by people at Boeing. Is it the one that came after the A-tree? Nobody knows. They've been around since 1972 — that's when the first B-tree paper was published. This is the default index in Postgres: if you don't specify anything else, you get a B-tree index, and pretty much every database everywhere implements B-tree indexes.

Here's that graphic everyone's seen that describes how B-tree indexes work: there's a series of buckets, and the interior nodes say, okay, everything less than 7 goes there, everything between 7 and 16 goes there, and everything greater than 16 goes there. Notice that none of these are exact, because no one can ever remember whether these are inclusive or exclusive ranges. In the original B-tree paper each node had a fixed number of buckets in it, like four, but Postgres doesn't work that way.

The reason B-trees are so popular is that it's a very, very powerful algorithm. First of all, they tend to be very shallow compared to other data structures — you don't have a lot of layers in them, which is great for a database on disk, because it means fewer disk page accesses. They provide O(log n) — there's that O(log n) we wanted so badly — access to the leaf nodes, which is significantly better than O(n). And they're easy to walk in order, in both directions, especially the B-tree variant that Postgres uses, so they can do things other than just index lookups: you can do ORDER BY, you can do merge joins, you can do all sorts of cool things with them.

Unlike the standard published B-trees, they have a variable number of keys per node, since Postgres has a large number of indexable types of varying widths, and each Postgres page is 8 KB by default (and no one ever recompiles Postgres to change that). The entire key is copied into the index — which is kind of what you'd expect; you saw that 3 was literally living in the index node in that graphic. That means larger values mean fewer keys per node, and so deeper indexes: with a short key versus something wide like a UUID, you're going to get a deeper tree with the UUID, so you'll have to hit more layers in the tree.
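(A quick sketch of the default case, with a hypothetical orders table: CREATE INDEX with no USING clause gives you a B-tree, and because a B-tree is ordered, it serves range predicates and ORDER BY, not just equality.)

    -- No USING clause, so this is a B-tree.
    CREATE INDEX orders_created_at_idx ON orders (created_at);

    -- All of these can use it: equality, ranges, and ordered walks.
    SELECT * FROM orders WHERE created_at = '2019-06-01';
    SELECT * FROM orders WHERE created_at >= '2019-01-01';
    SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;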
So, okay, we're done — we've solved the problem, indexing is now a solved problem. Right?

Well, problem one is that little "the entire key is copied into the index" thing. That's not so great if you have gigantic character strings, because those are copied in their entirety into the index node. Postgres has a facility for handling large values like this called TOAST, and indexes can use TOAST just like the primary table can, but this is not efficient, so you want to avoid it. Also, B-trees require a totally ordered type, which means the type supports equality, greater-than, and less-than for all values. Not every type is totally ordered, so not every type can use a B-tree: to index a type in Postgres with a B-tree, you need total ordering.

So that's B-trees. The next index type we're going to talk about is hash. This was a long-broken feature of Postgres: hash indexes have been in Postgres for a long time, but they really didn't work very well. They weren't WAL-logged, so they weren't safe across crash recovery — the system would crash, you'd have to rebuild the index, and all this stupid stuff. Postgres 10 fixed all that, so now they are real, first-class index types.

A hash index converts the input value to a 32-bit hash code. The type defines the hashing algorithm — for integers it's just the value, of course; for character strings I don't actually remember, I think it's a CRC — and the hash points to buckets of row pointers, so you have a big old hash table as your index. It supports only one operator, which is equality: no greater-than, no less-than, no prefix searches, no nothing. But that's a pretty important operator; looking things up by equality is a really common database thing to do.

One of the nice things about hash indexes is that they're much smaller than B-trees — B-tree indexes can be very, very big — and they're especially small for large key values. If you'd otherwise be copying big keys into a B-tree, the hash index will be much smaller, and if there are few collisions, access can be faster too, because it can go directly into the hash table rather than walking as many pages. This is really good for long values where you're doing equality — long character strings, for example. Before hash indexes worked, people frequently had to hack this by indexing a substring (say, the first 20 characters of the string) and then doing an equality comparison afterwards. Now you just use a hash index — very handy. Examples: URLs, long hash values — if you're storing hash signatures, that kind of stuff. The hash table auto-sizes: it starts at 64 buckets and then doubles each time it fills up, so that can be an exciting operation when it happens.
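(A sketch of that URL case, with a hypothetical urls table; remember this wants Postgres 10 or later.)

    -- Small even when the keys are long; supports equality only.
    CREATE INDEX urls_url_idx ON urls USING hash (url);

    -- Can use the hash index:
    SELECT * FROM urls
     WHERE url = 'https://example.com/a/very/long/path?with=parameters';

    -- Cannot use it: no ordering, no prefix matching.
    -- SELECT * FROM urls WHERE url LIKE 'https://example.com/%';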
Okay: GiST indexes. GiST is actually not a specific index type; it's a framework for building indexes. The idea is that writing a core, low-level index type — which in Postgres is called an access method — is really a pain in the neck: you have to have a deep understanding of the way the WAL works, and things like that. So the idea was to provide a framework that takes care of all that for you, and you write just the minimum amount of code necessary to get your type to work. The result is that what a GiST index actually does for you depends on the type you're indexing.

For example: you have a Cartesian plane, and you have a point — a really kind of fat point, work with me here — and a box that has the point inside it. What do operators mean here? What is equality for a box and a point? Well, we could just make something up — we could say they're equal if they have a common center, or something like that — but that's not very useful. Containment, though, makes perfect sense: you can say this point is inside that box, and that's a completely reasonable operator. It's called GiST because of "generalized search tree" — yeah, okay, they had to force that one a tiny bit.

It's generally used for any type where containment or proximity is a meaningful operation. Postgres defines a lot of operators that mean things like "contains," "overlaps," "is adjacent to" — you can actually consider total ordering to be a special case of proximity. Ranges, geometric types, text trigrams: all these kinds of types make perfect sense with these kinds of operations. GiST is not as efficient as a B-tree for a classic scalar type with ordering, or for simple equality comparisons, so there's no benefit to using it where a B-tree would be perfectly suited — but for a lot of things, a B-tree just doesn't work.
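(A minimal GiST sketch using the built-in range types; the table is hypothetical. The point is that the indexable operators are containment and overlap, not less-than and greater-than.)

    CREATE TABLE reservations (room int, during tstzrange);
    CREATE INDEX reservations_during_idx
        ON reservations USING gist (during);

    -- "Which reservations overlap this window?" -- && is the overlap
    -- operator, and the GiST index can answer it.
    SELECT * FROM reservations
     WHERE during && tstzrange('2019-06-01', '2019-06-02');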
Onwards to GIN indexes. There's this kind of thing in Postgres where the Postgres Pro people over in Russia keep coming up with new index types named after alcoholic beverages — there are also RUM and VODKA indexes, whose details escape me, because most of the papers on them are in Russian, a lovely language I do not speak. They're also not part of core PostgreSQL. GIN is a "generalized inverted index" — the forced acronym wasn't quite so bad there.

The problem GIN solves is that both B-tree and GiST perform very badly if there are lots and lots and lots of identical keys; that's just not a great use case for them. In a degenerate case, if you have a 25-million-row table and three distinct keys, neither of those index types is going to be very efficient — they're going to be very bloated. And if you're doing full-text search, you have almost exactly that problem: a relatively small corpus of words and a relatively large number of records that contain them. Standard B-tree and GiST aren't going to be super helpful there. Thus, GIN indexes.

A GIN index organizes the keys themselves into a B-tree, and the leaves of that B-tree are either lists or B-trees of pointers to the rows that contain the key. So the first level of descent is to go down and match the key, and then you get a list or a B-tree of every row that contains that key. What's nice about this is that if you have a large number of identical keys, it scales very efficiently, and there are lots of use cases for it: full-text search, indexing array members, JSON keys, stuff like that. It's a very common use case.
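(A sketch of the array and JSON cases, with a hypothetical docs table; @> is the containment operator that both indexes below understand.)

    CREATE TABLE docs (id bigint, tags text[], body jsonb);
    CREATE INDEX docs_tags_idx ON docs USING gin (tags);
    CREATE INDEX docs_body_idx ON docs USING gin (body);

    -- Both of these containment queries can use the GIN indexes:
    SELECT * FROM docs WHERE tags @> ARRAY['postgres'];
    SELECT * FROM docs WHERE body @> '{"status": "active"}';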
Okay, our friend SP-GiST. SP stands for "space-partitioning" GiST. It's similar to GiST in concept — it's an indexing framework as well — but it has a different range of algorithms for partitioning than classic GiST. It's designed for situations where a classic GiST index would become highly unbalanced. We'll talk more about its use cases later; as we discovered, it's not super commonly used.

The next one we're going to talk about is BRIN indexes, a relatively new feature in Postgres. B-tree indexes can be really large — gigabytes and gigabytes of index. It's very common for the indexes on a table to be much larger than the table itself. (In Postgres, the heap is the non-index part — the actual table.) One of the things B-trees assume is that we know nothing about the correlation between the index key and where the row actually lives on disk: rows can be anywhere, it's completely random. That's the safest assumption, of course — you don't want a generalized index type that requires the rows to be sorted on disk or something like that.

But often, correlation happens as a consequence of the way the table is built. Lots and lots of Postgres tables have something like a created_at timestamp: you're keeping logs in a table that grows continuously, mostly appended at the end, with very few deletes, with monotonically increasing keys — serial primary keys, timestamps, things like that. If you don't update the rows very often (updates cause a row to move around on disk), the key will be strongly correlated with the position of the row in the table, because by default, without any holes in the table, Postgres keeps appending to the end.

BRIN indexes take advantage of this. Instead of a tree, a BRIN index records the ranges of keys in the ranges of pages that probably contain them. It's a lossy index, so the results have to be rechecked, but it's much, much smaller than a B-tree: it's not uncommon for a B-tree index that's multiple gigabytes to be something like 128 KB as a BRIN index. And if the correlation assumption is true, it can be much faster for retrieval — "get me all the orders from last year" can just go suck up a range of pages on disk. It's not great for heavily updated tables, because the rows move around a lot; or for small tables, where you won't get any advantage over a B-tree; and if the table doesn't have a monotonically increasing key that you're indexing on, BRIN won't help you — in that case, you just want a B-tree.
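(A sketch of the classic BRIN case — an append-mostly log table; names are hypothetical.)

    CREATE TABLE logs (
        id         bigserial,
        created_at timestamptz DEFAULT now(),
        message    text
    );
    -- Tiny on disk: stores summary key ranges per block range.
    CREATE INDEX logs_created_at_brin ON logs USING brin (created_at);

    -- If insert order matches created_at, this skips whole page ranges.
    SELECT * FROM logs WHERE created_at >= now() - interval '1 day';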
The last type we'll talk about is bloom indexes, which are based on bloom filters. This is a contrib module in Postgres. It's like a hash, only different — see, that's very intuitive. The way bloom indexes work is that you generally create a bloom index on multiple columns, and it takes the whole set of columns and combines them down into a signature, based on the bloom filter algorithm, which is very fast for set inclusion. This makes it very useful for multi-column searches.

For example, let's say you have 12 columns, all attributes you want to search on, and your queries are kind of arbitrary — they're being driven by a user interface, say, so sometimes you search on these three, sometimes on a different four, things like that. That's a hard problem to index in general, because what do you do? Create a B-tree index on each individual column? That ends up being a big set of indexes, because you need a separate index for each one. Bloom indexes are very fast at finding candidate rows in that circumstance, so you can create a single index on all of the columns and do the search against it. They're much, much smaller on disk than a B-tree, and thus potentially faster for a large number of attributes.
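(A sketch with made-up column names; bloom is in contrib, so the extension has to be loaded first. Matches found through the signature are rechecked against the heap.)

    CREATE EXTENSION bloom;

    -- One signature index across all the searchable attributes.
    CREATE INDEX items_bloom_idx ON items
        USING bloom (attr1, attr2, attr3, attr4, attr5);

    -- Any combination of equality predicates can use it to find
    -- candidate rows:
    SELECT * FROM items WHERE attr2 = 10 AND attr5 = 42;
    SELECT * FROM items WHERE attr1 = 7 AND attr3 = 3 AND attr4 = 1;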
Okay, so those are the index types. Now let's talk a little bit about the pragmatics of using these things. The first question to always ask yourself is: do you need an index at all? Indexes are expensive. You have to store them on disk; they significantly slow down updates and inserts; they slow down backups and restores — they take up more disk space, and if you're using pg_dump and pg_restore, the indexes have to be rebuilt on restore, which can take a long time. And remember that pesky constant: an index may not speed things up at all.

As a very rough rule of thumb, you only want an index if your typical query returns less than 15 to 20 percent of the table. The usual reason the planner decides not to use an index when you think it should is that it thinks the query is going to return a lot of rows, so it's more efficient to do a sequential scan.

It's really important to have good statistics to make indexes worthwhile, so make sure everything is getting analyzed and vacuumed properly. Also consider increasing the statistics target. The default in Postgres is 100, which means it keeps a histogram of 100 buckets of the distribution of values for each column. Consider increasing this if a column has a lot of distinct values, with more distribution than 100 buckets can capture: UUIDs, high-entropy text strings like URLs, hex hash values — things where 100 buckets just isn't going to capture the range of values. But don't crank up the statistics across the whole database; pick it column by column.

Here's an example. Say you have 10 million rows and the default 100 buckets, and a sensor_id column that isn't unique but has 25,000 distinct values — it's not the primary key, it's just a foreign key into some other table. The planner assumes the values are evenly distributed, so it hits one bucket in the statistics and says, "this looks like about a hundredth of the table" — and decides an index isn't useful here. But the reality is that significantly fewer rows will actually come back than that (unless, of course, your table is excessively heavily populated with that particular value — but assume a relatively even distribution). Cranking up the statistics on this particular field will probably generate better plans.
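(The fix for that example, as a sketch — the readings table and sensor_id column are the hypothetical ones from above.)

    -- Keep a finer-grained histogram for just this column, then
    -- refresh the statistics.
    ALTER TABLE readings ALTER COLUMN sensor_id SET STATISTICS 1000;
    ANALYZE readings;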
Indexes participate in multi-version concurrency control just like everything else does, and indexes store a pointer to every version of a tuple until vacuum cleans out the dead ones. How familiar is everyone with multi-version concurrency control? Show of hands — okay. As a quick overview: when you update or delete a row in Postgres, the on-disk tuple does not go away immediately; it isn't immediately recycled and made available for reuse. Vacuum has to come around and clean it up. Indexes store pointers to dead tuples just like they store pointers to live ones, and it isn't until vacuum comes along and cleans them up that those entries go away. So indexes can bloat, just like tables can bloat.

There is an optimization in Postgres called HOT (heap-only tuples) that somewhat reduces the size of indexes: under some conditions it doesn't have to store an index pointer to the new tuple version. But it doesn't completely eliminate the problem.

In the default case, index scans have to go out to the heap to determine whether a tuple is visible to the current transaction. The scan walks the index, and then, potentially for each leaf entry, it has to go out to the table and ask: can I really see this tuple, or has it been deleted? That can really slow down index scans. There's an optimization in Postgres for this: index-only scans. The scan consults the visibility map, which says "this page is fully vacuumed, there are no dead tuples, everything's fine," so it knows it has a hit without visiting the heap. If, when the planner starts, it looks like everything is pretty much up to date — it's probably going to get a lot of hits on fully-visible pages — it will use an index-only scan, which is significantly faster. The good part is that you don't have to do anything; this is a decision the planner makes. The only thing you have to do is make sure vacuum is staying up to date, because vacuum is what updates the visibility map.

Some index scans are lossy, which means the index knows there's probably a hit on a page, or that a tuple probably matches, but it doesn't know for sure just from walking the index. So it has to go out, get the tuple, and look at it to see whether it passes or not — it retrieves pages, scans them again, and throws away any rows that don't match. You frequently see this in plans as a Bitmap Index Scan plus a Bitmap Heap Scan. Some index types are inherently lossy — GiST and GIN can be — because they keep pointers to pages, not to individual rows.
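(What that looks like in a plan — a hedged sketch reusing the hypothetical docs table from earlier; real output will differ. The Recheck Cond line is the "scan it again and throw away non-matches" step.)

    EXPLAIN SELECT * FROM docs WHERE body @> '{"status": "active"}';
    --  Bitmap Heap Scan on docs
    --    Recheck Cond: (body @> '{"status": "active"}'::jsonb)
    --    ->  Bitmap Index Scan on docs_body_idx
    --          Index Cond: (body @> '{"status": "active"}'::jsonb)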
A relatively new fancy thing you can do in Postgres is covering indexes. Postgres was kind of late adopting these; other database platforms had them for a while. A query rarely returns just the primary key — you rarely search on a primary key and return only the primary key; you knew the primary key in advance, so why would that be the only thing you return? You almost always want other columns. Traditionally, Postgres then had to go out to the table, get the row, and fetch those columns as well, because, after all, they aren't in the index. With Postgres 11, you can add non-indexed columns to the index, so that if it's doing an index-only scan, it can return those columns without having to go out to the heap. This is a decision you make when you're building the index: "I think I know, from my query workload, that I'm going to be fetching these non-indexed columns a lot, so I'll just include them in the index." The catch is that it doesn't help on scans that aren't index-only, and remember that these columns have to be recorded in the index, so you're increasing the size of the index with each covered column you add.
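(A sketch of the Postgres 11+ INCLUDE syntax; the orders columns are hypothetical.)

    -- customer_id is the search key; status and total just ride along
    -- in the leaf pages.
    CREATE INDEX orders_customer_idx
        ON orders (customer_id) INCLUDE (status, total);

    -- If the pages are all-visible, this can be an index-only scan:
    SELECT customer_id, status, total
      FROM orders
     WHERE customer_id = 42;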
If you use GIN indexes, there's this kind of common thing that happens: people set up a GIN index, they let it run for a while, it builds and builds, and suddenly we get a call saying "my database grinds to a halt once in a while and I don't understand why." The first question I ask is: do you have an update-heavy table with a GIN index on a column? And the answer is usually yes. What's happening is GIN pending-list flushing. GIN indexes are quite fast to query, but compared to other index types they're really slow to update — so slow, in fact, that when you update a row or insert a new row, Postgres doesn't actually try to update the GIN index. It sticks the new entries in a separate structure called the pending list, and it merges them into the index at vacuum time, because it's much more efficient to do it in bulk. The problem is that when it has to flush this pending list, you get a surprising spike of activity: everything's running along, and suddenly, wham, up it goes. Especially if you're doing things like bulk loads, there is a function you can call, gin_clean_pending_list(), that causes this cleanup to happen. Consider calling it at the end of the bulk load so you don't get one of these mysterious spikes at some future point.
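(A sketch — the index name is the hypothetical one from the GIN example above.)

    -- At the end of a bulk load: merge the pending entries into the
    -- main GIN structure on your schedule, not vacuum's.
    SELECT gin_clean_pending_list('docs_body_idx');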
Indexes can, of course, be declared unique, meaning there can be only one row matching a particular entry. B-tree indexes support this; no other index type currently supports the UNIQUE clause. One thing you can do with unique indexes is optimistic insertion with recovery on conflicts: just try to slam the row in, catch the error, and keep going — or you can use the ON CONFLICT clause, which says "do an insert, and if there's a conflict, either keep going or switch to an update."

One thing to be aware of is that unique indexes tend to hurt concurrency a lot, because once you've inserted or updated a unique key, the lock on it has to be held so that no one else can sneak in and insert it in a concurrent transaction. Especially with something like a serial type, where there's a lot of clustering on single index pages during high-volume inserts, this can be a serious concurrency hit. If you're using a type like serial, or a UUID — which is unique in the sense that if two UUIDs generated close together conflict, you have bigger problems, like your random number generator is broken — you might consider making the index not unique to avoid these concurrency problems.

Postgres also provides a generalization of unique called exclusion constraints. Unique says: allow only one entry; don't allow a second entry that passes equality. An exclusion constraint generalizes that: only one value that passes this particular operator is allowed. This is very handy for what's sometimes called the room-reservation problem: you want an index that will reject two reservations for the same room number with overlapping date ranges. There's no way to do that with a unique index, because nothing is unique — the date ranges could all be distinct and still overlap. But with an exclusion constraint, you can say: don't allow two rows where the room numbers are equal and the date ranges overlap. It's a very nice, powerful technique.
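(A sketch of that room-reservation constraint. The btree_gist extension — which comes up again later — is what lets a plain integer take part in a GiST-backed constraint; the table layout is hypothetical.)

    CREATE EXTENSION btree_gist;

    CREATE TABLE room_reservations (
        room   int,
        during tstzrange,
        EXCLUDE USING gist (room WITH =, during WITH &&)
    );

    INSERT INTO room_reservations
        VALUES (101, tstzrange('2019-06-01', '2019-06-05'));
    -- Same room, overlapping range: rejected with a constraint violation.
    INSERT INTO room_reservations
        VALUES (101, tstzrange('2019-06-04', '2019-06-06'));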
Okay, so there are a ton of index types and a ton of concerns. How do we decide which one to use? Here's a decision tree. First, gather some information: the typical queries on the table; the columns, data types, and operators being queried — including those in joins, because a join is a query just like anything else — and how many rows those queries typically return.

So: how many rows come back? Does the query typically return a large percentage of the table — 25%, 50%, something like that? This includes hidden row fetches like count(*), which in effect has to fetch each row in order to determine its visibility to the current transaction. If so, you probably don't need an index, and if the query is underperforming, you might refactor it so it doesn't have to do that scan — summary tables or other techniques — because just throwing an index at the problem probably won't help. Small tables that fit in memory probably don't need indexes at all, except to enforce constraints. (Q: Can't Postgres use the visibility map for count(*)? A: It does — count(*) is optimized in that regard — but if you're doing a count(*) across 50% of the table, it still in effect has to walk fifty percent of the index, and you probably don't want that.)

If you have a multi-predicate query — where this equals this, and this is greater than that, and so on — which column should you index? Always start with the most selective predicate, the one that cuts the number of rows down the most. If the predicates don't cut the results down individually, but do together, that's when a multi-column index is useful; but for now, let's consider single-column indexes, and we'll talk about multi-column ones later.

Is the column a small scalar — int, bigint, float, UUID, timestamptz? (inet and the char types have their own characteristics, so worry about those later.) Is the value a primary key, or otherwise unique? If any of those are true, use a B-tree; that's your choice.

Is the field monotonically increasing (or, I guess, decreasing — it's a possibility) on a large, rarely updated table — insert-heavy, very light on updates — and is the query generally a range operation? A log table is the classic example. If that's all true, a BRIN index is probably the right choice. If none of those are true, probably a B-tree.

(Q: For a BRIN index, how rarely updated does the table need to be? A: Really rarely updated at all, because an update moves the tuple around even if you're only changing a non-indexed column. Q: Does it really have to be monotonically increasing? A: Not monotonic in the sense of plus one, plus one, plus one — but continuously increasing, or at least clustered: what you want is for the value in the column to be highly predictive of the insert order. Q: I sort of have a situation like that, where there'll be a group of rows with one value, and the next day a different value. A: Give it a try, then.)

If the index is primarily intended to support ORDER BY ... DESC — indexes help with sorting as well — then create it as a descending index; otherwise, just create it ascending.
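(That last point as a sketch, with hypothetical names: if the hot query is "newest first," build the index in that order.)

    CREATE INDEX events_created_at_desc_idx ON events (created_at DESC);

    -- Reads the index in its native order:
    SELECT * FROM events ORDER BY created_at DESC LIMIT 20;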
Is the column a text field — varchar, text, or char if you're weird? (char is so bizarre. For those people who are nostalgic for punch cards, char is there for you.) Are you doing full-text search, trigrams, or other search techniques, including unanchored searching like '%blah%'? That's a trick question; we'll talk about those later. Is the data structured and prefix-heavy, and you're typically doing prefix searches? Not random text or names or addresses, but inherently structured things — a hierarchy separated by dots, or URLs where you typically query on the domain. That's what SP-GiST is really good at, so consider SP-GiST for that. Is it small — under 200 characters is kind of arbitrary, but around there — or do you require total ordering? If either of those is true, use a B-tree; otherwise, a hash index might be the right way to go.

Is the column a bytea? Why are you indexing a bytea? Please do not index byteas. If you must, use a hash index, or calculate a hash yourself and store it separately.

Is it a range or geometric type? Postgres has ranges as built-in types, and they're really, really handy, as are the geometric types, including the PostGIS types. This is what GiST is for — all PostGIS indexes use GiST, so you're set there. There's a problem in geo databases sometimes called the Starbucks problem. You'd think the thing a geo database should be good at is "find me the nearest five Starbucks" — what else would you ever use a geo database for? But that's traditionally a very hard problem, because geo databases really had one basic operation: get me all the points in this box. If we're standing here in San Francisco, finding the nearest five requires a very small box; if you're in Bozeman, Montana, it requires a really big box to find five Starbucks, and guessing the size of that box is a hard problem. Fortunately, Postgres first, and then PostGIS, solved this with nearest-neighbor indexing, which GiST just handles for you. So use GiST for that. Also, if your data distribution is highly skewed, you might try SP-GiST — just create the index and see if your queries run faster with it. It's very good when the points are highly clustered in, say, one quadrant of the whole space.

Are you indexing inets? (inet is the generic type for both IPv4 and IPv6 addresses.) If you're just doing equality, just use a B-tree. You might try hash, but it probably isn't going to be any better, especially not for IPv4, for obvious reasons — they're about 32 bits — but try hash and see if it's better. Are you doing prefix searches, though? That's not an uncommon operation, especially for v6 addresses. Consider SP-GiST: it's very good for this kind of prefix search, and the IPv4 and IPv6 spaces are highly unevenly distributed, which is exactly the kind of distribution SP-GiST is good at.
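(A sketch of the nearest-neighbor case with the built-in point type and a hypothetical cafes table; PostGIS works the same way with its own types and operators.)

    CREATE TABLE cafes (name text, location point);
    CREATE INDEX cafes_location_idx ON cafes USING gist (location);

    -- <-> is the distance operator; KNN-GiST makes this cheap,
    -- without any box-size guessing.
    SELECT name
      FROM cafes
     ORDER BY location <-> point '(37.77, -122.42)'
     LIMIT 5;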
Is the column an array or a jsonb? If you're just doing equality, use hash. If you're searching for keys and values, use GIN. (This slide is slightly obsolete now — I was looking for the updated one and didn't find it for some reason — you used to need expression B-tree indexes for values nested inside objects, but now you can use GIN for any key/value search, so that's cool.)

Who's using json, without the b? Anybody? Who's using json with the b? Yay, okay, good. Remember, Postgres does have a type that's json without the b, which is basically a wrapper around text: it takes the literal JSON blob, drops it into Postgres, and all it does is check that it's correctly formed. It doesn't do anything else to it. So: why is this column json? Your only real option for indexing it is an expression index — I mean, I guess you could do equality, but why would you do equality on unstructured JSON blobs? If you need indexing on a json column, convert it to jsonb, which has full indexing support.

Okay, back to text and fuzzy searching — and by fuzzy searching I'll include something that's not strictly speaking fuzzy, but unanchored searching, where you're doing '%fred%', looking for things in any position. If you're doing full-text search, create a tsvector from the text — the parsed-down text-search vector — and create a GIN index on that. You can either store it as a separate column or create an expression index; either way works fine. Separate columns are better for complex tsvector creation, because tsvectors can have multiple inputs — you can have the title, an abstract, and the body of the text all as inputs into a single tsvector, with different priorities — and for that, a separate column is the way to go. If you're doing fuzzy search, create an index on the column using gist_trgm_ops, which is part of the pg_trgm package. This will also help if you're doing those unanchored searches, which are otherwise extremely slow in Postgres, because no other index will help you.

Is there more than one column in the predicate of your typical query — are you saying WHERE a = 1 AND b = 2? Consider creating a multi-column index if the predicates together are selective but independently they're not. If a = 1 AND b = 2 cuts it down to, like, four rows, even though there's a large number of a = 1 rows and of b = 2 rows, a multi-column index will probably be useful. Just remember that if the index is on (a, b), Postgres will almost never use it for a search on b alone. (I used to say it would never use it; I've actually seen it happen, and I'm not sure why the planner did that.) Find the right index type for each column individually, and create the index ordered with the most selective column first.

One situation you can run into: say this column is an integer and that one is a geo type that needs GiST. You can't have a multi-index-type multi-column index — the columns all have to be B-tree, or all GiST. The good news is that if one column requires a GiST index, there's a package called btree_gist, so you can include things like basic ints and other scalar types in GiST indexes.

If the query pattern is arbitrary equality comparisons against various columns, consider a bloom index. This is common: you have 12 or 20 different attributes and a user interface where the user can type in different combinations and search on some and not others. Very common usage pattern, very hard to optimize for, because a B-tree index on each column individually would mean 20 different B-tree indexes, and that's a lot of indexing. Consider distilling it down to a single bloom index.

If the predicates are selective independently — a = 1 cuts the table down a lot by itself, without the b, and b = 2 similarly — you might consider creating individual indexes on both, because Postgres can scan the individual indexes and AND them together. Test that pattern instead.

Does the query contain an expression? Then use an expression index: you can index on expressions in Postgres just like you can index on single values. For example, a common pattern is to index lower(unaccent(name)) so that you can do case-independent, accent-independent searching on a name. (Don't forget that Postgres has a case-independent text type, citext, for just the lower() problem.) Just make sure that particular expression is heavily queried, because these indexes are expensive to maintain. And if you're getting extremely clever and indexing on your own function, remember that you have to declare it IMMUTABLE — and it really has to be immutable. An immutable function in Postgres is a pure function: it returns the same value for the same input regardless of the state of the system at that moment. It doesn't query a table, it doesn't care about the time of day, none of that — you'll start getting weird behavior if the function is not actually immutable.

What if one predicate is really selective? Say, customer_id = 12 AND active, where active is really selective — only 10% of the orders in the whole system are active. Consider creating a partial index: you can create an index and stick a predicate on it, and it will only be considered for queries that match the predicate — it won't even consider orders that aren't active. The index will be much smaller — a tenth the size — and that's a considerable performance improvement if this is a very common pattern.
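(Both of those as a sketch, with hypothetical names. Note that the expression in the query has to match the expression in the index.)

    -- Expression index for case-independent lookups:
    CREATE INDEX people_lower_name_idx ON people (lower(name));
    SELECT * FROM people WHERE lower(name) = 'pettus';

    -- Partial index: only the ~10% of active orders are indexed.
    CREATE INDEX orders_active_customer_idx
        ON orders (customer_id) WHERE active;
    SELECT * FROM orders WHERE customer_id = 12 AND active;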
Okay, and with all this, there are some tools you can use to help out with your indexing. The first question is: do you need an index at all? Look at the view pg_stat_user_tables for tables with a lot of sequential scans — Postgres keeps track of every time a table is sequentially scanned — and if you see a very large number, take a look at the query pattern on that table and see if there's something you can do to help. Not all sequential scans are bad; just dig into the particular queries, look at their execution plans, and see what's going on. You can also use pg_stat_statements, and run your text logs through pgBadger — a very helpful tool.

Once you've created an index, is it actually being used? There's pg_stat_user_indexes: look for indexes that aren't being used at all, and drop indexes that aren't benefiting you. If they're primary key indexes, or used to enforce a constraint, you may need to keep them; other unused indexes, just drop. Indexes have a very high intrinsic cost: adding seven indexes to a table can increase insert times dramatically — the figure I use is about 28 times slower.

Are your indexes getting bloated — that is, very large compared to the actual amount of live data inside them? Vacuum can't recover space from an index as efficiently as it can from the heap, because indexes have structure, and it can't always prune them down the way it can a table. If you have indexes that are continuously getting bloated — on high-update-rate tables, for example — you might want to consider rebuilding indexes periodically. There's a script for doing an index bloat check.

Are your indexes corrupted? It doesn't happen that often, but it does happen: once in a while you get bad results, or errors when you try to query something and it's using that index. Postgres 10 and higher has a tool called amcheck you can run on your indexes; the author has done everything in his power to ensure it will not return a false positive, although it can't guarantee the index is good. The good news is that a corrupted index is really easy to fix: you just drop it and recreate it.

So, to conclude: indexes are great. Just remember that they're an optimization, and always create them in response to a particular query situation. Don't just throw indexes on columns because you can — this is kind of an anti-pattern in a lot of ORMs, because they make it so easy to create an index: in Django you say db_index=True and you're done. But they're not free. Experiment: Postgres has lots and lots of index types — don't just throw a B-tree index on everything; experiment with the different types, because some of them can be much, much more efficient. And monitor their usage and size: once in a while, poke into pg_stat_user_indexes and see if you have any indexes with usage at zero or a very low number, and consider dropping them; similarly, go into pg_stat_user_tables and see if you have lots and lots of sequential scans on something. And just remember that pesky constant.

And that's everything. Thank you! Questions? And by "question" I mean something that ends with a question mark and that you don't already know the answer to.

Q: You mentioned a really interesting situation where, with seven indexes, inserts can go 28 times slower. Could you elaborate on that? Is it due to concurrency issues, or what?
A: Every index has to be maintained: when you insert a new row, it potentially has to update every index, and indexes are not cheap to update. You have to hunt down the right place in the index and drop the row pointer in, and sometimes it has to do page splits because a page is full — and a page split on a B-tree index is not a cheap operation; it can propagate pretty far through the index structure. And it gets worse with expensive-to-update index types like GIN. So you can significantly slow down pure inserts into a table by throwing a lot of indexes onto it: if there are no indexes at all, really all Postgres has to do is drop the row at the appropriate location on disk, and it's done. Whether it will be exactly 28x in your workload, who knows — but it will definitely not be zero; each additional index slows everything down a little bit.

Q: Are there any new index-related features in Postgres 12?
A: There are a lot of index optimizations — no whole new index type or anything like that, but check out the release notes; there's a whole section on index optimizations.

Q: Speaking of the cost: is there some monitoring, or an easy way to quantify the additional insert cost of each additional index?
A: You can certainly monitor it directly, by looking at how long each insert is taking — especially on bulk loads like COPY, where it's easy to capture the whole operation as a single measurement. But there's no simple tool that says "if I create this index, how much slower will an insert be?" Unfortunately, this is something you want to test. I don't want to scare people away from using indexes, period — if the index is going to be valuable for a query operation, by all means create it. The main thing I'm pushing back on is the tendency to create indexes prospectively: "well, I think I'm going to query on this column, I'll just throw an index on it." Don't do that. Look at a slow query and say, "ah, I understand why this query is slow; this index will help."

Q: In the decision tree, you said that if you do ORDER BY ... DESC, you might want to create the index descending. Could you elaborate? I thought you could scan an index backwards.
A: You can — it's just slower to scan an index backwards than forwards. So if the primary function of the index is to serve ORDER BY ... DESC, it will be faster to create the index in descending order. An ascending index can be used to process the descending sort; it's just not going to be as fast as an index that was already built descending. I wouldn't create two, one ascending and one descending, unless you have a very, very specific workload that requires it.

Q: When you say "make an index and try it out" — there's a lot in that sentence. How do you test an index?
A: If I'm sitting there with the query in front of me, I can create the index and run the query again. There is also a tool — whose name I'm totally blanking on, apologies for that; it's googleable — that lets you try out indexes prospectively, in the sense that it creates them but marks them invalid (which is a state an index can be in in Postgres), and then it kind of hacks the planner so it will consider them anyway. So you can ask: if I create this index, what new plan will I get? It can't actually run the query, of course, because the index hasn't really been built, but that's very handy if you're working on a giant database where it would take the rest of your lifetime to create the index just to try it.

Q: But there's a lot of caching happening, right?
A: I just do everything on a hot cache, because the main thing I'm looking for is: does the query plan change? Caches come and caches go; the query plan is much more stable.

Q: GIN indexes keep a pending list, right? Does that somehow affect rebuilding the index — does it slow that down?
A: When it rebuilds the index, it rebuilds it in its normal structure; it doesn't use the pending list for a CREATE INDEX operation. The pending list only comes into play for updates and inserts that change the GIN-indexed column; that stuff stays in the pending list until it gets flushed.

Q: Is there something similar to columnstore indexes from SQL Server? They have those for analytics.
A: Not in core Postgres. There are extensions you can plug in where the table's storage is a columnar store — Citus has one that you can just drop into Postgres. Interestingly, we're getting pluggable storage for the first time: the framework comes in with Postgres 12, and probably in 13 we'll get actual alternative storage in core. But right now, there's nothing like that in the community distribution.

Q: What's the easiest way, in the Postgres catalogs, to find out what columns are indexed, and the names of those indexes?
A: You can look at information_schema; it's there. You can also go directly into pg_class — all the indexes are there, with the columns that are in them — though you have to fan out across multiple tables, and it's not really intended as a user-friendly data structure. If you're writing a query and want to know what indexes are on a table, there's always \d+; that's what I'd use. You can also capture the query psql sends to the database, because all it does is send a perfectly ordinary query and format the results. And there's a function you can call on an index that returns its definition, in a form you could actually use to recreate the index if you wanted to.

Q: On some internal tables, like pg_proc, you can set permissions so an admin can create indexes on them. Are there any dangers to that? Because now you have two kinds of indexes: ones built into the catalog, and ones you built out in user space.
A: Off the top of my head, I can't see a danger on a database-integrity level, but I can see operational problems: for example, those indexes aren't dumped by pg_dump, so if you migrate the database, suddenly you won't have them — and they might have been critical in some way. Any time you start fooling with the system catalogs, you should be sure it's the only conceivable way to fix whatever problem you have. I've certainly never felt motivated to create extra indexes on the system catalogs — but never say never.

Q: Say there are two multi-column indexes with the same prefix, and they're both in use. I suspect one is redundant and I could probably drop it. Is there a safe way to try eliminating one? If I mark it invalid, can I revive it quickly?
A: If it's marked invalid, the only way to get it back is to rebuild it. REINDEX will rebuild it, but REINDEX locks the table while it's running, so for minimum downtime, what you probably want to do is drop the index and rebuild it concurrently, with CREATE INDEX CONCURRENTLY. And yes: if you have an index on (a) and an index on (a, b), you can probably drop the index on (a), unless it's being used to enforce a constraint like a primary key, because Postgres can use the index on (a, b) for the same kind of thing. The only place that becomes a little bit fraught is when the wider index is on a whole pile of columns — first of all, those are kind of dubious indexes anyway — because if you have an index on just (a) and an index on twelve columns, the first of which is a, the twelve-column index is storing really big keys, and you may start getting performance problems using it for searches on a alone, depending on your workload. But that's a pretty edge case that doesn't come up often.

Anything else? Okay. Test Postgres 12! We try to meet every month; we are always looking for speakers and we're always looking for venues, so please get in touch with us via Meetup if you feel like offering either of those. And thank you so much for coming. [Applause]
Info
Channel: San Francisco Bay Area PostgreSQL Users Group
Views: 67
Rating: 5 out of 5
Id: Vn5CmTu3_0g
Length: 61min 8sec (3668 seconds)
Published: Thu Jan 07 2021