Postgres Open 2016 - Identifying Slow Queries and Fixing Them!

Captions
Awesome, very cool. So, I'm Stephen Frost. You might have seen me before at this conference, maybe not, and you might have missed the opening, that's fine too. You probably got email from me, a whole bunch of email, hopefully not too much. All right, so we're going to be talking about slow queries and fixing them.

A bit more about me, just for giggles: I'm the CTO at Crunchy Data. I'm also a committer and major contributor for Postgres; I worked on row-level security in 9.5, did column-level privileges in 8.4, wrote the roles system, which was fun, and I've made contributions to PL/pgSQL and PostGIS.

So when it comes to finding slow queries, there are three main approaches that I encourage people to use. First of all is logging basically everything in the world. There are upsides and downsides to this: the upside is that you get lots and lots of information; the downside is that logging can actually slow your system down, so that's something to be aware of and to consider. Once you've done that, the next thing is log analysis. With all that logging you get tons and tons of logs, and you need to analyze them. One of the ways to do that is pgBadger, which has become my favorite go-to tool, and I'll show some of its results as we go through this presentation so you can take a look at it. A newer capability that people may not be familiar with is something called pg_stat_statements. I'm going to talk about pg_stat_statements and how you can use it to see what queries are running in your system, and this is something you can look at while the database is running: you see the queries being run right then. With pgBadger you have to log the query and then afterwards go back to the log and pull out all the statistics; pg_stat_statements is live, "what's happening right now" information, which is really handy.

When it comes to logging there are a lot of different options in PostgreSQL, and these are all very important, at least in my opinion; these are the things you want to be looking at. I'm going to go through each one and talk about why it matters.

First off, log_min_duration_statement. This is what really gets the information we need into the log file, so that something like pgBadger or one of the other tools can analyze those queries, roll them up, and give you the statistical information you're looking for. Setting it to zero means you log every single statement sent. If you want to be a little less aggressive, you can set the number above zero; the value is in milliseconds by default, and I think you can specify the unit explicitly now, we made some improvements, especially around 9.5, to allow that. Then whenever a query takes longer than that threshold, whether you set it to a second, 100 milliseconds, or 50 milliseconds, depending on your environment, the query gets logged, and the log entry includes the duration on the same line as the query. So in the example here you can see a one-second query, about a thousand milliseconds, with both the duration and the statement on the same log line. This is really important for log analysis tools: if you use log_statement and log_duration instead, you end up with those on two different lines, which is much more difficult to analyze, so you definitely want log_min_duration_statement set to 0.
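(A rough sketch of the setting being described; this uses ALTER SYSTEM, though editing postgresql.conf and reloading works just as well, and the 100ms value is only an example:)

    -- log every statement along with its duration, on one line
    ALTER SYSTEM SET log_min_duration_statement = 0;
    -- or, less aggressively, only statements slower than 100ms:
    -- ALTER SYSTEM SET log_min_duration_statement = '100ms';
    SELECT pg_reload_conf();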
The next thing is that we need a log_line_prefix, especially if you're using the built-in Postgres logging. This is the one I like to use; it's essentially ripped from exactly what pgBadger is looking for by default, which I find to be quite a good prefix, although you can customize it if you want, and pgBadger has options to tell it what kind of log line prefix you have. In this particular one we have the timestamp, which is really important for doing analysis across time: it gives you more than just the duration, it tells you when the query ran, and it sets the time frame for the beginning and end of your report. Then the process ID and session line number, which are pretty straightforward. Then the logged-in user, which is kind of interesting: it's the user that logged in to the database, and if you change user with something like SET ROLE it doesn't update, it's still the user that logged in, so be aware of that. Then the database that was connected to, and the application name if set. Many people may not be aware that we have this application_name capability: psql sets it to "psql", pgAdmin sets it, and if you're writing your own code you can set it to essentially whatever you want for each database connection, which can be really handy for breaking up your log file by which applications are connecting. That might be something to look into, although I'm not going to cover it in detail here. Then of course the remote host, and then %q, which basically means "stop adding the rest of the prefix here" for processes that aren't attached to a session; you'll see why that matters in a minute. For example, here we have a simple statement logged with this prefix, and obviously it has a lot more information on it than the earlier one that just had the duration and statement: it includes the timestamp and the other details you need to do the analysis you want.

The next thing is log_checkpoints. Checkpoints are something the database does periodically; by default, every five minutes or so it does a checkpoint, which means writing out all the dirty data to the backing heap files, the data files. log_checkpoints tells us when a checkpoint started, why it started, how long it ran, and a whole bunch of other really useful statistics that tools like pgBadger can pick up and report on. One of the things I see when I go into a place and people complain that slow queries happen "every couple of minutes" is that it's probably because every few minutes a checkpoint starts, and while we're writing out all of that data the entire system, or at least the queries, end up slower. That's something to look for, and sometimes you can correlate when the checkpoints are happening with when the slow queries are happening, which is really useful information.
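(A sketch of the kind of prefix being described; this particular string is the one pgBadger's documentation suggests rather than a quote from the slides, so treat it as an assumption:)

    # postgresql.conf -- a reload is enough for these
    log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
    log_checkpoints = on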
The next one is straight-up connection logging: just logging connections and disconnections. It's pretty simple, really important and really useful, but not too complicated. You actually end up with three log entries: a "connection received" entry, then the authentication information once the connection has been authorized, and then a disconnection entry. This can help you analyze how long connections to the database live, and in particular whether you have a lot of short-lived connections, which is usually a bad thing; you want a connection pooler or something along those lines to improve your performance. One of the things people run into when they're trying to figure out why things are slow is that they're connecting for every query, which is ridiculously slow and takes a lot of time. (Yes, this slide is about connections; my bullets are wrong on it, sorry, that one slide didn't get updated.)

Logging of lock waits is another one that's really important and that people don't always realize they need. Why is your query slow? Maybe it's waiting on a lock. This happens all the time without people knowing it. If you turn on log_lock_waits, then after one second Postgres runs something called the deadlock check: a routine that looks at whether there are any deadlocks among the currently running queries, because if I'm waiting on a lock and somebody else is waiting on me, we can end up in a deadlock. The other thing that happens during deadlock detection is that we get a chance to report all the locks in the system and who's waiting on whom. So once a query has been waiting on a lock for more than a second, if you've enabled log_lock_waits, you get this really handy information about who's waiting on whom. In this case, the first line says we're still waiting: process 29554 is still waiting for a share lock on a transaction. After we'd been waiting for a second, the deadlock timeout hit, the deadlock checker ran, and we find out that we're waiting on a lock and who's holding it: process 29617 holds the lock we need. We also see what kind of lock; in this case we're waiting on a tuple lock. The way this happened is that I did a SELECT * FOR UPDATE on a table in one session, then opened another session and did a SELECT * FOR UPDATE on the same table. So we see which tuple it is, the (0,1), basically the first tuple in the table, we see the relation, and we see the statement I was running. This is very useful information for doing deadlock or lock analysis, so if you don't have log_lock_waits enabled, I definitely recommend it.
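(A minimal way to reproduce the lock-wait logging just described, assuming log_lock_waits = on and the default 1s deadlock_timeout; the table name is only an example:)

    -- session 1: take row locks and hold them
    BEGIN;
    SELECT * FROM t1 FOR UPDATE;

    -- session 2: blocks on the same rows; after about a second the log
    -- gets a "still waiting for ShareLock on transaction ..." line that
    -- names the waiting and the holding process
    BEGIN;
    SELECT * FROM t1 FOR UPDATE;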
Another thing that can cause problems is temp files. Whenever temp files are being created, the database is doing some amount of disk I/O, actual work writing data out. In this case you can see that I ran a SELECT * FROM t1 ORDER BY 1, so we're doing a sort, and that sort is using a temp file, which tends to be expensive because it means going out to disk for a disk-based sort. The way you detect this is by logging the temp files Postgres creates: set log_temp_files = 0, and every time Postgres creates a temp file while running a query, for anything, it logs information about what that query was. That can be really helpful for figuring out that a slow query is slow because it's using a lot of disk, and then you can go look at that particular query and ask why it's using all of those temp files: is it doing a disk-based sort, or something else?

All of the logging options I've covered require only a reload. There are some options that require a restart, but none of these do. Feel free to ask questions as we go; I'm happy to answer them, otherwise I'll keep plowing through.

All right, log_autovacuum_min_duration. Another thing people often complain about is "autovacuum is running, it's vacuuming stuff and killing my system, and I want to stop it." Don't ever do that, for starters, because it just becomes a real problem: you end up with a lot of bloat, which I'll talk about in a bit. But if you're curious what on earth autovacuum is doing, enable log_autovacuum_min_duration and set it to zero. That way you see everything autovacuum does, across every table, every time, and you get all this wonderful information back: how many dead tuples it found, how many it was able to mark as completely removed, how many were dead but not yet removable. You also get to see what kind of system utilization was required and how long that autovacuum run took. Lots of really great information that lets you do this kind of analysis work.

All right, now a bit of log analysis. pgBadger is the tool, and it's really straightforward. There are other tools out there, but pgBadger is definitely my favorite. On a Debian-based system it's just "apt-get install pgbadger"; that's it, you're done. All you have to do is run pgbadger and pass it the log file, and you've got your report. That's all there is to it, and it's fantastic.
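(Roughly what that looks like on a Debian-style system; the log path is a guess, point it at wherever your Postgres logs actually live:)

    sudo apt-get install pgbadger
    pgbadger /var/log/postgresql/postgresql-9.5-main.log -o report.html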
So what I'm going to do, and we'll see how well this works, is bring up the report; there it is. This is going to be a little tricky because I'm trying not to move too far, since they're recording this, but here you can see the statistics that are available with pgBadger; these are just some global stats. One of my favorite things to do with pgBadger, though, is to come over here and look at things like the most frequent queries. What I've done here is use something called pgbench: I ran pgbench against my system a couple of times, and now I have all this nice information: for each query, the total time, the minimum time it took to run, the maximum, the average. Lots and lots of really detailed information for finding these queries.

I could spend an entire talk just on pgBadger, and unfortunately I don't have that kind of time, but another good report is the histogram. I don't have very many queries running, so it's not a very interesting report here, but it can be really helpful, and you can see the slowest individual queries down here. Included in this is the actual loading of the pgbench data, which is why you see that COPY there, and I think all those ENDs are basically commits: whenever pgbench commits, that requires writing a write-ahead-log record and flushing it to disk, and that's why it takes time.

The other report I was going to mention is the time-consuming queries. This one's also really good: it adds up, over time, which queries took the largest total amount of time. Here you can see these UPDATE statements, which isn't too surprising, because they were executed so many times that they end up being a large chunk of the overall time in this run. You can also see the INSERTs taking up a significant amount of total time, even though the individual executions aren't taking very long each; it's really interesting to see just how much total time they add up to. The other thing that's particularly useful is when you see numbers like these: the min duration versus the max versus the average. Here you can see something interesting is happening, because there were times when these took a long time to run, 300 milliseconds for a given query; that's the kind of thing I end up doing a lot of analysis work on.

So anyway, that's just a really handy tool. Like I said, definitely install it, run it against your logs, play with it. I don't have time to cover it in complete detail, there's a lot of other material to get to, but I definitely wanted to show it to everybody. (Question: does that show up as the same query in pgBadger?) I believe the answer is no, not with pgBadger, but what I'll show you next is pg_stat_statements, where the query actually gets normalized, and it will show up as the same query. So you can't get that out of pgBadger, well, you can for the top ones, but not for all of them. Any other questions on that? All right, good stuff.

Okay, pg_stat_statements, which is what I was going to talk about next, is really handy. To install it you have to add it to what's called shared_preload_libraries, and in this case you do have to restart Postgres, because it's in shared_preload_libraries; then, after you've restarted Postgres, you log in and run the CREATE EXTENSION statement: CREATE EXTENSION pg_stat_statements.
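(A sketch of those install steps; the module and extension names are the real ones, everything else about the setup is assumed:)

    # postgresql.conf -- needs a full restart, not just a reload
    shared_preload_libraries = 'pg_stat_statements'

    -- then, after the restart, in the database you want to inspect:
    CREATE EXTENSION pg_stat_statements;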
Here's what it looks like in terms of all the different columns you get. What you have here is a view over a function provided by pg_stat_statements that exposes all of this information on essentially a per-query basis. The number of queries it stores is configurable, but when you change that you do have to restart Postgres, because pg_stat_statements uses shared memory to track all of this, so unfortunately a change there means a restart. I believe the default is about 5,000, though, so there's a pretty good chunk of space for the queries you can see through this view.

The information you get out of it is a lot like what you saw in pgBadger, but it's more current, because it hasn't had to be written to a log file and run through a log analyzer. You get calls, total time, min, max, mean time, standard deviation, the number of rows returned. And there's more: shared block hits, shared blocks read, how many blocks were dirtied, how many were written. All of that is available through the pg_stat_statements view. I'm not going to go over every single column, but if you have questions about any of them I'm happy to try to answer.

Here's an example of what you get after actually running some queries; again, this was run with pgbench. You can see a query ID has been generated. That query ID is consistent, for the most part, for a given query; it's essentially a hash of the query. There's been a lot of arguing inside the Postgres community about it, so it's not something you should necessarily trust across Postgres major versions. Generally speaking we've tried to say it won't change across minor versions, although I wouldn't be surprised if that ends up happening at some point; so far I don't believe it has. What this gives you is the query itself, the number of calls, and the total time, so you can see how much total time running this query took. It's all broken out by query ID and also by user, so if you have different users running the same query you'll get different entries in here, and you can see the number of rows and lots of other useful statistics for doing analysis. Something you might consider doing is periodically querying this view for that information.

(Question about the default retention of this information.) You can tell pg_stat_statements to save it out and reload it on restart, and I think that's actually the default, but you can also turn it off. As long as a query is among the ones currently being tracked, which again tops out around 5,000 entries, it keeps accumulating that information. There's also a reset: pg_stat_statements has a reset function you can call to clear the information. That may be what you're looking for: reset it, collect for the next hour, reset it again, so if you have something that comes in periodically and checks it, that's a typical way to do it.
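(A hedged example of that kind of periodic check, using the column names from the 9.x-era view:)

    -- top statements by total time spent
    SELECT queryid, calls, total_time, mean_time, rows,
           shared_blks_hit, shared_blks_read
      FROM pg_stat_statements
     ORDER BY total_time DESC
     LIMIT 10;

    -- start a fresh measurement window
    SELECT pg_stat_statements_reset();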
Any other questions on that? (Question: do we know how much overhead there is?) I'll say it's not free, but I couldn't tell you offhand what the percentage is; unfortunately I don't have that number. I don't think it's that much, but it's really going to come down to your workload: if you have lots and lots of really simple queries, it will have a larger impact than if you have fewer, larger queries running. Any other questions? All right, good stuff.

Here's another example: a straight-up SELECT query, with another query ID, and you can see the query ID is different because it's a different statement. One of the things this does, as was brought up before, is that even if you're not using prepared queries, pg_stat_statements in newer versions, I forget exactly when we changed it, will normalize the query for you and combine queries that have essentially the same parse tree, that look the same, even when you're not using prepared queries. That's one of the really neat things. In older versions it worked fine if you were using prepared queries, but if you weren't, they would show up as different records in the view, and that wasn't helpful.

All right, now let's talk about understanding why queries are slow. Queries can be slow due to configuration issues, bloat, and the query plan, and probably some other things, but those are what we'll talk through here. The first thing is Postgres configuration. I'm not going to go over all of these; there are lots of different configuration options you can see here. I'll talk through some of them as we go and cover a few more at the end, but be aware that these are all ones to look at in the Postgres config file.

All right, dead tuples and bloat. The way Postgres handles dead tuples is with vacuum: vacuum goes through and marks records as reusable, and reusable tuples get used for new inserts and updates as we make changes to the table. But as long as those dead tuples exist inside the table, Postgres has to consider them. Tables can have lots and lots of dead tuples, and indexes can have bloat too, so there can be cases where you want to consider reindexing, particularly if what the index covers has changed significantly. For example, with a time-based index and no partitioning, you can run into a case where the range of values you're indexing now is completely different from what you were indexing a week ago: everything has shifted, you've rolled off the old week, the new week has come in, and all the old records are gone, but the index still carries all of those entries and pointers, so you can end up with index bloat in those cases. So there can be cases where it makes sense to reindex, although I try to discourage the "we're just going to reindex every week" kind of approach, because that's a little too hard and fast a rule in my opinion. But definitely look for bloat, and consider whether you want to be reindexing more often than you are today, or at all if you aren't.
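(One hedged way to spot tables worth a closer look, using the standard statistics views; the threshold here is made up for illustration:)

    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
      FROM pg_stat_user_tables
     WHERE n_dead_tup > 10000
     ORDER BY n_dead_tup DESC;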
check_postgres is a great tool for that: if you're using Nagios, or anything that looks like Nagios, check_postgres.pl can be very helpful. I will say that it helps you identify tables you probably want to look at to consider whether there's bloat there; don't necessarily take its metric as gospel, though, because some bloat is definitely useful, and for some tables a lot of bloat is useful. What happens is that if there's no room in the table for a new record to go in, we have to do a relation extension, and that can be very expensive in Postgres: it takes out a heavyweight lock, which can be quite painful, especially if you're doing lots of concurrent bulk loading. So there are a lot of cases where it's actually better to have that free room inside the table, assuming you have the disk space for it, so the new row can go in without having to extend the relation.

Eliminating all bloat actually requires rewriting the table; CLUSTER and VACUUM FULL are both options for that, and I wouldn't generally recommend either as a routine thing. CLUSTER, though, rewrites the table in the order of a particular index, which can be useful in some cases. Because Postgres secondary indexes point directly at the table, unlike in some other systems, it's not as big a deal here to cluster on your primary key; in a lot of other systems it's really important, because everything is looked up through the primary key whenever any index is used, but we don't do that in Postgres. You may still want to do it if there's one index you hit really hard and you find it valuable to have the table clustered on it, especially if you're doing something like an index-only scan across that table; I can see reasons you might want that, but I wouldn't go to it immediately. Generally speaking, though, I'd recommend CLUSTER over VACUUM FULL as my go-to, because at least then the table is clustered at least once. Note that CLUSTER in Postgres is a one-time operation: it takes a heavyweight lock while it runs, and the ordering is not maintained afterwards, so be aware of that.
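(A sketch of the two rewrite options just mentioned; the table and index names are placeholders, and both commands lock the table exclusively while they run:)

    -- rewrite the table in the order of one particular index
    CLUSTER t1 USING t1_created_idx;
    -- or rewrite it without imposing any ordering
    VACUUM FULL t1;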
All right, let's talk about how Postgres gets data, because this all folds into how query plans work. Whenever we get data out of the database we have to get the information from disk. One approach is a sequential scan: it basically starts reading at the beginning of the table and goes all the way through to the end, unless there's a LIMIT clause, but generally speaking you go through every single record. That's represented by a Seq Scan node, and it's great for bulk operations. If you look at query plans a lot you'll also see things like a bitmap heap scan, which is similar to a sequential scan in the way it works: it starts at the beginning and goes to the end, but it only visits the pages that the index identified as potentially containing matches. Bitmap heap scans can be really helpful; some people know it as a skipping scan, where you skip over the set of pages that aren't interesting to you.

The other approach is to use an index: we scan through the index, which is represented by an Index Scan node, find the specific entries we need, and then go back over to the heap and pull out the data we want. This often ends up being the forgotten piece: you didn't create the index you need, and that's one of the reasons queries end up slow, because without the index we're doing sequential scans across the table to pull out the information you're looking for. One of the other nice things about an index is that the data can come back in order, which can be really helpful for certain types of plans, like the merge join we'll talk about in a minute.

Another really cool thing Postgres has is index-only scans. One thing to realize is that index-only scans are actually "index mostly" scans, because there's a visibility map, maintained by vacuum and autovacuum, which tracks which pages inside the table are marked all visible. If the column information we want is in the index and the page is marked all visible, we don't have to visit the heap at all: we know the tuple is valid and we can return it directly to you. That's an index-only scan. It does mean you have to have the columns you need in the index, and you must be running vacuum, and there are still cases where we have to visit the heap, because not all of the tuples on a page are visible and we have to go figure out which ones are and aren't for the particular index entry we're looking at. Does that make sense to everybody? So those are the ways we can get information off the disk in Postgres.
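(A small illustration of the difference, on a hypothetical table t1 filtered by id; the plan you actually get depends on your data and statistics:)

    -- with no usable index, expect a Seq Scan over the whole table
    EXPLAIN SELECT * FROM t1 WHERE id = 42;

    -- add an index and refresh statistics; the plan can switch to an
    -- Index Scan, or an Index Only Scan if every referenced column is
    -- in the index and the visibility map is reasonably current
    CREATE INDEX t1_id_idx ON t1 (id);
    ANALYZE t1;
    EXPLAIN SELECT id FROM t1 WHERE id = 42;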
All right, now let's talk about joins, how we put these things together, and we have a couple of different ways of doing that. A nested loop join basically means we step through one table, and for every record we get from it we go look up the matching record in the other table. This can be really fast for small data sets, but it tends to be bad for large ones, because we're continually doing that lookup, and even an index lookup adds up to a lot of time. Merge join is our second join type: we take both tables being joined and sort them, or do an in-order index traversal, depending on the plan, and then walk through them together, combining the rows that match. It can be pretty good for bulk operations, but sorting is expensive, and even an in-order index traversal can be expensive, so those are things to look out for. Hash join is the last one, and it's kind of my favorite go-to, what I really like a lot of my joins to be doing. It scans one table, typically the smaller one, though I've seen it go both ways, builds a hash table from that scan, and then steps through the other table looking up entries in the hash table. That means it has a bit of a slow start: one of the things a merge join with an in-order index traversal is really good at is when you only want a few records back, a top 10, say, especially if we can use the indexes to pull that out and join the two tables; that's glorious, that's really fast. A hash join isn't good for that kind of approach, because we're going to build a hash table of one entire table first, and only then step through the other one. So there are cases where it's not as good as a merge join or a nested loop join, but it's really great for bulk work. Any questions about the different join types?

All right, aggregates. When we want to pull things together, a summation or an average or something, we have basically two kinds of aggregate. A group aggregate means we sort all the data, then walk through it, and whenever the group-by keys match, those records get combined; once we've combined all the records for a group and moved on to the next one, we output it. This can be a case where a short circuit is possible if there's a LIMIT involved, but generally speaking sorting is expensive as usual, and an in-order index traversal is expensive as usual. So a lot of the time what you actually want is a hash aggregate. The problem with a hash aggregate is that it's memory intensive; otherwise it's the same concept as a hash join: we scan the table building a hash table as we go, and whenever we find entries that match, we combine them.

So what's the best plan? It all depends. The database tries to figure out the best plan by gathering statistics, using ANALYZE or VACUUM ANALYZE; you can look at what those statistics are inside the pg_statistic table, and autovacuum also gathers the statistical information you need. Bad stats equals bad plans. So you'll want to look at EXPLAIN ANALYZE results and check the actual results against what Postgres estimated, and in many cases you may want to increase your statistics target, or do something else to figure out why you're getting bad stats, why Postgres is getting the wrong answer for how many rows or tuples are expected to come back.
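(One hedged way to compare the planner's estimates against reality, along those lines; the table and column are placeholders:)

    -- EXPLAIN ANALYZE runs the query and reports estimated vs. actual
    -- rows for each plan node; big mismatches usually mean stale or
    -- too-coarse statistics
    EXPLAIN ANALYZE SELECT * FROM t1 WHERE status = 'active';

    -- refresh the stats, or raise the sample size for a skewed column
    ANALYZE t1;
    ALTER TABLE t1 ALTER COLUMN status SET STATISTICS 1000;
    ANALYZE t1;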
Automating the collection of those plans is really important as well, and auto_explain is a really helpful module for that. It gets you the plans of the queries being run, and like log_min_duration_statement it's based on how long a query took: it logs the EXPLAIN output for queries that took longer than a threshold. It's enabled through shared_preload_libraries as well, which means you have to restart Postgres, and then you can tell it to log anything that took more than, say, 50 milliseconds (auto_explain.log_min_duration), and you can also tell it to log nested statements (auto_explain.log_nested_statements). Nested statements are statements that run inside another statement, and where that happens is something like PL/pgSQL: if you have stored procedures, the queries those stored procedures run are nested underneath the query that called the stored procedure, and you can get the EXPLAIN information for them using auto_explain. That's actually one of the best use cases I have for auto_explain, logging those nested statements, because otherwise it can be a real pain to get that information out. Any questions about any of that? (Question: where does auto_explain output go?) Yes, auto_explain outputs to the log, to the Postgres log.

All right, analyzing plans. EXPLAIN provides a few different ways of getting the information out: you can get the output as XML, JSON, or YAML, and there are a couple of tools that are useful for analyzing it. If you like graphical stuff, pgAdmin III, and I guess pgAdmin 4 as well, can take the query, run EXPLAIN on it, and show you the plan. And explain.depesz.com is fantastic; it's a really helpful tool for looking at EXPLAIN output.

So now let's talk about what we're going to do about it, because there's a fair amount of low-hanging fruit. If you have a sequential scan happening, you're only getting one record back, and you're not doing any kind of aggregation, you probably want an index. And if you don't have an index, and maybe you can't have one for some reason, think about whether there's a way to constrain the information with a conditional that can use an index.

The next one: I talked about the things I like, I like hash joins and hash aggregates, and what Postgres needs to make those happen is work_mem. If you have a small data set and you see sorting happening, or a merge join being used, what you can do is increase work_mem. Increasing work_mem means Postgres has the opportunity to consider a hash join or a hash aggregate; if work_mem is too low it won't even consider them, because it knows the hash table is going to be a certain size and it won't even include that as a candidate plan. So that's something to really look at. One other really cool thing with Postgres: you can set work_mem on a connection, and change it during that connection's lifetime from the application, and setting it on one connection won't impact the others. So if you have a particular query that you know needs a lot of work_mem, you can increase work_mem just for that query, then reset it, and off you go.
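(A sketch of that per-connection adjustment; the value is illustrative, not a recommendation:)

    -- raise work_mem for just this session, run the query that needs the
    -- hash join or hash aggregate, then put it back
    SET work_mem = '256MB';
    -- ... run the big query here ...
    RESET work_mem;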
Another thing I run into is statistics. If you have a large data set and you're seeing a nested loop happening, make sure your statistics are current: make sure you've run ANALYZE against it. What I see happening a lot is that if you've never run ANALYZE on a table, Postgres just has a default idea of how big the table is and what's in it, and that default is often wrong. So if you've just created or just loaded a big data set into a table, make sure you ANALYZE it before you go on to do something with it, because otherwise you can get absolutely terrible plans.

Another case is deletes that are really slow when foreign keys are involved. Postgres doesn't require the referencing side of a foreign key to have an index; the referenced table has to have one, but the referencing table doesn't. If the table referencing the foreign key doesn't have an index on that column and you're doing deletes on the parent, the referred-to table, Postgres has to sequentially scan the referencing table to make sure there aren't any rows referencing the value you're trying to delete. So that's another really important thing: make sure you create indexes on the referring table's side for those cases where you know you'll be deleting records from the referred-to table.

Prepared queries are really helpful, although there are some caveats. Prepared queries basically mean plan once, run many times, which avoids the cost of repeatedly planning. Now, Postgres is a little bit tricky here: it has generic and specific plans. Initially it's going to use both: it has a generic plan it created, but for the first five executions it checks whether a specific plan generated for the particular values being passed in is better or not. Once you're past those first five times, if the generic plan always won, or was within some threshold if I remember correctly, it goes with the generic plan from then on. But if the specific plans kept winning, and that happens a lot when constraint exclusion is involved, then Postgres keeps using the more specific plans, which take a little extra planning time but are usually worth it, because the overall query ends up better. And here's how you do EXPLAIN and EXPLAIN ANALYZE with a prepared statement, if you're not familiar with it: you put the EXPLAIN, or EXPLAIN ANALYZE, on the EXECUTE.
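(What that looks like, with a made-up prepared statement; note the EXPLAIN goes on the EXECUTE, not on the PREPARE:)

    PREPARE get_account (int) AS
        SELECT * FROM accounts WHERE account_id = $1;

    EXPLAIN ANALYZE EXECUTE get_account(42);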
All right, let's talk about some specific queries. SELECT count(*) from a table: a lot of people like to do this, and I generally don't recommend it, but if you have to do it, you have to do it. What people don't always realize is that indexes can actually help with this now, because we can do an index-only scan. It doesn't matter which column is in the index; as long as there's at least one index on the table, we can walk the index, check the visibility information from the visibility map, and compute the count(*) result much faster than scanning the table itself, because the index is a lot smaller than the table. So if you have a count(*) kind of requirement, keep that in mind and check whether Postgres is using an index-only scan there, and again make sure your visibility map is kept up to date, because if it isn't we'll still scan the heap, or have to go back to the heap a lot, which can actually end up being worse.

Another one is TOAST. If you're not familiar with it, TOAST is how Postgres compresses a large value and stores it out of line, so that the pages of the regular table don't have this huge block of stuff in them. What that means is that if you have a table with five small columns and one really big column, say a document in that last column, don't use SELECT *, because SELECT * is going to go get all of that information for all of those columns even if you aren't ultimately going to use it. Instead, select the individual columns you want: if Postgres doesn't need that TOASTed value, if it doesn't need to return that column to you, it won't go decompress it, it won't even go look for it, and that can make a pretty significant difference when pulling data out. So don't use SELECT * if you can avoid it; even with regular columns, pulling back all that data means transmitting more information, but where I find it really matters is when TOAST is involved.

Another one: watch out for comma joins. I really, really hate comma joins, because you can end up with missing join conditions. Here's a case, and this is a form of query I have seen before, and it's scary: somebody does a SELECT DISTINCT * from a bunch of tables, has a bunch of conditions in the WHERE clause, and has forgotten one; they forgot the join condition between table C and the rest of them, so we end up generating a Cartesian product, and then the DISTINCT was thrown in by the developer because "I don't want all these duplicates, I'll use DISTINCT to trim it back down." It's terrible, but I see it a lot. In fact, generally speaking, if you see a SELECT DISTINCT, I usually say those are queries to go look at and analyze whether the DISTINCT is really needed; often somebody forgot a join condition or did something else that wasn't intended. So those are things to watch out for. I strongly recommend using the JOIN syntax, JOIN ... USING or JOIN ... ON, because it forces you to remember the join conditions between all the joins, which makes it a lot less likely to forget one.
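(A sketch of that rewrite, with made-up tables; the point is that the explicit JOIN form makes a forgotten join condition much harder to miss:)

    -- comma-join form: forgetting one condition silently produces a
    -- Cartesian product, often "fixed" by adding DISTINCT
    SELECT DISTINCT * FROM a, b, c WHERE a.id = b.a_id;

    -- explicit-join form: every join carries its own condition
    SELECT * FROM a
      JOIN b ON b.a_id = a.id
      JOIN c ON c.b_id = b.id;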
Any questions about any of that? (Question: how does the optimizer handle that count(*) in an inheritance environment?) We should still be able to do an index-only scan on the individual child tables, but you have to create an index on the child tables; then the count(*) can use the indexes on each individual one to do the counts and bring it all back together. That's one way to do it; there are some other ways, and I'll talk about that in a minute.

Here's another one I see people doing: using IN. The case where I generally don't recommend IN is when you could turn it into a join instead. A lot of the time it can be written as a join, and that usually gives Postgres more options for how to execute the query, and often a faster way of executing it. Similarly, if you're using NOT IN, it's usually better to use NOT EXISTS with a correlated subquery if you can, because Postgres can turn that into what's called an anti-join. The other way to write a NOT IN, or a NOT EXISTS for that matter, is to rewrite it as a LEFT JOIN where the other side is NULL, depending on what you're trying to do; that's generally a better form than NOT IN, because with NOT IN Postgres will go run whatever is inside the NOT IN, return it all, and then you have this big lump of stuff that isn't indexed and has to be scanned every time we want to check something. (Comment from the audience.) Yes, it can, absolutely: the point being made is that this does depend on the table sizes involved. If you have a very small table for the set of IDs, it's perfectly reasonable to use IN or NOT IN in a lot of cases, whereas rewriting it as a join might not be that much more performant, although I find the join tends to be pretty good anyway. (Another comment: NOT IN should never be used.) David also agrees, so I think we have a quorum there: don't use NOT IN. I generally agree with that, although some people might object. And yes, that's also true: NULL handling is another reason to avoid NOT IN, because you can end up with the wrong result coming back if there are NULLs involved.
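(A sketch of the two rewrites being described, with placeholder tables where orders references customers:)

    -- often slow, and surprising when orders.customer_id can be NULL:
    SELECT * FROM customers
     WHERE id NOT IN (SELECT customer_id FROM orders);

    -- correlated NOT EXISTS: Postgres can execute this as an anti-join
    SELECT * FROM customers c
     WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

    -- equivalent LEFT JOIN ... IS NULL form
    SELECT c.* FROM customers c
      LEFT JOIN orders o ON o.customer_id = c.id
     WHERE o.customer_id IS NULL;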
Any other questions about that? Okay, moving on; I've got about four minutes left. CTEs. There are some rules of thumb around using CTEs, and one thing to consider is that you generally want to keep the result of the CTE small. If you're not familiar with CTEs, it's the WITH construct: a common table expression is what you're using whenever you see WITH ... AS. One of the things you want to do is make sure the result coming back from the CTE is relatively small, because Postgres will actually generate all of those records and then use them wherever the CTE is referenced later, and if that's a large set it can end up being a bad idea. So put your expensive stuff into the CTE, hopefully limiting it down to a subset of the data, and do your post-processing, or the join against that one big table, later in the main part of the query; use the CTE to build up to that point. I find that happens quite often and is quite useful to do that way. Not only that, but CTEs let you reuse the same result set over and over again: if you have some big expensive join, you can build the CTE result from it once and then do multiple things with that data set in the outer part of the query, which ends up more efficient than rerunning an expensive subselect over and over, which is what would happen if you wrote those as individual subselects.

If you really want a fast count(*) and you just want to know roughly how big a table is, you can look in pg_class. This is based on statistics, so it's not something you can trust to be exact, and it's only good for the entire table, but pg_class has a column called reltuples, so if you want a rough estimate of how big the table is, use pg_class.reltuples. Another approach is triggers: if you actually need an accurate result, you can offload the cost of the count(*) from query time onto your write load, the time it takes to add things, by using a trigger on the table that increments or decrements a value in a side table that you use to keep track of the count. That's the approach I tend to recommend if you must have an accurate answer and can't afford a count(*) at run time.

All right, a bit of a review. Increase work_mem; increase maintenance_work_mem. maintenance_work_mem is used by vacuum and also by index creation, the maintenance operations, so it's useful to have it higher for those; regular work_mem is for running queries, so it depends on what you're trying to improve. effective_cache_size is an important one to make sure you set correctly. A lot of people want to increase shared_buffers, and that's often the right answer, although Keith over here has a couple of fantastic blog entries showing different cases: sometimes it's better to increase it, sometimes it's better to decrease it, it really depends on what your workload ends up being.

Partial indexes and functional indexes are fantastic ways to deal with funny values in your table. If you've got a column that's NULL 90% of the time, you probably don't ever care about the NULLs, so create a partial index that doesn't index them. If you do that, then when you query the rest of the table, where there actually is a value, it's really fast, because the index is way smaller, since we didn't have to index all those other tuples. The other case is functional indexes: you can create an index over a function, so what you're actually indexing is the result of that function, and if that's something you use in your queries a lot, having an index on it can be really handy. There are caveats: the function must be immutable; you can lie to Postgres sometimes, if you know what you're doing, and tell it a function is immutable when it really isn't, but that's not something I'd recommend without really understanding what you're doing. And do make sure you double-check that your query plan actually uses the index. What I've seen is people create a functional index and then run the query with something slightly different, and somehow the type checking or something in Postgres decides it's not quite the same expression and can't use it, so the index you went to all that effort to create never gets used, which sucks. So make sure you test that it actually is being used.
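(A sketch of both index types on a made-up table; remember to verify with EXPLAIN that your real queries actually use them:)

    -- partial index: skip the ~90% of rows where the column is NULL
    CREATE INDEX t1_val_idx ON t1 (val) WHERE val IS NOT NULL;

    -- functional (expression) index: the function must be immutable,
    -- and queries have to use the same expression to match it
    CREATE INDEX t1_lower_email_idx ON t1 (lower(email));
    -- e.g. SELECT ... FROM t1 WHERE lower(email) = 'someone@example.com';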
Another big one, particularly if your writes are slow, is to remove any unused indexes. One of the statistics we gather is how many times an index has been used; that's tracked in pg_stat_user_indexes. If an index isn't being used at all, and you don't need it for a constraint or something like a primary key, drop it: it's just expensive, costing you time on every one of your writes.

All right, that's what I've got. Any more questions? (Question: UNION versus UNION ALL?) That's a great point: a lot of times people use UNION to combine result sets, but if you know those result sets are going to be disjoint, or you don't care about any duplicates that show up, you can use UNION ALL instead. (Question about using a UNION as an optimization fence.) If you're looking for an optimization fence, you can do that with a UNION, but I would generally use a CTE instead. Yes, it can also be an accidental optimization fence, that's true, and that's another thing to point out about CTEs: because they're an optimization fence, Postgres will optimize just that CTE, and then the next CTE, but it won't plan the whole thing together for you, so there can be cases where that ends up being slower; it means we can't create as good a plan as we could if you used subselects instead of a CTE, precisely because CTEs are an optimization fence. (Question: could we do for CTEs what foreign data wrappers do, pushing things down to the other side?) We could, but we don't actually even need to do that: we could just rewrite the entire thing, essentially as subqueries, or there are other things we could do; it's intentionally made an optimization fence. There's been some question about whether we should give people the option, making it configurable whether a given CTE acts as an optimization fence or not, but I think that really calls for use cases and examples of where that's necessary and where it's not. I have to go run the lightning talks in five minutes, so if you do have other questions please come find me, I'm happy to chat about Postgres, and I will definitely be at the reception. Thank you all very much.
Info
Channel: Postgres Open
Views: 19,558
Rating: 4.9346404 out of 5
Id: yhOkob2PQFQ
Length: 54min 36sec (3276 seconds)
Published: Wed Oct 12 2016