Scaling Postgres Episode 177 Planner Workarounds | Grouping Sets | Lateral Join | Not-So-Easy

Captions
In this episode of Scaling Postgres, we talk about planner workarounds, grouping sets, lateral joins, and not-so-easy pieces. I'm Creston Jameson, and this is Scaling Postgres episode 177. [Music]

Alright, I hope you, your friends, family, and co-workers continue to do well. Before we get to our first piece of content, I wanted to let you know about an upcoming live show we're doing this Wednesday at 8pm Eastern for the Rubber Duck Dev Show. It will be focused on database performance. We're not going to be talking exclusively about Postgres, but that will probably be a significant part of it given my experience, so if you're interested in attending this live show, feel free to stop by the Rubber Duck Dev Show this Wednesday at 8pm Eastern and click on either one of these links.

Our first piece of content is "Working around a case where the Postgres planner is 'not very smart'". This is from heap.io. In this post, what they were trying to do is get an index-only scan working. An index-only scan uses only the index to retrieve all the information the query needs. With a typical index scan, you scan the index, find the record of interest, and then go to the heap to get the remainder of the data. But if exactly what you're querying is in the index, Postgres can do an index-only scan: if you have an index on an id and all you're asking for in the SELECT is that id, it can just look in the index and never visit the heap. That gives you a speed boost because you skip the trip to the heap entirely.

In the example they're looking at, they had an events table with a time column, a user_id column, and a data JSONB column, and typically they would have an index on time and user_id to pull back data. But they really wanted this specific query to use an index-only scan. In the WHERE clause they're filtering on particular values of a type key within the JSON, within a particular time range, and they're also pulling certain columns in the SELECT clause. So they created an index on the data type expression and on time, and then included the columns used in the SELECT, so that additional payload is stored in the index to avoid having to go to the heap.

The only problem was that it wasn't working. The issue they discovered is even called out in the documentation: "PostgreSQL's planner is currently not very smart about such cases", meaning cases where particular operators or functions are applied to the indexed data. The planner couldn't treat the expression with this operator as sufficient to avoid visiting the heap. So what did they do as a workaround? They used a partial index. Instead of indexing on the JSONB type expression, they indexed only on time, but used the index's WHERE clause to specifically list the types they're looking for, so it only indexes rows whose type is click, change, or touch, while still including the SELECTed columns. That was sufficient to get an index-only scan, and as a result they got about double the performance.
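To make the shape of the two approaches concrete, here is a minimal sketch based on the table described in the article; the column types, the INCLUDE payload, and the exact index names are assumptions for illustration, not the article's literal definitions.

CREATE TABLE events (
    time    timestamptz NOT NULL,
    user_id bigint      NOT NULL,
    data    jsonb       NOT NULL
);

-- The original attempt: an expression index on the JSONB type plus time,
-- carrying the SELECTed columns along via INCLUDE. The planner still would
-- not produce an index-only scan for the operator-based filter.
CREATE INDEX events_type_time_idx
    ON events ((data ->> 'type'), time)
    INCLUDE (user_id);                      -- included columns are illustrative

-- The workaround: a partial index keyed only on time, with the event types
-- pinned down in the index's WHERE clause instead.
CREATE INDEX events_click_change_touch_idx
    ON events (time)
    INCLUDE (user_id)
    WHERE data ->> 'type' IN ('click', 'change', 'touch');

A query filtering on those same types and a time range, selecting only indexed and included columns, can then be satisfied with an index-only scan (assuming the table is well vacuumed so the visibility map is current).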
Now, I typically use these kinds of partial indexes for general performance improvements, because they're a way to eke out a little bit more performance, although you may have to deal with multiple indexes: if you typically query certain types, you could need an index per type of query you run. Still, this is an interesting workaround when you're trying to get index-only scans, so if you want to learn more, go ahead and check out this blog post.

The next piece of content is "PostgreSQL GROUPING SETS, ROLLUP and CUBE". This is from cybertec-postgresql.com, and they're talking about grouping sets. They have an example table with a country, a product name, the year, and the amount sold, and in this scenario they're testing analytical queries. When you select the country and a sum of the amount sold from this table and group by country, you get output per country with its total. You can also do it per country and product name for the total amount sold, and you can even use a CASE statement to group rows differently, for example all of the USA together and everything non-USA together.

But you can also pull this kind of data using grouping sets. In their initial example, you can break up what you're pulling by country and by product using just one SQL statement, which is more efficient than trying to do a UNION. With GROUPING SETS, the first set is defined as the country (1 refers to the first column, 2 to the second), and the second grouping set is by product name, so you simply list one set after the other. You could get the equivalent by doing a UNION ALL, but it's not going to be as efficient as using grouping sets.

The next thing to look at is ROLLUP, which rolls up totals at each level. Here they GROUP BY ROLLUP on country and then product, and you'll see totals now appearing per country: a total for Argentina, a total for Germany, a total for the USA, and then the overall sum across all countries and products. They also note that sometimes you don't want a NULL to appear for those rollup rows, or you want it to say something different, so in that example they put the original query in a subquery and used a CASE statement to replace the NULLs with "total". Now you see the Argentina total, the Germany total, and essentially the grand total for everything. Again, the advantage of these grouping sets, or grouping by a rollup, is that all the calculations happen in one pass; it doesn't require the multiple passes you would need if you tried to do the same thing with UNIONs.

Then they took a look at CUBE, which produces all the different permutations of calculations: GROUP BY CUBE (country, product) gives you each country and its totals, and also the hats and shoes totals irrespective of country. The last thing they mention is that when you run EXPLAIN on grouping sets, you can see how the query plan differs from just using UNIONs. So the advantage of GROUPING SETS, ROLLUP, or CUBE is that all of these calculations are done in one pass.
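Here is a minimal sketch of the three variants, assuming a simplified sales table along the lines of the one in the article; the table and column names are illustrative rather than the article's exact schema.

CREATE TABLE t_sales (
    country      text,
    product_name text,
    year         int,
    amount_sold  numeric
);

-- One statement, one pass: totals by country and, separately, by product
SELECT country, product_name, sum(amount_sold)
FROM t_sales
GROUP BY GROUPING SETS ((country), (product_name));

-- ROLLUP adds per-country subtotals and a grand total (shown as NULL rows)
SELECT country, product_name, sum(amount_sold)
FROM t_sales
GROUP BY ROLLUP (country, product_name);

-- CUBE produces every combination: per country, per product, and overall
SELECT country, product_name, sum(amount_sold)
FROM t_sales
GROUP BY CUBE (country, product_name);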
If you're interested in that, you can check out this blog post.

The next piece of content is "A simple example of LATERAL use". This is from fluca1978.github.io, and they're talking about using lateral joins. It came from an example where someone was trying to select events from a table that were no more than 10 minutes from one another: looking at an event, which events were around it within a 10-minute interval. He generated a table with events listed every two minutes, using generate_series to create a hundred of them, and then used a query with a lateral join to show the events around each one.

As a reminder of what LATERAL does: for every row you're pulling, it queries other rows, so you can think of it as a nested loop. The first row, 501, queries the event table again (or whatever table you designate) to look for other rows. So for each row of the table being selected, the lateral subquery runs, selecting the other ids and event names and calculating a time lapse, that is, the time difference between the events, where the primary key is not the same one you're on (so you don't pull 501 for the 501 event, for example) and where the time difference between them is less than or equal to 10 minutes. For the first event, 501, it pulls out five events, at two-minute intervals up to and including 10 minutes from that event. That's essentially how a lateral join can answer this question. Now, in this first query he wasn't using the exact LATERAL join syntax, but he does show a version with it further down in the post.
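Here is a minimal sketch of that kind of query, assuming a hypothetical events table populated every two minutes; the table name, column names, and interval math are illustrative rather than the post's exact code.

CREATE TABLE events (
    pk         serial PRIMARY KEY,
    event_name text,
    ts         timestamptz
);

-- One event every two minutes, one hundred of them
INSERT INTO events (event_name, ts)
SELECT 'event #' || n, now() + (n * interval '2 minutes')
FROM generate_series(1, 100) AS n;

-- For each outer event, the lateral subquery runs once (like a nested loop)
-- and pulls the other events that fall within 10 minutes of it.
SELECT e.pk, e.ts, nearby.pk AS nearby_pk, nearby.time_lapse
FROM events e
JOIN LATERAL (
    SELECT o.pk, o.ts - e.ts AS time_lapse
    FROM events o
    WHERE o.pk <> e.pk
      AND o.ts BETWEEN e.ts - interval '10 minutes'
                   AND e.ts + interval '10 minutes'
) AS nearby ON true
ORDER BY e.pk, nearby.time_lapse;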
If you're interested in learning more about lateral joins, you can check out this blog post.

The next piece of content is "PostgreSQL: Six Not-So-Easy Pieces". This is from pgdash.io, and they're talking about six areas that can become a challenge when you're managing Postgres.

The first is connection management. Postgres connections have a one-to-one correlation with processes: a lot of other software uses threaded connections, whereas Postgres creates a new process for each new connection, which is a heavy thing to do. Because of that, a lot of people use a connection pooling solution. That can exist within your application framework, or you can implement a third-party pooler such as PgBouncer or Pgpool, which lets you really ramp up the number of connections Postgres can support while still operating efficiently.

The second area they discuss is zero-downtime upgrades, because the upgrade process essentially always requires downtime; the only path that doesn't is logical replication, and that has caveats you need to take into account. For a minor version upgrade you basically still need to restart the system to load the new binaries, which briefly brings it down. A major version upgrade can take a while: for a smaller database you can do a pg_dump and pg_restore into the new version, and for a larger database you would probably want to use pg_upgrade. They say that in either case the database would be down for a considerable amount of time, but that's not my experience. If you use the link mode of pg_upgrade, then depending on the size of the database you can do that type of upgrade in a matter of seconds to, at most, minutes. They also say logical replication is a way to upgrade with zero downtime, but there are risks inherent in that, and you have to check a lot of things to make sure everything is good before you do the transition.

The next area they discuss is high availability. There's nothing out of the box with Postgres that gives you high availability; it has a lot of the building blocks, but not an all-encompassing solution, so there are a lot of third-party solutions that do that for you. The main ones they call out here are pg_auto_failover and Patroni, so you can definitely check those out if you're looking for a high availability solution.

The next area they mention is bloat management. This is the tendency for Postgres to bloat its data files, because whenever a delete or an update happens, it doesn't actually remove the row or update it in place; it marks the old version for removal and cleans it up later, since it needs to keep those old versions around for Postgres's concurrency control. As a consequence, vacuum needs to clean them up, and there's a tendency for bloat to develop in tables and indexes. REINDEX CONCURRENTLY is a great way to rebuild your indexes and keep them efficient, but for tables it's a little more inconvenient; there's no great built-in way to handle it, but pg_repack can re-compact tables live and online, as long as you have sufficient disk space.

The next area they talk about is query plan management, which is basically monitoring what's going on in terms of queries in the system. pg_stat_activity gives you immediate insight into which queries are actively running, while pg_stat_statements, an extension you can install with Postgres, is a great way to look at statements over time and how many database resources each query uses, so you can then optimize them. There's also auto_explain, which logs the plans of queries that take too long so you can see what their parameters were. With both of these together, you should be able to optimize your queries for Postgres.

And the last area they talk about is tuning, which is configuring Postgres to be as efficient as possible for your workload. There are hundreds of configuration options to adjust, but there's really a minimal set that can get you started, and then you can tweak from there; it is a bit overwhelming when you're getting started if you really want to get into the weeds with tuning. This blog post lists the six things you need to be aware of and gives you some clues as to where to look to resolve some of these potential issues.

The next piece of content is "Logical decoding of two-phase commits in PostgreSQL 14". This is from postgresql.fastware.com, and they're talking about a new feature coming in Postgres 14 that supports logical decoding of two-phase commits. Two-phase commits are, I would say, more of a rarity, and relatively few people use them, because their use case is two different database systems that are not direct replicas; basically it's more of a multi-master scenario. It's not a physical replica you would set up with streaming replication, nor even just logical replication: two database systems are both acting as masters, and in order to coordinate transactions between one another they use two-phase commit to ensure that neither database loses transactions. They talk about that here and go into depth about it. Two-phase commit was not previously supported by logical decoding, but they've put the decoding side into Postgres 14, and they're planning for the output plugin, pgoutput, to be upgraded in Postgres 15. So some of the feature will be ready in 14, and they hope to have the whole thing complete in Postgres 15.
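As a refresher on what a two-phase commit looks like at the SQL level, here is a minimal sketch; the table and transaction identifier are hypothetical, and max_prepared_transactions must be set above zero for PREPARE TRANSACTION to work. The coordination between the two databases lives in the application or middleware, not in these commands, and the Postgres 14 feature discussed here is about logically decoding them, not about the commands themselves.

-- On each participating database, phase one makes the work durable
-- without making it visible to other sessions:
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 100);   -- hypothetical table
PREPARE TRANSACTION 'txn_0001';

-- Once every participant has prepared successfully, phase two finalizes it:
COMMIT PREPARED 'txn_0001';

-- If any participant failed to prepare, the coordinator instead issues:
-- ROLLBACK PREPARED 'txn_0001';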
So if you're interested in two-phase commits and logical replication, you may want to check out this blog post.

The next piece of content is "Using cert-manager to deploy TLS for Postgres on Kubernetes". This is from crunchydata.com, and they're talking about the case where you want to run Postgres on Kubernetes, in their example using the Postgres Operator from Crunchy Data, and you want to use your own certificates and run your own certificate manager. Specifically, they're looking at cert-manager, an open-source certificate management solution, and how you can use it to generate certificates for the Postgres Operator that Crunchy Data maintains. If you're interested in that, you can check out this blog post.

The next piece of content: the PostgreSQL person of the week is Bryn Llewellyn, so if you're interested in learning more about Bryn and his contributions to Postgres, you can check out this blog post.

And the last piece of content: we had another episode of the Rubber Duck Dev Show, so if you're a software developer and want to learn more about writing secure code, you can check out our episode on that in the link provided. And as I mentioned, if you're interested in learning more about database performance, our upcoming episode this Wednesday at 8pm Eastern is on database performance.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks. [Music]
Info
Channel: Scaling Postgres
Views: 207
Rating: 5 out of 5
Keywords: PostgreSQL, Scaling PostgreSQL, PostgreSQL Performance
Id: 8uODRoMGxGc
Length: 16min 29sec (989 seconds)
Published: Sun Aug 08 2021