Scaling Postgres Episode 181 Connection Scaling | Sequence Cache | Message Queuing | Subtransactions

Video Statistics and Information

Captions
In this episode of Scaling Postgres, we talk about connection scaling, cached sequences, message queuing and harmful subtransactions. I'm Creston Jamison, and this is Scaling Postgres episode 181. [Music]

All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "When less is more: database connection scaling". This is from richyen.com, and he's talking about the scenario where you have a max_connections setting in Postgres and the people who manage the database sometimes want to set it very high, even though you may not be using all of those connections. The thing to keep in mind is that Postgres is process based. Even on the fairly big machine he describes, with 128 CPU cores (possibly hyper-threaded), there is a limit to how much work can be done at any given time, so allowing ten, twenty or thirty thousand max connections doesn't make a lot of sense because you won't be able to use them. But there's a problem even if those connections are all idle, and that's what he mentions in this section here: there is still work that needs to be done by the shared components of Postgres to maintain the fact that these connections could potentially be used. This was demonstrated in a great post a number of months ago that is linked in his conclusion, and I highly suggest you check out that post if you haven't already. It talks about improving the scalability of Postgres connections and shows that when you have a high number of unutilized connections, performance drops as you scale up those idle connections, and this post essentially recreates that test. In his example he was reaching about 1,500 transactions per second with a very low number of idle connections, and as he increased the number of connections you can see the transactions per second of the active workload start to decline; somewhere between 10,000 and 15,000 connections the throughput dropped to about a third of what it was with fewer connections. This was mitigated somewhat by removing autovacuum from the picture, but it eventually caught up as well. So the key takeaway is: don't set max_connections to more than you need, because you're potentially going to run into this problem. That's another reason to check out the linked post, because improvements have been made to Postgres, coming in version 14, and in future versions they hope to push the number of supported connections out even further and alleviate some of these problems. And of course you can always rely on a connection pooler such as PgBouncer to support many thousands of client connections without having to run too many max connections on your actual database system. But if you want to learn more, you can check out this blog post.
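As a quick way to see how many of your allowed connections are actually doing work versus sitting idle, you can group pg_stat_activity by state. This is just a minimal sketch using standard catalog views, not something taken from the post:

    -- How many client backends are active vs. idle, compared to max_connections
    SELECT state, count(*)
    FROM pg_stat_activity
    WHERE backend_type = 'client backend'
    GROUP BY state
    ORDER BY count(*) DESC;

    SHOW max_connections;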
The next piece of content is "UUID or cached sequences?". This is from dev.to in the Yugabyte section, and he's talking about how people decide whether to use a UUID or an integer backed by a sequence as the primary key for their tables. A lot of people choose UUIDs for, as he says, scalability, but he makes the argument that integers with sequences can be pretty darn scalable as well, particularly if you use bigints for the integers, and because sequences have the ability to cache values. What that means is that a session can store a certain number of sequence values in memory, not just one: when you request a value, the sequence doesn't hand you a single number, it gives you, say, the next 50. In his case he's advocating essentially two bytes' worth, around 32,000 sequence numbers per request. These are all cached in memory for that particular session, so it can very quickly grab the next one without asking the central sequence generator. He shows how to implement this and has an example with Python code where you can see the first thread pulls those 32,000 sequence values, thread four starts at around 32,000, thread five starts at about 65,000, thread two is at around 98,000 and thread three at around 131,000. So each thread is pulling two bytes' worth of sequence numbers and doesn't have to request another block from the central sequence generator until the range it initially retrieved is exhausted. Of course, one downside is that if your session gets disconnected, you won't be able to use those cached sequence numbers again; you would have to do another sequence request and get another 32,000. So if you're constantly connecting and disconnecting, this won't work as a great solution, but if you have long-running connections to the database, using some sort of connection pooler, this could be a way to get scalability with sequences. Now, in my database applications I tend to use integers for my primary keys, and I only use UUIDs when a unique ID is generated within some application, like a JavaScript front end; that's the only time I would use a UUID, basically for an ID generated outside of the database. But if you want to learn more about this, you can check out this post.
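For reference, a minimal sketch of a cached sequence along those lines; the sequence and table names are illustrative and not taken from the post:

    -- Each session that calls nextval() reserves a block of ~32k values in memory,
    -- so most ID assignments never touch the shared sequence object.
    CREATE SEQUENCE order_id_seq CACHE 32767;

    CREATE TABLE orders (
        id         bigint PRIMARY KEY DEFAULT nextval('order_id_seq'),
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- The first nextval() in a session grabs the whole cached block;
    -- values left in an unused block are simply skipped if the session disconnects.
    SELECT nextval('order_id_seq');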
The next piece of content is "Devious SQL: message queuing using native PostgreSQL". This is from crunchydata.com, and they're talking about setting up a message queuing system within PostgreSQL. Basically, they want to capture some input stored in a table, have some processor or worker pull off that work without conflicting with other workers, and then delete it. Their general requirements are: a table to hold events or items to be processed, something to enqueue or put the items in the table, something to dequeue and consume those items, and to do so without locking, so that if you have multiple workers dequeuing the work to be done, they don't step on each other's toes. To set up the table, they create a basic queue table with an ID as the primary key, a timestamp, and a payload to contain what needs to be processed. Now, to select which rows are going to be processed by each worker, they came up with a command that selects rows from the queue table, limits it to 10 to get 10 records at a time, and uses FOR UPDATE, which means those rows are locked while you're pulling them. The problem is that this locking prevents others from pulling from the table at the same time: they ran it as part of a transaction and you can see that the second backend essentially hangs and can't do anything. The way to get around that is to use SKIP LOCKED; instead of FOR UPDATE you do FOR UPDATE SKIP LOCKED, so each backend knows it can pull its 10 rows and skip over any that are already locked. This way backend one can pull the unique 10 it needs and backend two can pull the unique 10 it needs. That's a way to pull the data off, but then the question is how do you then delete that data, and what's the most efficient way to do it. The way they came up with is a DELETE statement with a join, returning the rows that are deleted, and that is what gets consumed. For example, you do a DELETE FROM the queue table, then you use USING to join to the SELECT statement, because it's the SELECT statement that limits it to 10 rows and does the FOR UPDATE SKIP LOCKED. The USING clause essentially gives you a join that provides the limit and the row locks and determines what will be deleted; then the WHERE clause says the subquery's ID is equal to the queue table's ID, and RETURNING gives back the entire contents of the deleted queue rows, so you're getting all of those rows back from the DELETE statement. Essentially this is a way to delete the rows you need and get all of the data from those deleted rows to be processed by the worker; and if there's any issue with the worker, or something crashes, the delete should be rolled back and those rows would still be available for the next worker to pick up and process. Overall this is a pretty simple implementation of how you could do a message queue within PostgreSQL without any additional applications or tools, so if you're interested in that you can check out this blog post.
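To make that pattern concrete, here is a minimal sketch of the kind of queue table and delete-returning dequeue described above; table and column names are illustrative rather than the post's exact code:

    -- Queue table: an ID, a timestamp and a payload
    CREATE TABLE queue_table (
        id          bigserial PRIMARY KEY,
        inserted_at timestamptz NOT NULL DEFAULT now(),
        payload     jsonb
    );

    -- Enqueue an item
    INSERT INTO queue_table (payload) VALUES ('{"job": "send_email"}');

    -- Dequeue up to 10 items: concurrent workers skip rows another worker has
    -- already locked, and the deleted rows are returned for processing.
    DELETE FROM queue_table q
    USING (
        SELECT id
        FROM queue_table
        ORDER BY id
        LIMIT 10
        FOR UPDATE SKIP LOCKED
    ) picked
    WHERE q.id = picked.id
    RETURNING q.*;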
The next piece of content is "PostgreSQL subtransactions considered harmful". This is from postgres.ai, and they're talking about subtransactions. Subtransactions are transactions within existing transactions: when you have a transaction open, you can start a subtransaction by creating a savepoint, which allows you to roll back to that savepoint if you want to preserve earlier work in the course of a long transaction. There's a nice visual in the post where you have the main transaction, and within it a savepoint, which is the subtransaction you can choose to roll back to or not. In terms of who is using subtransactions, he mentions a number of popular object-relational mapping frameworks such as Django, SQLAlchemy, Ruby on Rails with its Active Record, and others. The other thing to keep in mind is that if you're using PL/pgSQL code, an exception block also uses a subtransaction. Now, there are a number of problems you can encounter when using subtransactions. The first one he mentions is XID growth, that is, transaction ID growth, because every time you create a subtransaction it gets a new transaction ID. If a lot of these are being generated, you're going to dramatically increase the number of transaction IDs used in the system, and you want to be cautious of that because autovacuum may have trouble keeping up and freezing those XIDs so they can be reused. The second issue is a per-session cache overflow: by default a given session's cache tracks 64 subtransactions, so each session can have one transaction with up to 64 subtransactions happening within it, and if you exceed that you get a dramatic decrease in performance. He shows some examples of that here, and there's also a great post from Cybertec called "PostgreSQL subtransactions and performance" that you should check out; it was covered on a previous episode of Scaling Postgres as well. The thing to keep in mind with this 64 limit is that if you have a loop in PL/pgSQL with an exception block inside it, you could hit this limit very quickly and cause performance problems for your application. The next problem he mentions is unexpected use of multixact IDs. These are generally used when you're locking a row more than once, like the SELECT FOR SHARE example they discuss, but you can also hit it with things like SELECT FOR UPDATE if subtransactions are involved. He has an example where someone had a lot of performance problems when they were doing a SELECT FOR UPDATE, creating a savepoint, and then updating the same row; because they were using Django, which automatically uses savepoints, they ran into a lot of problems. So that's something to be aware of when using subtransactions with things like SELECT FOR UPDATE. The next problem is subtransactions with SLRU overflow. In this case it was a significant performance degradation on standbys, caused by using subtransactions in a long-running transaction on the primary: they started seeing the standby not responding, and it only came back and started responding again once the transaction concluded. That's an issue you could potentially encounter, and he also mentions that the cause of Amazon going down during Prime Day 2018 was actually subtransactions; they were using Aurora PostgreSQL, but still, they had this issue. He then gives some recommendations. First, use good monitoring of things like transaction ID wraparound, wait event analysis from pg_stat_activity, and watch out for long-running transactions to identify potential issues. You can also check whether you have subtransactions within your code base, looking for cases where you have exception blocks in your PL/pgSQL code. His final recommendation is to experiment to predict and prevent failures by stress testing your systems. He does say that a number of things are being worked on to improve some of the areas he mentions, but they are still in progress. This was a very comprehensive post, and if you're interested in this content I highly suggest you check it out.
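As a quick illustration of where subtransactions come from, here is a hedged sketch; the accounts table and the loop are purely illustrative and not from the post:

    -- Explicit subtransaction: SAVEPOINT opens one, ROLLBACK TO discards only its work
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;   -- 'accounts' is an illustrative table
    SAVEPOINT before_risky_step;                                -- subtransaction starts here
    UPDATE accounts SET balance = balance - 500 WHERE id = 1;
    ROLLBACK TO SAVEPOINT before_risky_step;                    -- undo just the subtransaction
    COMMIT;

    -- Implicit subtransactions: each pass through an EXCEPTION block opens one,
    -- so a loop like this can quickly exceed the 64-entry per-backend cache
    -- described in the post.
    DO $$
    BEGIN
      FOR i IN 1..100 LOOP
        BEGIN
          INSERT INTO accounts (id, balance) VALUES (i, 0);
        EXCEPTION WHEN unique_violation THEN
          NULL;  -- ignore duplicates
        END;
      END LOOP;
    END $$;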
The next piece of content is "How to create lots of sample time-series data with PostgreSQL generate_series". This is from blog.timescale.com, and it's a great post for learning how to use generate_series to generate data for your tables for testing or load-testing purposes. They start off very basically, where you can generate a series using integers and it just prints out a series of numbers, and with a third parameter you can specify the increment to use when generating those numbers. You can even use dates: you can set a start date and a stop date, as well as, in the third parameter, the interval at which to generate the series of date times. You can also include other columns when you're generating a series; for example, here they print a string alongside the series that was generated, and you can even use random numbers to give you a sense of, say, CPU utilization for testing purposes. Then they show an example where you can do Cartesian products, basically a cross join, to generate a lot of rows at once: for every instance of one row set, generate the number of rows indicated in the other, so 10 rows times 2 gives you 20 rows, and you can use this technique to generate many, many rows. Their last example puts it all together: they use generate_series to generate a timestamp, a device ID using a generate_series of one to four, and a random number generator in the third column to represent CPU usage, and they show some other examples of how to use date times to do this as well. So this was a great introductory post about generate_series, and if you want to learn more I definitely suggest checking it out.
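A minimal sketch along those lines; the timestamps, device count and column names are illustrative, not the post's exact example:

    -- One row per minute for a day, for each of 4 devices, with random "CPU" values
    SELECT ts,
           device_id,
           round((random() * 100)::numeric, 2) AS cpu_usage
    FROM generate_series('2021-09-01 00:00:00+00'::timestamptz,
                         '2021-09-02 00:00:00+00'::timestamptz,
                         '1 minute'::interval) AS ts
    CROSS JOIN generate_series(1, 4) AS device_id;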
The next piece of content is "Partitioning use cases with PostgreSQL". This is from anayrat.info, and he's predominantly talking about declarative partitioning, the different features it has, and when it's a good time to use them. He talks about how it supports partitioning by list, hash and range, multi-level or nested partitioning, partitioning on multiple columns, and primary and foreign keys. He mentions two mistaken reasons people give for partitioning. The first is that partitioning is necessary as soon as the table gets large, and he says that's not necessarily the case; it's not as if once you have so many gigabytes or terabytes you have to partition. I'll push back a little here, because this is one area where I frequently see partitioning used: when the management of the table becomes more difficult, for example when autovacuum, even after some optimization, still takes a very long time to run because there's just so much data in the table. Partitioning that table makes the management a lot easier, since vacuum can run much more quickly on, say, the active partitions than on the much older ones if you partition by date, so I think that is a prime case for partitioning. The second reason he mentions not to do it is to spread data over disks, and I agree with this: with the advent of all the different cloud storage systems, spreading data over disks is not a great use case for partitioning, because you can easily spread your data over multiple disks with current hardware. The use cases he does mention are, first, partitioning to manage retention: if you only need to keep data for a certain period of time and you're partitioning by date, you can delete old data quite easily without having to go in and run DELETE statements, you just drop the whole partition (a minimal sketch of this pattern appears after the transcript). The next is partitioning to control bloat: any time data is inserted or updated there's a chance for an index to become bloated, and while you can run a reindex, given a particular load on your table I could see how it could be advantageous to partition so that it's easier to reindex the older partitions that are no longer active. The next one he mentions is partitioning for low cardinality; to me this seemed similar to the reindex use case, and I haven't seen a lot of reason to do it, since generally I would rely on a partial index to cover cases like this. The next area he mentions is partitioning to get more accurate statistics; I could see this as a use case if you want to collect statistics at a more granular level due to the layout of your data, and partitioning may assist with that. He also mentions the benefits of partition-wise joins and partition-wise aggregates, although the performance gain there is not significant, so I don't know that it would be a reason to do it on its own. The last reason he mentions is storage tiering, being able to put, say, recent partitions on very fast storage and older partitions on slower storage, and I definitely see that as a use case. But if you want to learn more, you can definitely check out this post.

The next piece of content is "Logical replication tablesync workers". This is from postgresql.fastware.com, and they're talking about improvements that have been made to Postgres with regard to logical replication in the publisher-subscriber model. It goes into quite a bit of detail about the apply worker and the tablesync workers; it's essentially the internal workings of logical replication and the enhancements that have been made to improve the initial copy time and the error handling. So if you want to learn more about the internals and details of Postgres, you can definitely check out this blog post.

The next piece of content is "Solving the knapsack problem in PostgreSQL". This is from dev.to, by Francesco Tisiot, and he's talking about a way to solve the knapsack problem, which is basically: you have a number of items, each with a particular value and a particular weight, and you have a knapsack with a capacity of 20, so what are the best items to put into the knapsack to get the highest value? He uses a recursive CTE to solve this, shows the overall recursive CTE that does that type of calculation for you, walks through exactly how it works, and shows the results, where the combination with the most value is the socks, the hat and the shoes. If you want to learn more about how to do this within PostgreSQL with a recursive CTE, definitely check out this blog post (a hedged sketch of one way such a CTE can be written also appears after the transcript).

The next piece of content is "PostgreSQL: the power of a single missing index". This is from cybertec-postgresql.com, and it is a very long poem about a missing index in Postgres, which is quite interesting. If you want to read this poem about Postgres, I definitely suggest you check out this blog post.

The next piece of content: the PostgreSQL person of the week is Dmitry Dolgov, so if you're interested in Dmitry and his contributions to Postgres, you can check out this blog post.

In the last piece of content, we had another episode of The Rubber Duck Dev Show this past Wednesday; this episode was on modern web application front ends, so if you're a developer and want some more long-form content, you can definitely check that out. Our show coming up this Wednesday at 8pm Eastern Standard Time is on working with date times.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks. [Music]
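As referenced in the partitioning discussion above, here is a minimal sketch of the drop-a-partition retention pattern, assuming a monthly range-partitioned table; all table names and dates are illustrative:

    -- A range-partitioned table with one partition per month
    CREATE TABLE events (
        device_id  int,
        created_at timestamptz NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2021_08 PARTITION OF events
        FOR VALUES FROM ('2021-08-01') TO ('2021-09-01');
    CREATE TABLE events_2021_09 PARTITION OF events
        FOR VALUES FROM ('2021-09-01') TO ('2021-10-01');

    -- Retention: instead of a large DELETE, detach and drop the expired month
    ALTER TABLE events DETACH PARTITION events_2021_08;
    DROP TABLE events_2021_08;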
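And, as mentioned in the knapsack item above, here is a hedged sketch of one way a recursive CTE can enumerate item combinations under a weight capacity of 20; the item names, values and weights are made up for illustration and are not the ones from the post:

    WITH RECURSIVE items(name, value, weight) AS (
        VALUES ('hat', 8, 3), ('shoes', 15, 6), ('socks', 5, 2), ('tent', 30, 25)
    ),
    combos(names, total_value, total_weight) AS (
        -- start with every single item that fits on its own
        SELECT ARRAY[name], value, weight
        FROM items
        WHERE weight <= 20
        UNION ALL
        -- extend each combination with a "later" item (keeps names ordered,
        -- so each set of items is generated only once) while it still fits
        SELECT c.names || i.name, c.total_value + i.value, c.total_weight + i.weight
        FROM combos c
        JOIN items i ON i.name > c.names[array_upper(c.names, 1)]
        WHERE c.total_weight + i.weight <= 20
    )
    SELECT names, total_value, total_weight
    FROM combos
    ORDER BY total_value DESC
    LIMIT 1;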
Info
Channel: Scaling Postgres
Views: 281
Rating: 5 out of 5
Keywords: PostgreSQL, Scaling PostgreSQL, PostgreSQL Performance
Id: UX5Qt3nRjMY
Length: 21min 58sec (1318 seconds)
Published: Sun Sep 05 2021