Somebody's done the timer, so OK, I'll kick off. I'm Simon Riggs, and I'm going to talk about Postgres partitioning — I'm going to talk about it for 49 minutes and 16 seconds, so I hope you've all had your breakfast. The lights in here are so bright that I can't actually see people at the back, so if you stick up your hand for a question I really won't see it; just shout something out.

OK, so what are we going to talk about? Partitioning is splitting big tables up into multiple partitions — that's the topic — but obviously we need to understand why we would want to do that. I'm also going to explain a few other thoughts along the lines of when and how you would want to do that, and talk a little bit about credits for the people that made this all happen within Postgres.

The "why" really starts with an understanding of the problem around partitioning. Partitioning is really about solving problems within very large databases. Some people think very large means above 100k rows and, you know, I guess they're right if that's what they think, but really we're talking about challenges above 100 gigabytes — which isn't that large these days, so it's still very applicable. The idea is that sequential scans take a long time, indexes become impractical, and we have difficulty removing older data. That gives us an opportunity around data organization, because the way the database heap works, we just stick the data in any old place in the tables without really thinking about it too much, which is difficult when you want to remove the data later. If the data is properly organized, it's easier to remove, it's easier to add, and you can also use that structure within the queries themselves.

That gives you a number of solutions. If the data is organized naturally over time — for example an ascending order ID, or a monotonically increasing timestamp — then the data at one end of the table is going to be old and the data at the other end is going to be relatively new, and you can actually use that fact to improve query performance. That's what we already do with something called a BRIN index. Whenever people talk about partitioning I always make sure to also talk about the BRIN index, because although it's termed an index within Postgres, it's designed specifically to solve the same problems that partitioning does — in fact the original project that we created, which became BRIN indexes, was actually called auto-partitioning; that's what we were trying to achieve.
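As a minimal sketch of the BRIN idea — the table and column names here are my own illustration, not from the talk:

    -- Hypothetical append-only table whose rows arrive in time order
    CREATE TABLE events (
        id        bigint GENERATED ALWAYS AS IDENTITY,
        logged_at timestamptz NOT NULL,
        payload   jsonb
    );

    -- A BRIN index stores one tiny min/max summary per block range,
    -- so it stays small even on very large tables, provided the data
    -- is physically correlated with logged_at
    CREATE INDEX events_logged_at_brin ON events USING brin (logged_at);

    -- Range predicates can then skip whole block ranges whose
    -- summaries can't match
    SELECT count(*) FROM events
     WHERE logged_at >= now() - interval '1 day';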
Sometimes you actually want to force the way that the data is organized, and that's where partitioning comes in: we explicitly say which data needs to go in which place. So we end up with the main use case for partitioning in VLDB being what we call rolling data maintenance: we remove old data by dropping the old partitions, and we add the latest data — either via bulk load with COPY, or just a stream of inserts — at the other end of the table. We also want to optimize access to the data: typically we find that some of the data is hot, meaning it's accessed frequently, while other parts of the data are accessed less frequently, so you end up with a sort of Poisson distribution of access skewed towards the most recent data. That gives you a case where you want to optimize most queries for access to a small subset of the data, while still allowing all queries to potentially access any of the data.

Now, if you do that, it has a number of advantages, because it means we can not only increase the scan performance but also avoid churning through the cache, which has a significant effect on performance. So partitioning has some really beneficial outcomes if you use it, but that doesn't mean you should just use it all the time without any thought. It's a solution to a problem, so you've got to make sure that you've already got that problem, or that you're going to have it, because otherwise it's going to be a lot of work and a lot of pain for no real benefit. So please, please be careful. I sometimes speak to people who say they're going to use partitioning, and when I ask what for, they think that's a strange question — so really make sure you know what problem you're trying to solve.

So, my summary of partitioning — you can listen to this slide and go and have a coffee if you want, because this is really where I summarize where we are. Postgres 10 featured declarative partitioning; at the time it was mostly syntax only, and it worked for business intelligence queries, but we've been improving it over the last couple of years. If you look at the release notes, there are 23 specific features related to partitioning across Postgres 11 and Postgres 12, and if you look at the maintenance releases you'll see at least 27 bugs fixed on partitioning over the last year alone. What that tells you is that a lot of people are using it. The feature set has expanded, and the status as of Postgres 12 is that it works in almost all of the important cases and is properly optimized, so it will be fully usable for OLTP in Postgres 12. Now, you might have a right to say: so why didn't you tell me, when you introduced partitioning in Postgres 10, that it didn't quite work right for OLTP yet? I guess it would have been nice if that had been in the release notes, but it wasn't.

There's also a whole load of other features being added to Postgres at the moment that are interrelated with partitioning, and the main goal of those features is sharding. We do not have the sharding feature set solved within Postgres 12, so I want to make a clear distinction: partitioning on a single node is one thing; sharding across multiple nodes is not a solved problem. I don't want to confuse you, so I'm not going to talk about sharding, and if I mention the word BDR here, that will be the only mention in this talk of the multi-node solution space. I am just going to talk about PostgreSQL and partitioning on a single node.

The development of partitioning is something that I personally have tracked over more than a dozen years within Postgres. I wrote the original feature for constraint exclusion in Postgres 8.1; it took me about six weeks to write it, and then Tom Lane completely rewrote every line of my code before he accepted it. But, as I say, it took about six weeks, and it was always our intention to improve on that feature set; unfortunately it took a rather long time to get there. People submitted a patch for partitioning that would work with rules, and I had to point out that that wouldn't work, because COPY isn't supported by rules. So it was rewritten — the next time they rewrote it using triggers, and that doesn't work either. Unfortunately people were getting a bit frustrated with me at this point, saying: well, why don't you
tell us how you should write it? And so we did: we discussed how declarative partitioning needed to work, we discussed the way that inserts needed to work, and so when Postgres 10 came along we actually had declarative partitioning working as a primary feature within the core server. It did perform — it does perform — constraint exclusion: it eliminates partitions that don't need to be scanned, so it's actually useful for business intelligence performance. But at that point it didn't do some of the things we needed for OLTP. So in Postgres 11 we added essential features like the ability to create a primary key on a partitioned table, and the ability to have triggers, which in turn allows you to have foreign keys on the table. Then in Postgres 12 we added significant partitioning performance improvements that make OLTP possible.

On credits: I've mentioned my own contribution, and Nikhil was one of the people who submitted the earlier, rejected patches, so I've mentioned him as a historical note. The recent work on partitioning has been championed mostly by Amit Langote and committed by Alvaro Herrera, who has written most of the things like primary keys for partitioned tables and suchlike. You'll also see another name you might not recognize: David Rowley has been intimately involved in the performance aspects of this, especially the optimizer changes, and he's recently been elected a committer as a result of the quality of his contributions, which is a very good step forward. I could probably add about a dozen names to this slide of people who added individual features as part of this project, so it really has been a multi-company, multi-year effort to get partitioning into core in the way that it is now.

So what exactly have we put into Postgres? It turns out there are many different approaches to the challenges I laid out before, which is basically: how do you use the organization of the data itself as a way of improving performance? If you look at things like the ORC data file format, it contains some intelligent metadata within the file format itself, and that can be used to implement partitioning approaches — other open-source tools do use that. There's also the concept of metadata within an index, which is what we do with BRIN. Then another approach is to keep the partition metadata within the catalog of the database. I've mentioned the word hypertable there, because if you speak to the Citus guys or the Timescale guys you'll see that they use the hypertable approach. Personally I think that's a reasonable and valid approach — I don't see anything wrong with it; the only issue was that it wasn't in core, and what we've been trying to do is get this functionality directly within core Postgres.

The debate is of course about the best way to do that. For example, in Oracle and other systems partitions aren't tables, but in the implementation we've put into Postgres, partitions are themselves tables — which can sometimes be a little confusing, because a partitioned table does not itself contain any data; it is composed of partitions, which are tables. So if you're coming from an Oracle background that can be a little confusing, but there are actually quite a lot of different ways of implementing the partitioning concept in the database literature and in other databases, so it shouldn't really be a surprise if I tell you that
partitioning is not part of the SQL standard — and neither, for example, are indexes, because these are implementation details. So we have had to add things to CREATE TABLE and ALTER TABLE which are non-standard and very specific to the Postgres implementation. That's a little bit awkward, because the phrase PARTITION BY is already within the SQL standard, but we've completely changed its meaning for the PostgreSQL implementation.

Now you're thinking: just give us an example, for God's sake. So here I go. Say we've got a table called measurement that is going to take a load of measurements for particular devices. Each measurement is taken at a particular time, and there's a load of stuff that, for Andrew's benefit, is in jsonb — though obviously it could be any type of data at all — and for my benefit I've added a BRIN index, just to show how easy that is. Then, if you want to implement partitioning, all you need to do on the CREATE TABLE statement is add the phrase PARTITION BY, specify the type of partitioning you want — in this case range partitioning — and specify the column that contains the ranges of values we're looking for, which is the log_ts, or "log timestamp" (that's what it means to me), column. That's all we do on the CREATE TABLE. The downside is that we then need to create multiple sub-tables, using a syntax where we give the sub-table a name, say that it's a PARTITION OF measurement, and specify a range of values known as the partition bounds: we're saying FROM x TO y.

If it isn't clear already, the partition bounds must be unique — they can't overlap any existing partition, so by definition there's no overlap. If you go back to the way inheritance worked: inheritance and constraint exclusion work completely differently, and you could have overlapping bounds on your tables, which was quite confusing. So it's an important aspect of this implementation that the partitions do not overlap. Then, depending on how many partitions you want, you just create a number of sub-tables. In most tests you'll be adding two or three or four partitions, but in production you may want hundreds or potentially thousands of sub-tables within your partitioning scheme, and for the rest of the presentation I'm going to talk about the considerations you need to think about around that.

So, the partition types that we implement. Range partitioning gives you a range of values — a range of timestamps, or a range of integers, depending on how you're using it. For me, range partitioning is the key use case; the other ones exist and they're a little bit interesting, but range partitioning is like 95 to 99 percent of the partitioning use case, and a lot of the time it will be based on a timestamp — that's why I've used the example of measurement, because that's the type of use case you get around the Internet of Things. Range works fine if your keys are numbers or timestamps, but if the things you'd like to segregate are text strings, then the way to do that is list partitioning. That allows you to have a partition for the United States and another partition for the UK, and that kind of thing — which obviously gives you the problem of what you do with, say, the EU partition, should some data in there need to be moved to another partition, but we'll talk about that later.
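To make that concrete, here's roughly what those two forms look like as SQL — this is my reconstruction, and the exact names and bounds on the slides may differ:

    -- Range partitioning on a timestamp, as in the measurement example
    CREATE TABLE measurement (
        device_id bigint NOT NULL,
        log_ts    timestamptz NOT NULL,
        data      jsonb
    ) PARTITION BY RANGE (log_ts);

    -- A BRIN index on the partitioned table cascades to its partitions
    -- (Postgres 11 onwards)
    CREATE INDEX measurement_log_ts_brin ON measurement USING brin (log_ts);

    -- Each partition is itself a table; bounds may not overlap, the
    -- lower bound is inclusive and the upper bound exclusive
    CREATE TABLE measurement_2019_01 PARTITION OF measurement
        FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');
    CREATE TABLE measurement_2019_02 PARTITION OF measurement
        FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');

    -- List partitioning for discrete values such as country codes
    CREATE TABLE orders (
        order_id bigint NOT NULL,
        country  text NOT NULL
    ) PARTITION BY LIST (country);

    CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('US');
    CREATE TABLE orders_uk PARTITION OF orders FOR VALUES IN ('UK');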
There is of course also the hash partitioning type, which effectively just spreads data across the different partitions by hash value, more or less randomly, and it's quite difficult to see how that can be useful outside the context of sharding — so I'm not going to talk about hash-based partitioning at all in the rest of this talk.

So, just so that we're clear about what we're talking about, I've got some pictures. The idea here is that we've got a table partitioned by time, and we're splitting it down into multiple sub-tables. There's no limit to the number of sub-tables you can have, but they must all have unique partition bounds. This is an example of a range partition with a time-based partitioning scheme, where we just add new partitions as we decide to.

What's interesting about the implementation we have in Postgres is that it's possible to have multi-level partitioning. It's not limited to just two levels — it can be nested quite a long way if you would like it to be — but most people see two levels as the potentially interesting case. In the example I had about devices, you might have a specific partition per device, or maybe all the devices in a particular region in one table, and then time ranges within that. So it's possible to end up with multiple levels of partitioning. That doesn't really cause any difficulty, apart from the confusion of the DBA at the added complexity and number of partitions, but if you've got a handle on that then it's not a problem.

So what does the syntax look like to manipulate this? I mentioned that we can add a partition to the table by saying CREATE TABLE ... PARTITION OF, and we can also simply DROP that table as a way of removing it from the partitioned table. Another approach is to ALTER TABLE and ATTACH the partition. That doesn't create the table — and DETACH doesn't drop the table; it just moves it under the partitioned table, or back out again. So the attach and detach syntax is not the same thing as create and drop, just to be clear. Those commands are fairly easy to use, you can script them, I believe there are some tools you can get to generate partition syntax for you, and I believe that in the future there will be automatic partitioning within Postgres — we just haven't written it... thank you, that's my next slide. Excellent, I love it when that happens.
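A sketch of those two things — sub-partitioning and attach/detach — again with names of my own invention:

    -- Multi-level partitioning: a partition can itself be partitioned
    CREATE TABLE readings (
        region text NOT NULL,
        log_ts timestamptz NOT NULL,
        data   jsonb
    ) PARTITION BY LIST (region);

    CREATE TABLE readings_emea PARTITION OF readings
        FOR VALUES IN ('EMEA') PARTITION BY RANGE (log_ts);

    CREATE TABLE readings_emea_2019_01 PARTITION OF readings_emea
        FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');

    -- ATTACH/DETACH move an existing table under the partitioned
    -- table and back out again, without creating or dropping anything
    CREATE TABLE measurement_2019_03 (
        device_id bigint NOT NULL,
        log_ts    timestamptz NOT NULL,
        data      jsonb
    );

    ALTER TABLE measurement
        ATTACH PARTITION measurement_2019_03
        FOR VALUES FROM ('2019-03-01') TO ('2019-04-01');

    ALTER TABLE measurement DETACH PARTITION measurement_2019_01;

    -- The detached table still exists and can be archived, or dropped
    DROP TABLE measurement_2019_01;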
So yes, the thing you have to think about with the scheme we've chosen is that the partitions are full-strength database objects that can be queried individually, and as a result you have to consider what the locking implications are when you perform certain actions. In Postgres 12 we managed to get a patch through that allows a ShareUpdateExclusive lock to be taken on the partitioned table when you attach a partition. That blocks all other DDL, but it does allow you to keep running selects — and indeed DML — while it runs. That's a very important change, but it's only in Postgres 12: Postgres 10 and 11 take a full AccessExclusive lock on the parent table, not just on the partition. Obviously, if you do CREATE TABLE ... PARTITION OF it's going to lock the partition itself — that's clear; what I'm talking about is the locking that happens on the parent. And even in Postgres 12 we still haven't worked out how to DROP or DETACH without taking a full AccessExclusive lock. So if you're working out exactly when to drop your partitions, you need to do that with some care: you can't just create a new partition at the same time as you drop an old one, because the drop will have different locking effects from the create. That's important.

One of the gotchas that comes out of that is that Postgres 11 added a feature allowing you to have a default partition — one that basically doesn't have any partition bounds. That sounds like a really cool feature, apart from the fact that every time you add a new partition, it has to scan all the data in the default partition to see whether some of that data ought to go in the new partition. So it does a full scan of the default partition, and if the default partition is big, that DDL is going to run for a long time — and remember, it's running with an AccessExclusive lock, so it's going to lock out your whole system for however long it takes. Basically, as far as I can see, default partitions are a really bad idea. I'd rather you heard it from me now — if you think you know what you're doing, go for it — but I'd rather say it to you now than have you go into production and have your system locked out for 20 minutes while it scans the default partition.

Unfortunately we've also got a couple of features in Postgres 12 that interact badly. We've added some code that allows you to avoid sorting if the partitions are referenced in order — so, for example, if you do a SELECT statement with ORDER BY timestamp, it can scan the partitions in time order and therefore avoid certain aspects of data sorting. That's great, but there's also another feature that allows you to use expressions as partition bounds. You might be tempted to have a partition bound that says: I'll extract the month from a particular timestamp, and this partition will be the month of March. If you do that, the optimizer doesn't yet know that the extract of March from a timestamp comes after the extract of February, so if you use functions or expressions as bounds, it can actually interfere with some of your optimizations. If you're using time-range partitioning, please use precise timestamp constants as bounds, so that you know what is going to happen at the partition bounds.
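A rough sketch of both gotchas, continuing the same hypothetical table:

    -- Preferred: literal timestamp constants as bounds; the planner
    -- knows exactly how these partitions are ordered
    CREATE TABLE measurement_2019_04 PARTITION OF measurement
        FOR VALUES FROM ('2019-04-01 00:00:00+00')
                    TO ('2019-05-01 00:00:00+00');

    -- A default partition catches rows that match no other partition...
    CREATE TABLE measurement_default PARTITION OF measurement DEFAULT;

    -- ...but now every CREATE/ATTACH of a new partition has to scan
    -- measurement_default, while holding heavy locks, to prove none of
    -- its rows belong in the new partition -- slow once it grows large
    CREATE TABLE measurement_2019_05 PARTITION OF measurement
        FOR VALUES FROM ('2019-05-01 00:00:00+00')
                    TO ('2019-06-01 00:00:00+00');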
With that said, let me talk a little bit about partitioning performance for OLTP, because that's what a lot of people think partitioning is for, so it's the thing we really need to make work. The goal here is to make all of the operations on a partitioned table cost the same no matter how many partitions you've got — an algorithmic complexity of O(1), or "OK" as I call it. That's a challenge in Postgres, because unfortunately the code path for INSERT is different from the code path for UPDATE and DELETE, which is different from the code path for SELECT, so we've had to implement this three times to get it to work. That's one of the reasons it's taken so long, and it's also one of the reasons why Postgres 10 didn't solve the problem at all. We also wanted to improve the speed of COPY when inserting into multiple partitions. So what we've done over the last couple of years is write a load of patches and submit them to core to improve this performance.

Some of you will have seen these graphs before, because I had them last year, but at that point we didn't necessarily have the patches committed to core Postgres. You can see that where we were in Postgres 10 has been drastically improved: partitioning performance has increased significantly, but there is still a slight gap at the top between the performance of a single table as against a partitioned table. Obviously we're trying to make that gap as small as possible, but we haven't got it to zero. Along the way we've had to tune quite a few different things — for example, partitioned tables with high numbers of columns didn't work very well, so we had to tune that too. But we finally got to the point where we'd removed all of the reasons why partitioning with lots of partitions was slow.

We did all of our testing around this particular number of 10,000 partitions. That's not to say that having 10,000 partitions is a recommendation; it's just that we know some use cases — especially in very, very large databases where people might have daily data and then further sub-partition within that — where actually reaching 10,000 partitions isn't unbelievable. I've seen that there are a lot of people who want to have 10,000 partitions. In these graphs, the three columns on the left represent the unpatched, Postgres 10 performance — and I say three columns because there is a green column on the left, it's just not very tall; that was a fairly small number of transactions per second. That's how bad things were, unfortunately. But after a lot of work, David Rowley managed to keep kicking down the doors until we got to the point where, patched, we can make the performance quite close to the non-partitioned case. With regard to the way we developed the patches: we developed them and put them, in working form, into 2ndQuadrant Postgres about eighteen months ahead of their availability in core Postgres. All of the work had been submitted, but it wasn't yet in a working form in core at the time we did this. The reason I'm showing it now is that this is the current state of Postgres 12 — which hopefully is going to be released in the next few weeks, but we'll see about that.

We also did work on COPY into tables with 10,000 partitions. What you should learn from these graphs is that there is a slight performance advantage if you direct the data load at a particular partition rather than at the top-level partitioned table. It's a slight advantage — not really worth gunning for if it's complex to do — but as you can see, the gain we've made over Postgres 10 is really quite dramatic.
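In concrete terms, the two load paths look like this — the file paths are placeholders of mine:

    -- Loading via the partitioned parent: each row is routed to the
    -- correct partition automatically
    COPY measurement FROM '/path/to/march_batch.csv' WITH (FORMAT csv);

    -- Loading directly into a known leaf partition skips the routing
    -- step, which is slightly faster when the batch is known to fit
    -- entirely within that partition's bounds
    COPY measurement_2019_03 FROM '/path/to/march_batch.csv' WITH (FORMAT csv);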
So where did we get to, then? Postgres 11 — sorry, this is Postgres 12. When we're loading data into a very large table, what we wanted to check was that the load speed didn't go down over time. What this graph shows is where we took a batch of 10,000 rows and inserted it into the table 400,000 times, so that at the end of all this we had a table with four billion rows in it. If you look at the graph, it shows that there's just zero degradation over time as the table grows; if you put a B-tree on there, there's a very slight degradation over time. Now, the reason I wanted to investigate this particular thing is that Timescale had made some claims that Postgres performance deteriorated over time, and we had been trying to find what the problem was — and we haven't actually been able to recreate the results showing a degradation over time. So my next slides compare the performance of Postgres against the performance of TimescaleDB. I could also have compared against 2ndQuadrant Postgres, or against Citus, that kind of thing, but specifically what we were looking for was: is there a degradation, as earlier results had shown? If there's not, then you know what's going on; if there is, let's find it and let's improve it. That's why we're comparing specifically against Timescale in these graphs.

If you look at this particular graph, you'll see that the performance shown here is about two to six thousand rows per second, and depending on whether you have monthly or weekly partitions — which is the middle set of numbers — or just an unpartitioned table, you get different speeds out of the system. In all those cases we get pretty linear performance. There's a little bit of a glitch at the end there, but I'm not sure what that is and it doesn't really seem to represent anything — I think it's just a testing blip, absent any evidence otherwise. What we've then done is take those results and sort them, so you can see how frequently any outliers exist. So this is not performance over time; this is all the results sorted by speed. You can see there's a couple on the left that went fast, most are roughly the same speed, and then there's a very small number of outliers — in all of the different systems — where the performance drops for that time period or that batch only. From that I can say that Timescale and Postgres are behaving much the same, and that there are some times when performance drops; I would guess that's associated with overload around checkpoints.

Let's look then at some other performance results. These were performed by a Dutch guy — they're not my results, they're not Timescale's results, they're independently produced results. The graph on the left shows the performance of a SELECT query: on the blue line, which is Postgres 11, as we increase the number of partitions up into the thousands, performance degrades as the partition count increases; whereas on the green line, which is Postgres 12, performance is essentially static with the number of partitions. The graph on the left is SELECT; the graph on the right is INSERT performance, and it shows the same pattern. From that you can see that partitioning becomes usable for the common OLTP operations with large numbers of partitions in Postgres 12, so it's fully usable for production systems — fully usable apart from some gotchas that you need to understand.

Now, why am I claiming that Postgres 12 is OK to use if there are some graphs showing that it doesn't scale, and that it does suck when you have large numbers of partitions? That's because the types of query that do not scale well with the number of sub-partitions are the ones where we're not able to apply partition pruning effectively. What you need to understand at this stage is that there are issues related to particular functions — whether a function is stable, volatile or immutable — and depending on that categorization, the function can be evaluated at planning time, or at execution time, or not until the last minute. If you have a volatile function, it can cause performance problems. So if you're creating your own functions, you need to look at whether they should be marked VOLATILE, STABLE or IMMUTABLE, and that's especially true when you're using partitioning. You also need to understand the difference between plan-time and execution-time partition pruning.
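Here's a tiny sketch of that distinction, against the hypothetical measurement table from earlier:

    -- Plan-time pruning: a constant comparison lets the planner
    -- discard irrelevant partitions before execution even starts
    EXPLAIN SELECT * FROM measurement
     WHERE log_ts >= '2019-03-01' AND log_ts < '2019-04-01';

    -- Execution-time pruning: now() is STABLE, not a constant, so the
    -- pruning decision is deferred to executor startup (shown as
    -- "Subplans Removed" in EXPLAIN ANALYZE output on Postgres 11+)
    EXPLAIN ANALYZE SELECT * FROM measurement
     WHERE log_ts >= now() - interval '7 days';

    -- Correct volatility labels matter: a function wrongly left
    -- VOLATILE can defeat pruning entirely
    CREATE FUNCTION week_start(t timestamptz) RETURNS timestamptz
    LANGUAGE sql STABLE
    AS $$ SELECT date_trunc('week', t) $$;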
Now, I could probably talk for another half an hour just on those topics, and because they're not specific to partitioning, I'm not going to go into them in detail here. But just to say: if you know what you're doing, and you write your queries carefully — as I would expect you to do on a multi-terabyte server — then you will enjoy scalable performance. If you are incautious in the way you write things, your performance will suck; but that was kind of true already, even before partitioning came along. What I'm trying to persuade you of is that if you use partitioning, that is not the only thing you will have to think about: you need to implement partitioning and then also take the same level of care that you would have had to take anyway. And unfortunately that does mean that you need to understand how the timestamp data types work, and how casts work between the different types of timestamp. These points may be rather irritating to you as DBAs and developers, but there's no real substitute for understanding how these things work — if you don't, you're just going to walk smack into a brick wall that's waiting for you. If you don't know what I'm talking about, please look it up in the manual; I could talk for half an hour on either of those two subjects and still only just get it across to you. So I apologize if I'm saying "there's a problem" and then not explaining it, but it's complex enough and I don't have time, given my six point seven five minutes remaining.

So what are my takeaways from this? Timescale and Citus have had working partitioning, via the use of hypertables, within their extensions for at least the last three years. I take that as a very good sign that the Postgres ecosystem is healthy. It's working — the technology around extensions is having a positive effect on the Postgres community, because we've been able to bring a solution into the arena quicker than we've been able to in core; but we do now have those things in core. The approach we took with our patches was: we wrote them, we made them available in 2ndQuadrant Postgres, and we also published them to core. 2ndQuadrant Postgres is a fork of Postgres — an advanced fork, where almost all of the features are being put back into core Postgres ahead of time. So am I pushing 2ndQuadrant Postgres on you? Well, it's actually the same price as normal Postgres support, so if you want it, it's there; if you don't want it, that's fine. What is important is that we understand that Postgres is moving forwards, and that core Postgres offers you the stability you need for your applications, but sometimes you will need other extensions, or forks of Postgres, to solve specific problems — and the beauty of the ecosystem we're in is that those options exist for you. The next thought around that is: if you don't see something in core Postgres that you'd like, if you want something changed, you can speak to one of the vendors to get them to add it to their product, or you can
hire people to write new features for Postgres, which is one of the things 2ndQuadrant does. That is all the beauty of the ecosystem. But obviously, if you are using forks or extensions, please make sure to check whether the features you are being told about don't already exist in core, or will do fairly shortly, because I think that's a very important point. So while I'm encouraging the thought that extensions and forks give you immediate solutions to your problems, obviously in the longer term we are all interested in seeing things go into core Postgres.

So, Postgres 12 — where is it? Well, I see a group of people all sitting at the back there who are likely responsible for these things. Now look, we're all responsible for when Postgres 12 comes out. Who has taken the time to download the beta, investigate their use cases and report bugs — who's done that? OK, so when Postgres 12 is late, you'll know who was responsible, right? Postgres 12 hopefully is going to appear soon. It does have a release management team, and it's been actively worked on in beta for months, so it should be in a good position to release. The long beta period allows us to synchronize the release of a lot of other sub-projects — the usual suspects of repmgr, Barman, pglogical, etc. — with Postgres 12, so when it releases you'll have the full ecosystem, not just the core server. And of course it's supported by 2ndQuadrant and many other vendors that you see here today. There's also another shipment of books in — some of the things I've talked about today are quite well described in the latest cookbook, and I assume there will be further versions of the cookbook as well.

Given that I have 90 seconds remaining, just to say: I hope that's been a useful introduction to what you can do with partitioning. I hope you picked up how simple the syntax for partitioning actually is, and that it is relatively easy to use. We will further extend partitioning in later versions to make it even easier to use, but the thing that will remain a challenge for you on a very large database is to take a lot of care with your queries, to make sure they are correctly picking up the full advantages of partitioning — and that's not really any different from creating an index, where you need to be careful that your queries actually use the index. So that's not really a change in the way we think about things, but it is something that you need to continue to take care of anyway. So thank you for listening to me. If you want to talk to us, 2ndQuadrant is over there, and if you're motivated to talk to other vendors, go and talk to them as well. It's the last day of the conference — I hope you enjoy it, and thank you very much.