UUIDs are Bad for Performance in MySQL - Is Postgres better? Let us Discuss

Video Statistics and Information

Captions
What is going on, guys. This article is a little bit old, 2019, but it's a really, really good, well-written article. Yves Trudeau, well done, that is a very good article: "UUIDs are Popular, but Bad for Performance, Let's Discuss" is the title. I'm going to prefix this by saying there is some misconception here that I'm going to talk about at the end of the video, but Yves only discusses MySQL, obviously, and the specific architecture of MySQL where the random nature of UUIDs can screw you over. So let's go through that. I'm going to go through the parts I'm really interested in and jump over the rest of the article; I'm going to reference it below for you guys to read at your leisure.

So what are UUIDs? Universally unique identifiers. I usually refer to them as GUIDs, globally unique identifiers; honestly I don't know if there is a difference or not, there might be, but this is how they look. Very, very useful, guys. GUIDs and UUIDs are very useful, specifically if you want to build an idempotent backend API, such that the client is responsible for submitting row inserts, say orders, and you want to make these POST operations, these writes, idempotent. That means if the client has to repeat the request somehow, say it hit a failure, it can safely repeat the same operation without a side effect. Very, very important, idempotency. I talked about it right here if you want to learn more about it; Kafka implements something similar to achieve idempotency. A UUID is one of the most popular ways to achieve that. Obviously it's also used a lot in other ways to guarantee uniqueness, because the chance of getting a collision is very low. He goes through the details of how you build a UUID; we're not really interested in that, we know how it works.

So, what is wrong with UUID values? Let's read a little bit and then discuss. In order to appreciate the impact of using UUID values as a primary key, and that's the discussion here, we're using a UUID as a primary key, and not just any primary key, a primary key in MySQL, because that's a very different thing. A primary key in MySQL is different from a primary key in Postgres, where there is really no such thing as a clustered primary key; it's always a secondary index. We're going to talk about that. This clustering concept, or, as Oracle calls it (and that's my preferred name), an index-organized table, is really the key here. It is important to review how InnoDB, which is the default storage engine for MySQL, stores the rows of a table in a B-tree ordered by the primary key. In database terminology we call this a clustered index. The clustered index orders the rows automatically by the primary key, so the table is organized based on the primary key index. Think about it like this: if you have a primary key with a value of one, the entire row lives right next to that index value; value two, the entire row next to it, and so on. That's unlike Postgres, where the heap exists somewhere, the index exists somewhere else, and there is a pointer from one to the other. It's slightly different. Clustering has advantages and disadvantages. Obviously you have everything tucked nicely together and ordered, because the rows literally are the index, so they are ordered; the table is ordered. Focus on the word ordered here.
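To make two of those ideas concrete, the client-supplied UUID as an idempotency key and the fact that in InnoDB the primary key is the clustered index, here is a minimal sketch. The table and column names (orders, id, payload, and so on) are made up for illustration; they are not from the article or the video:

```sql
-- Hypothetical MySQL/InnoDB table: the PRIMARY KEY is the clustered index,
-- so each full row is stored inside the B-tree, ordered by `id`.
CREATE TABLE orders (
    id         CHAR(36)      NOT NULL,   -- client-generated UUID in its 36-char text form
    customer   BIGINT        NOT NULL,
    amount     DECIMAL(10,2) NOT NULL,
    created_at TIMESTAMP     NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)                     -- clustered: the row lives wherever `id` sorts
) ENGINE=InnoDB;

-- Idempotent retry: if the client resends the same UUID after a timeout,
-- the duplicate primary key makes the second insert a no-op instead of a second order.
INSERT IGNORE INTO orders (id, customer, amount)
VALUES ('9f1b2c3d-4e5f-46a7-8b9c-0d1e2f3a4b5c', 42, 19.99);
```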
When you insert a new row with a random primary key value, InnoDB has to find the page where the row belongs, load it into the buffer pool if it's not already there, insert the row, and then eventually flush the page back to disk. With purely random values and large tables, all B-tree leaf pages are susceptible to receive the new row; there are no hot pages. I'm going to explain this with a diagram that I drew in a minute, but essentially, when everything is in a clustered index, everything is grouped together, so if you pulled one page you have all the neighboring rows in that same page. Technically, if you want to write row number one, you just pulled the page, and now you want to write row number two, that page is already there, you just slot the row right in, in memory, and then you do three, four, it's all tucked into that same page. But if you're inserting row number one, and then row number one hundred, and then row number seven thousand, and row number eight thousand, these are different pages, so imagine the reads that happen. And you might say, Hussein, the UUID is not really a number. It doesn't matter; this exact thing happens with any random value, it doesn't have to be a UUID, and that's the misconception that I think is happening. Yes, with UUIDs you obviously get randomness, but you can technically have randomness with integers too: if your client is submitting the primary key values and they're not sequenced, they arrive as random values and you get the exact same behavior.

In order to explain this better, I drew something quick and dirty. All right guys, how about we take an example of how this actually works instead of just reading. I drew a few diagrams to explain this. Let's assume this is my memory, and this is the clustered index. These are called pages, and each page has a specific size and can hold n rows; let's say for simplicity it can hold 10 rows. And since it's clustered, it has to be ordered, so page A holds rows 1 to 10, page B holds rows 11 to 20, and so on.
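If you want to poke at the real-world counterparts of these "pages" and this "memory" on an actual MySQL server, a small sketch; these are stock InnoDB settings and stats views as far as I know, and the values in the comments are MySQL's documented defaults, not measurements from this video:

```sql
-- InnoDB page size, the unit that gets read into and flushed out of the buffer pool.
SHOW VARIABLES LIKE 'innodb_page_size';        -- 16384 bytes (16 KB) by default

-- Total buffer pool size; once the working set exceeds this, the thrashing starts.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; -- 134217728 bytes (128 MB) by default

-- Rough view of how many pages are cached and how many are dirty (modified, not yet flushed).
SELECT POOL_ID, DATABASE_PAGES, MODIFIED_DATABASE_PAGES
FROM information_schema.INNODB_BUFFER_POOL_STATS;
```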
So let's do an example where we do sequential writes first: one, two, three, four, five, beautiful sequential writes, one after the other. Let's see the effects, and then we'll do random writes and see the effects on MySQL, or really any clustered index, because you can do clustering in Postgres as well, it's just not on by default. So now, if I'm inserting row number one, where the id field, which is the clustered index, is one, the database asks: row number one, which page is this? A very quick mathematical check finds out it's page A. Once you find the page, you pull it from disk, that's a read seek on disk, into memory, and then the next step is to write the row inside that page, making it a dirty page. We just touched this page in memory, and you might say, Hussein, we're only touching the stuff in memory, don't we need to write it back to disk, otherwise what happens if MySQL crashes? Don't worry about it: there is the write-ahead log, InnoDB's redo log, where changes are written to disk in an append-only mode just in case of a crash, so the page can be reconstructed. That is out of the scope of this video, but assume we are durable; this is an ACID database after all. So I wrote this row and it's a dirty page now: one read from disk and one write to memory. Not so bad.

Now let's write row number two. What happens? The database says, okay, where does row number two go? It's also page A. Wait a second, I don't need to go to disk, because page A is nicely available in memory, so all I have to do is write row two into that same page. That's a quick, beautiful, fast write. Row three, same thing, it's also there; four is also on the same page. I didn't do any reads from disk, it's all writes to memory, so it's fast. Up until I reach row number 11: page A is full at this point, so at a certain time MySQL decides to flush it to disk. I'm not going to go through exactly when, because there are three or four parameters in MySQL that configure when to flush, but assume A is now flushed. Now I want to insert row number eleven. Oh, it's in page B, so I do a read against the disk, pull page B into memory, and write row eleven. Row twelve, same thing, the page is nicely available in memory; row thirteen, same. You see the idea. So this is what happens when we are sequential.

What about random? That's the UUID, the GUID case, where every primary key value is completely random, not sequential. Let's take the same example but with integers, because it's the same exact problem. I'm going to insert row one thousand. Okay, one thousand lives in page P, so let's go ahead and pull P into memory and write the row. The next write is row number one. Row number one is page A, so I read from disk, pull page A into memory, and write it. Not so bad, it's still one read and one write to memory (there is another write to the write-ahead log that I'm not showing here, but we don't really count it; this is what matters). Now the next one is row 20, a completely different page; then row 50, a completely different page. We're not reusing those beautiful cached dirty pages; unfortunately they're just sitting there. It's so sad when you see this; it's an insult to the database.

Let's keep going. What happens when my memory, my buffer pool, is full? What happens on the next write? The next write is row 80, and guess what, 80 lives in page K, and K is not in memory, so I have to read it and put it in memory, but memory is full. What to do? You are obliged to flush some of the pages to disk, the least recently used pages, based on the caching algorithm. So the buffer pool says, okay, let's flush page P. We flush P to disk, then read page K from disk, put K into memory, and finally write the row. So you just did a disk write, a disk read, and a memory write (plus the write to the log if you want to count it), all because you wanted to do a single insert. You did a lot more work, and that is the graph from Percona I'm about to show you. When you reach this stage, when you're essentially thrashing memory, flushing pages out and pulling pages back in over and over, you become what they call I/O bound. When you have enough memory in the buffer pool, that's fine; I mean, yes, you can just throw a lot of memory at it and solve the problem, but you're only delaying the inevitable. Let's do another one: row 500, another page. If you happen to hit the same page, which is very rare with random GUIDs, you'll reuse the same dirty page, but if not, you're always going to do this read, read, write; read, read, write.
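A tiny sketch of the two scenarios above, using made-up table names: one table whose clustered key is sequential and one whose clustered key is random. On a table much larger than the buffer pool, the second pattern is the one that degrades into the read-flush-write cycle just described:

```sql
-- Sequential clustered key: every insert lands at the right-most leaf page of the
-- B-tree, so the same hot page keeps absorbing new rows until it fills up.
CREATE TABLE seq_pk (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    payload VARCHAR(100),
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Random clustered key: UUID values sort all over the place, so inserts scatter
-- across the whole B-tree and keep pulling cold pages into the buffer pool.
CREATE TABLE rand_pk (
    id      CHAR(36) NOT NULL,
    payload VARCHAR(100),
    PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO seq_pk  (payload)     VALUES ('row 1'), ('row 2'), ('row 3');
INSERT INTO rand_pk (id, payload) VALUES (UUID(), 'row 1'), (UUID(), 'row 2'), (UUID(), 'row 3');
```

MySQL's UUID() generates version-1 values, but in their text form the rapidly changing timestamp bits come first, so consecutive values still land far apart in the index, which is exactly the random-write pattern in the diagrams.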
Well guys, let's go through the graph. I want you to pay attention to these three series. What Yves did is store the GUID values, the UUIDs, three ways: as a CHAR(36) string, as BINARY(16) (because a UUID is 16 bytes), and as base64. He tried those three approaches, and look at this. Take the string case: when you first start, writes are at something like 8,000 writes per second, beautiful, and that's while your memory, the buffer pool, is not full yet. Once you reach that threshold, after, what, 150 seconds, the memory is now full of dirty pages that are barely reused, the insult to the database we talked about; you have all these pages with a single write in them. So sad. At this point you start flushing to disk, so an insert is doing a read and a write, and actually another write to the log if you think about it. You're doing more work, and if you're doing more work your throughput slows down, so you can see it just dies all of a sudden and settles down here, because now you're thrashing, flushing from memory to disk and reading from disk back to memory. And you might say, Hussein, with BINARY(16) we lasted a little bit longer. Why? Because the stored UUID is smaller, so you can fit more rows, effectively more useful dirty pages, in memory. That's the beauty there, but it's still inevitable: you go a little further, and eventually you hit that memory limit. Yves explains it beautifully, so let's read it: "In all cases, the insertion rate is at first CPU bound" (while we only read a page from disk, put it in memory, and write to memory, we're just using the CPU) "but as soon as the table is larger than the buffer pool, the insertion rapidly becomes IO bound. This is expected and should not surprise anyone. The use of a smaller representation of the UUID values just allows more rows to fit in the buffer pool, but in the long run it doesn't really help the performance."

So he suggests two solutions, and they're beautiful solutions; I like this part better. First he says: since UUIDs are random, let's introduce some sort of order to them, some sort of sequencing. What he did is essentially take the first four characters and fix them based on the current week, so within a given week they are constant. When they are constant, all the UUIDs generated that week share the same prefix, so the database sees them as nicely grouped, almost organized together, and effectively treats your writes as nearly sequential. That's the first approach. The other approach is mapping UUIDs to auto-increment integers: don't use the UUID as the primary key, make the primary key an integer that is sequential. Not just any integer, because you can still do random inserts with integers; it has to be auto-increment, managed by the database, to be useful. Then you use a mapping table to map the UUID to the integer, check for duplicates, and all that stuff. Guys, this is very interesting.
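Here is a sketch of what those two fixes can look like in practice. The pseudo-ordering idea has a close cousin built into MySQL 8.0: UUID_TO_BIN() with its swap flag moves the timestamp portion of a version-1 UUID to the front, so values generated close in time sort close together. That is not the exact week-prefix trick from the article, just the same idea. The mapping-table sketch uses hypothetical names:

```sql
-- Fix 1 (related idea): store the UUID as BINARY(16) with the time bits swapped
-- to the front (MySQL 8.0+), so inserts are roughly time-ordered in the B-tree.
CREATE TABLE orders_bin (
    id      BINARY(16) NOT NULL,
    payload VARCHAR(100),
    PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO orders_bin (id, payload) VALUES (UUID_TO_BIN(UUID(), 1), 'order #1');
SELECT BIN_TO_UUID(id, 1) AS id, payload FROM orders_bin;   -- read back in text form

-- Fix 2: keep a sequential auto-increment clustered key internally and demote the
-- UUID to a unique secondary index, used only to look rows up and catch duplicates.
CREATE TABLE orders_map (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    uuid    BINARY(16)      NOT NULL,
    payload VARCHAR(100),
    PRIMARY KEY (id),
    UNIQUE KEY uk_orders_uuid (uuid)
) ENGINE=InnoDB;
```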
And here he's showing that with the pseudo-order, all of a sudden, look at that, you can see the improvement. Nice approach, beautiful.

So let's end this, guys, by discussing Postgres, because Yves didn't mention it here, but Postgres, compared to MySQL, takes the advantage in this case in my opinion. I might be wrong in some cases, so correct me if there are experts out there. Postgres does not use clustering by default; the table you create is not clustered like in MySQL. In MySQL you don't get a choice: the table is always clustered by the primary key, so you get this behavior if your primary key is a UUID. But in Postgres, the primary key is effectively just a secondary index that points to the row directly in the heap; it just happens to be unique. When you do that, there isn't really a problem, because the table itself is not ordered, it's not clustered: new inserts always go at the end, into free space. It's not like the database has to say, oh, this row has to go into this particular page, because it needs to maintain order. And even that being said, that's default Postgres; even if you did cluster a table in Postgres, Postgres does not maintain order in a clustered table. Yeah, I was surprised too when I learned this piece of information. Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered. So inserts are always fast in Postgres; they're not going to go into the "right" place. You might say that's not really clustering, is it? It clusters existing data, but new changes, new inserts, just go to the end. So if Yves did the same graph on Postgres, it's not going to look the same; I think it would look basically flat, because there's no such problem. And if you do want to cluster it again, you just call CLUSTER, I think that's what the command is called; calling CLUSTER again will re-cluster the index and reorganize those rows and tuples back to where they should belong.

Essentially, this whole thing is just very interesting to me. Very well written article; I'm going to link it below. Guys, what do you think, do you use GUIDs in your tables? Let me know in the comment section below. Good article, and by the way guys, if you don't follow Percona, they produce high-quality content on their blog and their YouTube channel, check them out. They talk about not just MySQL (they specialize in MySQL, if you've noticed most of their articles are MySQL), but also MongoDB benchmarks, Postgres, MariaDB. If you're into this stuff, check them out. I'm going to see you in the next one. You guys stay awesome. Goodbye.
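For completeness, a minimal Postgres sketch of the behavior described above, assuming PostgreSQL 13 or newer (for the built-in gen_random_uuid()) and the default name Postgres gives the primary key index:

```sql
-- Postgres: the primary key is an ordinary secondary B-tree index; the table itself
-- is a heap, so inserts go into free space regardless of how the key sorts.
CREATE TABLE orders_pg (
    id      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    payload text
);

-- CLUSTER physically rewrites the heap in index order, but only once; later inserts
-- and updates are not kept in that order, so it has to be re-run to re-cluster.
CLUSTER orders_pg USING orders_pg_pkey;
```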
Info
Channel: Hussein Nasser
Views: 13,145
Keywords: hussein nasser, backend engineering, mysql, postgres, mysql vs postgres, clustering, uuid mysql, uuid guid, uuid in mysql slow, uuid vs int mysql, uuid in postgres
Id: Y5mWz4vK10A
Length: 20min 12sec (1212 seconds)
Published: Thu Feb 04 2021