How Discord Stores Trillions of Messages | Deep Dive

Captions
Discord posted a new blog a few days ago about a very interesting problem they ran into: how they store trillions of messages. If you don't know, Discord is a chat and messaging app that has become really popular, especially in the gaming and live-streaming space, where it is the de facto communication channel for text chat and audio calls; I've even seen YouTube podcasts use Discord calls for their discussions. Back in 2017 I covered how they moved from MongoDB as their primary storage to Cassandra, and that was a very interesting blog they wrote at the time. Now they have written another one about how they store trillions of messages: they have moved from billions to trillions, and they are essentially moving from Cassandra as their main storage engine and database to ScyllaDB. In this video I want to go through the sections where they discuss the problems with Cassandra, the challenges they faced, and then what they did to solve those problems and migrate everything from Cassandra to ScyllaDB. So let's analyze this and get started.

"How Discord stores trillions of messages. In 2017, we wrote a blog post on how we store billions of messages." I talked about that one; I'll reference the video and the podcast below for those listening. "We shared our journey of how we started out using MongoDB but migrated our data to Cassandra because we were looking for a database that was scalable, fault tolerant, and relatively low maintenance." Those are the properties they were after, and the reason they didn't start with a relational database is that they wanted a distributed database from day one. "Scalable" is the word they used, and I want to dwell on that a little. While you can shard a relational database, you can definitely partition relational data across multiple instances, doing so commonly causes you more problems than it solves. The reason is simple: transactions. You can't do transactions across shards effectively, because transactions, by design, the multi-version concurrency control machinery, the ACID properties, all of that lives at the level of a single instance; it doesn't just spread across machines. I know people will say you can do two-phase or three-phase commit, and of course there is always a way, but the simplicity and predictability come from a single instance where all the data lives in one place.

That doesn't mean you have to run a relational database on one instance. You can scale reads horizontally with replication: one master writer receives the writes and many read replicas follow it; as the master applies writes, the WAL (transaction log) changes are shipped down to the replicas, which are eventually consistent, so to speak, and catch up. That is the state of the art as we have it. In that model every replica still holds all the data; there is no "this database has half the data and that one has the other half." If you want to go further, you can shard: actually split the data into multiple databases, placing rows based on a certain sharding key.
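To make that concrete, here is a minimal, hedged sketch (not anything from Discord's stack) of what key-based sharding looks like: a sharding key deterministically picks the instance that owns a row. The channel_id key and the four-shard layout are made up for illustration; real systems usually use hash or token ranges rather than plain modulo, because modulo placement forces most rows to move whenever the shard count changes.

```rust
// Toy shard routing: the sharding key decides which database instance owns the row.
fn shard_for(sharding_key: u64, num_shards: usize) -> usize {
    // Plain modulo placement, purely for illustration.
    (sharding_key % num_shards as u64) as usize
}

fn main() {
    let shards = ["db-0", "db-1", "db-2", "db-3"]; // hypothetical instances
    let channel_id: u64 = 123_456_789;             // hypothetical sharding key
    let target = shards[shard_for(channel_id, shards.len())];
    println!("queries for channel {channel_id} go to {target}");
}
```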
If a query carries that sharding key, it hits exactly that instance and reads the data it wants. MongoDB, Cassandra, and ScyllaDB start with that model by default: a sharded, partitioned model where data is spread out, fault tolerant, and, as they call it, scalable. Each node holds less data, and as a result your queries are faster when you hit the right node. As I always say, the goal of every query is to work with as little data as possible: instead of searching trillions of rows you search millions, and if you work with less data and have good indexes, you can narrow that result set further to exactly what you want. That is the whole trick behind databases, and you get there with indexes, partitioning, and sharding.

MongoDB and Cassandra do that by default: you have a partition key, and based on the partition key you hit the node that has the data. Of course, one node holding a slice of the data is dangerous on its own: what if that node dies? That is why the data is replicated to multiple nodes, usually three, and reads and writes are done with a quorum. That is basically the state of the art as we have it.

All right, with that said: "We wanted the database to grow alongside us, but hopefully its maintenance needs wouldn't grow alongside our storage needs. Unfortunately, we found that not to be the case: our Cassandra cluster exhibited serious performance issues that required increasing amounts of effort to just maintain, not improve." Let's unpack that. "We store our messages in a database called cassandra-messages." That is the database holding all the messages. In 2017 they ran 12 Cassandra nodes storing billions of messages, which is when we discussed the original article. At the beginning of 2022 they had 177 nodes storing trillions of messages. Trillions: thousands of billions. That's nuts. "To our chagrin" (is that even a word? chagrin: distress, okay) "it was a high-toil system. Our on-call team was frequently paged for issues with the database, latency was unpredictable, and we were having to cut down on maintenance operations that became too expensive." So why was latency unpredictable?

There is a CQL statement in the post that got chopped, I think by the Safari reader view, so I'll describe it and then go back to the article. We're looking at their messages table. It includes the channel_id, a unique identifier for the channel on the server, and a bucket, which is literally an integer representing a time window. The bucket is critical: it says this message lived in this time window, because you're going to have a lot of messages in the same window and you usually want them stored next to each other, especially when they belong to the same channel. Why? Because if you read a certain message from a certain time, you most probably also want the messages that came right before and after it (direct messages are a different system, but for a public channel this holds). Messages from the same channel should live together, especially close in time, because of how reads work: whether the engine uses B+ trees or the LSM tree that Cassandra and ScyllaDB use, you want to issue an I/O and get as much out of that I/O as possible.
You don't want to issue multiple I/Os to read ten messages; you want a single I/O that brings back a thousand, even twenty thousand messages if possible, all packed nicely into one page. That is the goal of a database, and if you manage it, you've won. And that's what they're doing here; the data modeling is beautiful. The table also has a message_id, which we'll talk about in a minute: a unique identifier for the message, plus the author and the content. Notice that the primary key is the channel_id, then the bucket, then the message_id; all three together form the primary key, which defines the clustering of the table. The clustering is critical because it determines how the data is laid out on disk: ordered by channel, bucket, and message ID. Sweet.

Now back to the article. "The CQL statement above is a minimal version of our message schema. Every ID we use is a Snowflake." It's not just a monotonically increasing sequence; a plain sequence doesn't buy you much. You want more information packed into your IDs, like time, and that is what a Snowflake gives you. So why not just use UUIDs or GUIDs? GUIDs are unique, which is nice, but they're random, and random is terrible for storage, because randomly generated IDs have no relationship to each other. If you cluster on a random UUID, the database sorts values whose order means nothing, and when you turn around and read them you'll be jumping all over the place. There is a better flavor called ULID, lexicographically sortable unique IDs, which Shopify uses and which I've covered: IDs generated close together in time sort next to each other, so a query for one of them tends to pull its neighbors into the same page, and things that sit next to each other actually belong together. Shopify used ULIDs to ensure idempotency for orders: every order request gets a ULID, and because requests arrive one after another, their IDs land next to each other. To check whether an order is a duplicate, they read for that ULID, and they almost always hit that neat tail page where the most recent IDs live, a page that will be cached. It's almost impossible for a duplicate check to involve an order submitted three days ago; you check for duplicates within seconds or minutes, and those entries are guaranteed to be near each other. That's the beauty of it.

Discord is doing something similar here: they are clustering, grouping related things together. "We partition our messages by the channel they are sent in, along with a bucket, which is a static time window."
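To make the ID and bucket idea concrete, here is a small, hedged sketch of how a time bucket can be derived from a Snowflake message ID. The Discord epoch below is the publicly documented one (January 1, 2015), but the 10-day bucket width, the helper names, and the example ID are assumptions for illustration, not Discord's actual code.

```rust
// Snowflakes encode a timestamp in their top bits; a bucket is just a coarse
// time window derived from that timestamp, so messages sent around the same
// time in the same channel share a partition and end up stored together.
const DISCORD_EPOCH_MS: u64 = 1_420_070_400_000; // 2015-01-01T00:00:00Z in Unix ms
const BUCKET_WINDOW_MS: u64 = 10 * 24 * 60 * 60 * 1000; // assumed ~10-day window

fn snowflake_timestamp_ms(snowflake: u64) -> u64 {
    // The upper 42 bits are milliseconds since the Discord epoch.
    (snowflake >> 22) + DISCORD_EPOCH_MS
}

fn bucket_for(snowflake: u64) -> u64 {
    (snowflake_timestamp_ms(snowflake) - DISCORD_EPOCH_MS) / BUCKET_WINDOW_MS
}

fn main() {
    let message_id: u64 = 1_083_773_141_778_448_384; // an arbitrary example snowflake
    // Partition key: (channel_id, bucket); clustering column: message_id.
    println!(
        "timestamp = {} ms, bucket = {}",
        snowflake_timestamp_ms(message_id),
        bucket_for(message_id)
    );
}
```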
This partitioning means that in Cassandra, all messages for a given channel and bucket are stored together and replicated across three nodes; that three is the replication factor. Every message within the same channel and bucket, and a burst of messages in one time window all falls in the same bucket, effectively goes to the same set of nodes.

"Within this partitioning lies a potential performance pitfall: a server with just a small group of friends tends to send orders of magnitude fewer messages than a server with hundreds of thousands of people." Same channel, same time window, one server produces a massive burst of traffic compared to another. That is what they call a hotspot: one bucket of time, maybe in the evening, sees a flood of reads and writes, and that creates a hot partition. Why is that a problem? Let's keep reading: "In Cassandra, reads are more expensive than writes." To talk about that, we need to understand the difference between what Cassandra uses, a log-structured merge tree (LSM), and what most relational databases use, B+ trees.

A B+ tree is a data structure used to index your table, and it can also store the actual rows, sorted, in what we call leaf pages. Assuming a clustered primary-key index, each leaf page stores full rows ordered by the primary key: row one and all its columns, row two and all its columns, and so on, until the fixed-size page is full; then the next leaf page starts, and the leaf pages are linked together so you can traverse back and forth. Above the leaves sit a root node and intermediate nodes used to navigate quickly to the page that holds the row you're looking for. Very simple: if you're looking for row 55, you start at the root, and the root says rows 0 to 10 are under this child, 11 to 20 under that one, and so on; 55 falls in one of those ranges, so you descend into that internal node, repeat the comparison, and end up at the leaf page you want. That's the B+ tree in a nutshell. In a non-clustered (secondary) index, the leaf pages don't hold actual rows; they hold pointers to the table, like the tuple ID in Postgres (a page ID plus the tuple's index within it) or, in MySQL and other databases, the primary key value itself.

How does an insert work? You find which leaf page the row belongs in and insert it right into that page, in memory, hopefully with free space, and in the correct position. Say the existing values are 1, 2, and 7 and you insert 3: the 3 has to land right after the 2 and right before the 7. You have to maintain the order as you insert, because a clustered index, at the end of the day, has to stay ordered.
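Here is a tiny, hedged sketch of that root-to-leaf walk, just to illustrate the shape of the lookup; real engines work with fixed-size on-disk pages (8 KB in Postgres, 16 KB in MySQL) rather than in-memory vectors, and this is not any particular database's code.

```rust
// A toy B+ tree lookup: descend from the root through internal nodes to the
// leaf page whose key range contains the key, then binary-search inside it.
enum Node {
    Internal { keys: Vec<u64>, children: Vec<Node> }, // keys[i] separates children[i] and children[i+1]
    Leaf { rows: Vec<(u64, String)> },                // (primary key, row payload), kept sorted
}

fn lookup(node: &Node, key: u64) -> Option<&String> {
    match node {
        Node::Internal { keys, children } => {
            // Pick the child whose range covers `key`.
            let idx = keys.partition_point(|k| *k <= key);
            lookup(&children[idx], key)
        }
        Node::Leaf { rows } => rows
            .binary_search_by_key(&key, |(k, _)| *k)
            .ok()
            .map(|i| &rows[i].1),
    }
}

fn main() {
    let tree = Node::Internal {
        keys: vec![10, 20],
        children: vec![
            Node::Leaf { rows: vec![(1, "row 1".into()), (7, "row 7".into())] },
            Node::Leaf { rows: vec![(11, "row 11".into()), (15, "row 15".into())] },
            Node::Leaf { rows: vec![(21, "row 21".into()), (55, "row 55".into())] },
        ],
    };
    println!("{:?}", lookup(&tree, 55)); // Some("row 55")
}
```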
When you do that, you also rely on a persistence mechanism such as the write-ahead log, the commit log, some call it the transaction log, to durably record the change. Committing just the log record is enough; you don't have to flush the page to disk right away, but eventually you will. And in a B+ tree, flushing a page is an update, an actual in-place write: you tell the OS, here is this page, go overwrite it. What does that mean? The page is a fixed size, it lives in a file on disk, it starts at a particular offset and ends at a particular offset, so you issue a write that says: put this new content at position 2000, for 8 kilobytes (or 16 KB in MySQL's case). That is an in-place update.

So what's wrong with updates? In the days of hard drives, an in-place update really was in place: the drive seeks to the sector and overwrites whatever was there. On SSDs there is no such thing as an in-place update; you don't just overwrite cells, because of how NAND works. To update something on an SSD, you address it by its logical block address (LBA), which is what the file system's offsets map down to, and that in turn is translated by the drive, whether it speaks NVMe or SATA. When you update a given LBA, the controller can't overwrite the existing physical location, so it does three things: it writes your data to a fresh page in another block (one write), it marks the old physical page as invalid (another write's worth of bookkeeping), and it updates an in-memory DRAM mapping so the same LBA the OS keeps using now points to the new physical location. The OS keeps working with the same logical address while the physical location keeps moving; every update relocates the data.

If you do this a lot, it becomes a disaster, because you are left with stale, invalid pages everywhere, and heavy updating fills the drive with invalid data very quickly. Invalid pages can't simply be reused: the whole erase unit has to be erased before it can be written again, and that erase is an expensive operation handled by the SSD's garbage collection. To be clear, this garbage collector has nothing to do with Java or Cassandra's garbage collector; it is the SSD's own. In the easy case, when you need space and there is a fully stale block, the controller erases it (one operation) and writes your data (another): two operations, not terrible. (I know I'm going all over the place here, but believe me, it's all related; that's why we do these deep dives. If all you wanted was a summary, why would we even be here? Let's talk seriously about these things.) Where it gets really bad is when an erase unit holds a mix of valid and invalid data, because now the controller can't just erase it. It has to move the valid pages to another block (writes), erase the block, and then write your new data. Your single logical update has turned into several physical operations; that multiplier is what's called write amplification. It steals bandwidth, because the device is doing I/O that isn't your I/O, and it needs spare space to shuffle data around, which is what over-provisioning is for. That's a quick lesson on SSDs and why in-place updates are not great in the long run. With B+ trees on SSDs this gets exacerbated by page splits: as the tree grows, nodes split, pages split, and all of those splits are just more updates, more writes, more deletes. And when I say "really bad," we're talking about wear that shows up over years; that's the scale at which you see the difference.
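As a back-of-the-envelope illustration of that amplification (a simplified model, not a measurement of any real drive): if the garbage collector reclaims a block of P pages that still holds V valid pages, it must rewrite those V pages elsewhere, and only the remaining P - V pages become available for new host data, so each host write costs roughly P / (P - V) physical writes.

```rust
// Simplified greedy-GC write amplification: physical writes per host write
// when reclaiming a block that still holds `valid_pages_in_victim` live pages.
fn write_amplification(pages_per_block: u32, valid_pages_in_victim: u32) -> f64 {
    let freed_for_host = (pages_per_block - valid_pages_in_victim) as f64;
    let physical_writes = pages_per_block as f64; // relocated valid pages + new host pages
    physical_writes / freed_for_host
}

fn main() {
    // A mostly-valid victim block: every host write costs about 4 physical writes.
    println!("{:.1}", write_amplification(128, 96));
    // Purely sequential, append-only writes leave fully stale blocks behind: no amplification.
    println!("{:.1}", write_amplification(128, 0));
}
```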
So people invented the log-structured merge tree, the LSM, which Cassandra uses and so does ScyllaDB. The main idea of an LSM is that everything is a write to somewhere new and the structures on disk are immutable; you rarely rewrite anything, and when you do, it happens in a step called compaction. Writes go to an in-memory structure, the memtable, and to the commit log (the transaction log) for durability; as you make changes it is write, write, write. When the memtable fills up, you don't flush it as is: you sort it and flush it to disk as a sorted string table, an SSTable. Sort and flush. When you flush, it is always a brand-new file; you never, ever update an SSTable in place. As more and more SSTables accumulate, that first level fills up, so the engine reads a group of them, based on certain criteria, and writes them out as a new, larger set of tables at the next level, and so on; everything remains an insert.

What's the good part? Writes are extremely fast. The SSD doesn't have to do all that invalidation shuffling; everything you write is actually your data, always going to a new place, and if it's written sequentially, SSDs love that, a nice sequential write. But think about what users actually do: they insert, update, and delete, and in an LSM all of those are translated into inserts. An insert is an insert, an update is an insert, and a delete is also an insert: it writes what's called a tombstone.
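Here is a minimal toy LSM sketch of exactly that write path, a memtable that is sorted and flushed into immutable runs, with deletes recorded as tombstones. It's an illustration of the idea, not Cassandra's or ScyllaDB's code; the flush threshold and types are made up, and there is no commit log, compaction, or Bloom filter here.

```rust
use std::collections::BTreeMap;

enum Entry {
    Value(String),
    Tombstone, // a delete is just another write that shadows older versions
}

#[derive(Default)]
struct ToyLsm {
    memtable: BTreeMap<u64, Entry>,   // in-memory, kept sorted by key
    sstables: Vec<Vec<(u64, Entry)>>, // flushed, immutable sorted runs (newest last)
    memtable_limit: usize,
}

impl ToyLsm {
    fn put(&mut self, key: u64, value: &str) {
        self.memtable.insert(key, Entry::Value(value.to_string()));
        self.maybe_flush();
    }

    fn delete(&mut self, key: u64) {
        self.memtable.insert(key, Entry::Tombstone);
        self.maybe_flush();
    }

    fn maybe_flush(&mut self) {
        if self.memtable.len() >= self.memtable_limit.max(1) {
            // Sort-and-flush: the memtable becomes a new immutable SSTable; nothing is rewritten.
            let sstable: Vec<_> = std::mem::take(&mut self.memtable).into_iter().collect();
            self.sstables.push(sstable);
        }
    }

    fn get(&self, key: u64) -> Option<&str> {
        // The expensive side: check the memtable, then every run from newest to oldest.
        let newest = self.memtable.get(&key).or_else(|| {
            self.sstables.iter().rev().find_map(|run| {
                run.binary_search_by_key(&key, |(k, _)| *k)
                    .ok()
                    .map(|i| &run[i].1)
            })
        });
        match newest {
            Some(Entry::Value(v)) => Some(v.as_str()),
            _ => None, // a tombstone, or the key was never written
        }
    }
}

fn main() {
    let mut lsm = ToyLsm { memtable_limit: 2, ..Default::default() };
    lsm.put(1, "hello");
    lsm.put(1, "hello, edited"); // an update is just another insert
    lsm.delete(1);               // and so is a delete
    lsm.put(2, "unrelated");
    println!("{:?}", lsm.get(1)); // None: the newest entry is a tombstone
    println!("{:?}", lsm.get(2)); // Some("unrelated")
}
```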
So if you keep inserting, updating, and deleting the same message, you keep creating new records and persisting them in new SSTables. Now, to find the version you actually want, reads have become very slow: you look in the memtable, then in the newest SSTable, and if it isn't there you go to the next one, and the next, until you find the value you're looking for, or a tombstone telling you the thing was deleted and you can stop. Both Cassandra and ScyllaDB soften this with a technique called Bloom filters: a compact bitmap stored per SSTable that you can check quickly, before actually opening the table, to know whether your record could possibly be in there. If the filter says no, it definitely isn't, and you skip the file.
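Here is a minimal Bloom filter sketch to show the trick: a few hash functions set bits on insert, and a lookup that finds any of those bits unset knows the key is definitely absent (false positives are possible, false negatives are not). The sizes and the use of Rust's DefaultHasher with seeds are illustrative choices, not what Cassandra or ScyllaDB actually use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        BloomFilter { bits: vec![false; num_bits], num_hashes }
    }

    fn bit_index<T: Hash>(&self, item: &T, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for seed in 0..self.num_hashes {
            let i = self.bit_index(item, seed);
            self.bits[i] = true;
        }
    }

    /// false: definitely not in this SSTable, skip it. true: maybe there, go read it.
    fn might_contain<T: Hash>(&self, item: &T) -> bool {
        (0..self.num_hashes).all(|seed| self.bits[self.bit_index(item, seed)])
    }
}

fn main() {
    let mut per_sstable_filter = BloomFilter::new(1024, 3);
    per_sstable_filter.insert(&(42u64, 7u64)); // e.g. a (channel_id, bucket) partition key
    assert!(per_sstable_filter.might_contain(&(42u64, 7u64)));
    println!(
        "other partition possibly here? {}",
        per_sstable_filter.might_contain(&(999u64, 1u64))
    );
}
```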
With that explained, we know the benefit of the LSM and also its cost: reads got slower, so there is a step called compaction, and compaction is where updates and deletes really happen. SSTables get merged, duplicate versions of the same row collapse into one, tombstones and the data they shadow get dropped; a thousand versions of the same thing become a single record, and the old files are deleted as new, compacted ones are written. Even that delete, at the SSD level, is mostly marking LBAs invalid, and it happens far less often than the constant in-place page updates a B+ tree does. That's the trade. I personally can't speak to how much B+ trees versus LSMs really wear out SSDs in practice; I only hear things, and I don't believe anything I hear unless I see it myself. But that's how it goes with us humans: we have a problem, we create a solution, and that solution almost always creates other problems. That's exactly what we have here.

All right, with that rant, let's jump back into the post. "In Cassandra, reads are more expensive than writes. Writes are appended to a commit log and written to an in-memory structure called a memtable that is eventually flushed to disk. Reads, however, need to query the memtable and potentially multiple SSTables (on-disk files), a more expensive operation. Lots of concurrent reads as users interact with servers can hotspot a partition, which we refer to imaginatively as a hot partition. The size of our dataset, when combined with these access patterns, led to struggles for our cluster." So what happens is that most queries land on the same partition, the same channel and bucket, and that node becomes extremely busy because all of the queries go to it. It's the tail problem we talked about, and hot data is normally a good thing: if the database has good caching, it just serves those reads from memory. But here the hot rows are scattered across the memtable and several SSTables, so the node ends up doing disk I/O; the SSTables can't all live in memory, they live on disk. I don't claim to know every detail of these LSM engines, but if this were a B+ tree, the hot leaf page would be sitting right there in the buffer pool, one page with everything in it, no walking through multiple files; a hotspot would mostly hit the cache. With the LSM you do more I/Os because the answer is spread across SSTables.

Now you might say: just compact. You can't compact fast enough to keep up, because these are brand-new writes, and the moment they are written people want to read them. What's the first thing that happens after you post a message on Discord? Other clients read that same message. In a relational database this is the best-case scenario: you just wrote it, it's hot in memory, sure, the page is dirty, but the read is served straight from the buffer pool without touching disk; that's the power of the B+ tree here. In this design, because writes are spread across the memtable and freshly flushed SSTables precisely to make writes fast, reads got significantly slower, and that's the core problem.

"When we encountered a hot partition, it frequently affected latency across our entire database. A channel and bucket pair received a large amount of traffic, and latency in the node would increase as the node tried harder and harder to serve traffic and fell further and further behind." One node simply can't keep up if all the traffic goes to it and it has to do I/O: serving reads, taking writes, and compacting at the same time, good luck. To be fair, so far this is a problem for both Cassandra and ScyllaDB; it's not unique to Cassandra. "Since we perform reads and writes with quorum consistency, all queries to the nodes that serve the hot partition suffer latency increases, resulting in broader end-user impact." At the end of the day, users feel it.

"Cluster maintenance tasks also frequently caused trouble. We were prone to falling behind on compactions, where Cassandra would compact SSTables on disk for more performant reads. Not only were our reads then more expensive, but we'd also see cascading latency as a node tried to compact." There it is again: when did we ever have to run compaction with B+ trees? Never. We introduced a solution, and the solution brought its own problems. They also describe an operation they called the "gossip dance," where they would take a node out of rotation so it stopped taking traffic, let it compact in peace, bring it back in to pick up hints from Cassandra's hinted handoff, and then repeat until the compaction backlog was empty. (I do wonder what happens to the quorum while a node is out of rotation; I suppose the other replicas cover it.)
So the gossip dance is essentially a way to let compaction catch up when it can't keep pace with the SSTables being written. Next they talk about Cassandra's garbage collection, which is a valid criticism, so let's talk about that a little. "We also spent a large amount of time tuning the JVM's garbage collector and heap settings, because GC pauses would cause significant latency spikes." That's understandable. Cassandra is written in Java, a garbage-collected language, and as you write you allocate on the heap, especially with those large memtables. When objects no longer have anything referencing them, the garbage collector kicks in to reclaim that memory so it can be reused or released back to the operating system, and that work introduces pauses. I'm not a GC expert, but as I understand it the collector has to synchronize with the application, effectively stopping it at points, and those stop-the-world pauses are what hurt you while you're trying to allocate and serve requests at the same time. GC pauses are famously what pushed Linkerd, the service mesh proxy, to move off the JVM to Rust: they wanted a garbage-collection-free runtime for the sake of predictability, and they saw significant performance improvements. So that's a valid criticism of Cassandra, and I completely understand it.

The rest of the I/O story, though, I don't immediately see how ScyllaDB is better. You can argue: we're switching from Cassandra to ScyllaDB because Scylla is written in C++ and there is no garbage collector. That's nice, that's fast, but how is it better beyond that? I don't know the answer without researching both engines properly. Maybe the compaction strategies are different; maybe there are specific features in Scylla that Cassandra doesn't have. That might be the case, but it isn't spelled out here. Let's continue reading.

"Changing our architecture: our messages cluster wasn't our only Cassandra database. We had several other clusters, and each exhibited similar (though perhaps not as severe) faults. In our previous iteration of this post, we mentioned being intrigued by ScyllaDB, a Cassandra-compatible database written in C++. Its promise of better performance, faster repairs, stronger workload isolation via its shard-per-core architecture, and a garbage-collection-free life sounded quite appealing." The shard-per-core bit is interesting: each shard gets its own CPU core to work on. "Although ScyllaDB is most definitely not devoid of issues" (I love that they say this; they're not being defensive) "it is devoid of a garbage collector, since it's written in C++ rather than Java. Historically, our team had many issues with garbage collection on Cassandra, from GC pauses affecting latency all the way to super long consecutive GC pauses that got so bad that an operator would have to manually reboot and babysit the node back to health." Wow. That is really bad: a node effectively dying because the garbage
collector can't keep up. I can understand that. "These issues were a huge source of on-call toil and the root of many stability issues with our messages cluster. After experimenting with ScyllaDB and observing improvements in testing, we made the decision to migrate all of our databases. While this decision could be a blog post in itself, the short version is that by 2020 we had migrated every database but one to ScyllaDB." Everything except cassandra-messages, the core messages database. "Why hadn't we migrated it yet? To start with, it's a big cluster, with trillions of messages and nearly 200 nodes" (177, to be exact) "and any migration was going to be an involved effort. Additionally, we wanted to make sure our new database could be the best it could be as we worked to tune its performance. We also wanted to gain more experience with ScyllaDB in production, using it in anger and learning its pitfalls. We also worked to improve ScyllaDB performance for our use cases."

That last part matters: they worked on Scylla itself, because if you just swap Cassandra out for Scylla and change nothing else, I suppose you get somewhat better performance, but the hotspot problem is identical. Scylla still has to absorb the flood of writes at the tail, those writes still land across multiple SSTables, and clients immediately turn around and read them, because that's how Discord is used: you write a bunch of messages and everyone reads those same messages right away. The tail is dirty and spread across the memtable and several SSTables, so you still pay for many reads. In my opinion, blindly replacing Cassandra with Scylla would not have bought them much on its own. And to their credit, they say exactly that: hot partitions can still occur in ScyllaDB, "so we also wanted to invest in improving our systems upstream of the database to help shield and facilitate better database performance." So I take back my earlier grumbling: they knew that swapping databases alone would leave the same hot-partition problem at the tail, and that the architecture had to change. What did they do? They introduced what they call data services for serving data.

That's the interesting part: how do reads actually happen? How does Discord read from Cassandra or Scylla? There's an API, but does it do any caching at all? What they built here is brilliant, and I absolutely love it. "With Cassandra, we struggled with hot partitions. High traffic to a given partition resulted in unbounded concurrency, leading to cascading latency in which subsequent queries would continue to grow in latency." You'd struggle with this on Scylla too, let's be honest; it's the same fundamental architecture. The node just can't keep up: it's busy walking SSTable after SSTable while the reads behind them pile up in the queue.
The OS will try to merge and combine those queued reads as much as it can, that's what the file system and I/O scheduler do, but it still can't keep up, and the SSD's bandwidth ends up completely saturated. "If we could control the amount of concurrent traffic to hot partitions, we could protect the database from being overwhelmed. To accomplish this task, we wrote what we refer to as data services: intermediary services that sit between our API monolith and our database clusters." So now there is a new layer, data services, that didn't exist before. What does a data service do? "When writing our data services, we chose a language that we'd been using more and more at Discord: Rust." Okay, we get it, everybody loves Rust. "It should make it easy to write safe, concurrent code, and its libraries were a great match for what we intended to accomplish": asynchronous I/O, plus driver support for both Cassandra and ScyllaDB. "Our data services sit between the API and our ScyllaDB clusters. They contain roughly one gRPC endpoint per database query and intentionally contain no business logic." Good. "The big feature our data services provide is request coalescing." Think of it as grouping, and that is the key idea here.

For those listening, the diagram shows several concurrent requests for the same identical message: same channel, same bucket, same message ID. Previously those would have been that many separate concurrent reads against Cassandra; now this intermediate layer receives them and coalesces them. "If multiple users are requesting the same row at the same time, we'll only query the database once. The first user that makes a request causes a worker task to spin up in the service. Subsequent requests will check for the existence of that task and subscribe to it." What a beautiful design. Kudos; I absolutely love it. My one question: I wonder what would have happened if they had put this in front of Cassandra alone, not that they shouldn't have moved, but how much would it have saved, and would Cassandra have held up, or would the GC pauses still have killed them? I can't answer that, and I suppose it will remain unanswered. In any case the final configuration is far better because they went all the way and changed the architecture to include this layer. It's almost a cache: they don't really talk about caching results, I think, but once you can coalesce requests, grouping many callers behind one database query, you could also cache the result. The worker-task idea is what makes it elegant: the in-flight work lives in memory, and as long as it's alive, it means somebody just asked for that exact row.
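Here is a small, hedged sketch of the request-coalescing idea: a thread-based "single flight" where the first caller for a key spawns the one database query and later callers subscribe to its result. Discord's actual implementation is an async Rust service behind gRPC endpoints; the threads, channels, key shape, and the fake query function below are simplifications for illustration only.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type Key = (u64, u64, u64); // assumed (channel_id, bucket, message_id)
type Value = String;

// In-flight queries: key -> subscribers waiting on the same row.
#[derive(Clone, Default)]
struct Coalescer {
    inflight: Arc<Mutex<HashMap<Key, Vec<Sender<Value>>>>>,
}

impl Coalescer {
    fn get(&self, key: Key) -> Value {
        let (tx, rx) = channel();
        let spawn_worker = {
            let mut inflight = self.inflight.lock().unwrap();
            match inflight.get_mut(&key) {
                Some(subs) => {
                    subs.push(tx); // someone is already fetching this row: subscribe
                    false
                }
                None => {
                    inflight.insert(key, vec![tx]); // we are first: spin up the worker
                    true
                }
            }
        };
        if spawn_worker {
            let inflight = Arc::clone(&self.inflight);
            thread::spawn(move || {
                let value = query_database(key); // one database round trip for N callers
                let subs = inflight.lock().unwrap().remove(&key).unwrap();
                for sub in subs {
                    let _ = sub.send(value.clone());
                }
            });
        }
        rx.recv().unwrap()
    }
}

fn query_database(key: Key) -> Value {
    thread::sleep(Duration::from_millis(50)); // pretend this is the ScyllaDB read
    format!("message {} from channel {}", key.2, key.0)
}

fn main() {
    let coalescer = Coalescer::default();
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let c = coalescer.clone();
            thread::spawn(move || println!("{}", c.get((42, 1, 7))))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```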
Here's the example from the post: "Let's imagine a big announcement on a large server that notifies @everyone: users are going to open the app and read the message." Someone writes one message and a hundred thousand people read that same message; I suppose each client gets a notification with the ID, requests that message, and suddenly there's a flood of identical queries. In the old system that was a hundred thousand queries to the database; in the new system it is dramatically fewer, probably not literally one, since a worker only coalesces the requests that arrive while it's in flight, but a tiny fraction. "Previously, this might have caused a hot partition, and on-call would potentially need to be paged to help the system recover. With our data services, we're able to significantly reduce traffic spikes against the database." The first request spins up the worker, everything arriving concurrently hooks onto it, and they all get the same answer. What I'd still love to know: once the worker has its response and returns it, is it gone? Do they cache the result for subsequent queries, or does the next identical request create a brand-new worker? They don't say.

The second part of the magic, and this really is magic, I love it, is upstream of the data services: "We implemented consistent-hash-based routing to our data services to enable more effective coalescing." That's the piece I hadn't thought about: load balancing. If you just put a normal load balancer in front of multiple data service instances, requests for the same channel from clients all over the world get sprayed across different instances, and the chance that you land on the instance that already has that query in flight is very low. So instead, the routing layer hashes on the request, say the channel you're asking about, and consistently sends it to the same data service instance. Yes, a busy channel puts more load on that one instance, but now all of its readers converge there, and the odds of coalescing (or hitting any per-instance cache) go way up. Brilliant, just brilliant. I admit I skipped this part on my first read, and I'm appreciating it now.
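A hedged sketch of what consistent-hash routing can look like: a basic hash ring with virtual nodes, keyed by channel ID. Discord doesn't describe their exact scheme, so the ring construction, the virtual-node count, and the instance names here are illustrative only.

```rust
use std::collections::{hash_map::DefaultHasher, BTreeMap};
use std::hash::{Hash, Hasher};

// A minimal consistent-hash ring: each data-service instance gets several
// virtual points on the ring; a request for a channel is routed to the first
// point at or after the channel's hash, so the same channel always lands on
// the same instance, maximizing the chance of coalescing an in-flight query.
struct Ring {
    points: BTreeMap<u64, String>, // hash point -> data service instance
}

impl Ring {
    fn new(instances: &[&str], vnodes: u32) -> Self {
        let mut points = BTreeMap::new();
        for inst in instances {
            for v in 0..vnodes {
                points.insert(hash(&(inst, v)), inst.to_string());
            }
        }
        Ring { points }
    }

    fn route(&self, channel_id: u64) -> &str {
        let h = hash(&channel_id);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next()) // wrap around the ring
            .map(|(_, inst)| inst.as_str())
            .unwrap()
    }
}

fn hash<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let ring = Ring::new(&["data-service-1", "data-service-2", "data-service-3"], 64);
    // Every request for channel 42 hits the same data service instance.
    println!("{}", ring.route(42));
    println!("{}", ring.route(42));
}
```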
The improvements helped a lot, but they didn't solve everything: "We're still seeing hot partitions and increased latency on our Cassandra cluster, just not quite as frequently. It buys us some time so that we can prepare our new, optimal ScyllaDB cluster and execute the migration." Wait, Cassandra cluster? That confused me for a minute, because the diagram says "Scylla messages" and I assumed they had already moved. Reading it again, the data services were built and put in front of Cassandra first, and I apologize if I mixed that up earlier; the graphics are ahead of the text. So even with request coalescing and the consistent-hash-based routing, Cassandra still showed hot partitions and elevated latency, just less often, and that bought them time for the real migration.

"A very large migration: our requirements for the migration are quite straightforward. We need to migrate trillions of messages with no downtime, and we need to do it quickly, because while the Cassandra situation has somewhat improved, we're frequently firefighting." Step one: "We provision a new ScyllaDB cluster using our super-disk storage topology. By using local SSDs for speed and leveraging RAID to mirror our data to persistent disks, we get the performance of attached local disks with the durability of persistent disks." So the new cluster has a genuinely different storage setup than the old one, which I suspect helped the I/O story by itself.

And a thought while we're here: even with request coalescing, all you're doing is reducing the number of identical requests. I keep wondering what would happen if they cached the messages themselves, or the hot pages, if that concept even maps onto Cassandra; it's still a database at the end of the day. The obvious difficulty is edits: edit a message and you have to invalidate that cache everywhere. I've long suspected, and plenty of people disagree with me, that cache invalidation is part of why Twitter took so long to ship editing: they cache everything everywhere, and edits would be painful for that. The edit window is something like 30 minutes, and my guess is the most recent content is cached less aggressively, but I might be wrong; I don't know much about their architecture.

Back to the migration plan. "Our first draft of the migration was designed to get value quickly: we'd start using our shiny new cluster for newer data, cut over, and then migrate the historical data behind it. It adds more complexity, but what every large project needs is added complexity. We begin dual-writing new data." That's the standard approach: during the migration both clusters are alive, so every write going to Cassandra must also go to ScyllaDB; I suppose they wrote something that duplicates the writes, while in the background the historical data gets migrated. It takes a lot of care to get right, and once they had it set up, the estimate came back: about three months to migrate everything. That's a lot of time.
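As a hedged illustration of that dual-write step (the trait, the type names, and the print-backed stores are invented; this is not Discord's code), the shape is simply "apply every new write to both the old and the new store while the backfill runs."

```rust
// Sketch: keep two clusters in sync during a migration by writing to both.
trait MessageStore {
    fn write(&self, channel_id: u64, bucket: u64, message_id: u64, content: &str);
}

struct DualWriter<Old: MessageStore, New: MessageStore> {
    old: Old, // Cassandra, still the source of truth until cutover
    new: New, // ScyllaDB, being kept in sync
}

impl<Old: MessageStore, New: MessageStore> MessageStore for DualWriter<Old, New> {
    fn write(&self, channel_id: u64, bucket: u64, message_id: u64, content: &str) {
        self.old.write(channel_id, bucket, message_id, content);
        self.new.write(channel_id, bucket, message_id, content);
    }
}

struct PrintStore(&'static str);

impl MessageStore for PrintStore {
    fn write(&self, channel_id: u64, bucket: u64, message_id: u64, content: &str) {
        println!("[{}] ({channel_id},{bucket},{message_id}) <- {content}", self.0);
    }
}

fn main() {
    let store = DualWriter { old: PrintStore("cassandra"), new: PrintStore("scylla") };
    store.write(42, 7, 1_000_001, "hello");
}
```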
"This timeline doesn't make us feel warm and fuzzy inside" (I love how they write this) "and we'd prefer to get value faster. We sat down as a team and brainstormed ways we could speed things up, until we remembered that we'd written a fast and performant database library that we could potentially extend. We elected to engage in some meme-driven engineering and rewrite the data migrator in Rust." Oh no, more Rust. "In an afternoon, we extended our data service library to perform large-scale data migrations. It reads token ranges from a database, checkpoints them locally via SQLite, and then firehoses them into ScyllaDB." Clever. Rather than streaming each row straight across one request at a time, progress and work get handled locally and the data gets firehosed into the new cluster in bulk; I suppose that saves an enormous amount of per-request network overhead. Millions of writes become local work plus a few big, high-bandwidth transfers, as opposed to millions of individual requests crossing the network. A really nice batching approach; I love it. "We hooked up our new and improved migrator" and the new estimate was nine days. From three months down to nine days. That's amazing.
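One plausible reading of "checkpoints them locally via SQLite" is simply persisting which token ranges have been completed, so a crashed or restarted migrator can resume where it left off. Here is a hypothetical sketch of that idea using the rusqlite crate; the table name, the hand-picked demo ranges, and the placeholder migration function are all assumptions, not Discord's migrator.

```rust
// Resumable, checkpointed migration sketch using SQLite (via the `rusqlite` crate).
use rusqlite::{params, Connection};

fn migrate_token_range(start: i64, end: i64) {
    // Placeholder: the real migrator streams the rows owned by [start, end)
    // out of the old cluster and batch-writes them into the new one.
    println!("migrating token range [{start}, {end})");
}

fn main() -> rusqlite::Result<()> {
    let db = Connection::open("migration_checkpoint.db")?;
    db.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint (id INTEGER PRIMARY KEY CHECK (id = 1), last_token INTEGER NOT NULL)",
        [],
    )?;

    // Resume after the last completed range, or from the beginning on first run.
    let last_done: i64 = db
        .query_row("SELECT last_token FROM checkpoint WHERE id = 1", [], |row| row.get(0))
        .unwrap_or(i64::MIN);

    // For the demo, a handful of hand-picked ranges stands in for the full token ring.
    let ranges = [(-400_i64, -200_i64), (-200, 0), (0, 200), (200, 400)];
    for &(start, end) in ranges.iter().filter(|(s, _)| *s >= last_done) {
        migrate_token_range(start, end);
        // Persist progress locally so the migrator picks up where it left off after a crash.
        db.execute(
            "INSERT INTO checkpoint (id, last_token) VALUES (1, ?1)
             ON CONFLICT(id) DO UPDATE SET last_token = excluded.last_token",
            params![end],
        )?;
    }
    Ok(())
}
```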
"If we can migrate data this quickly, then we can forget our complicated time-based approach" and just cut over. So they cranked it up, and now they're moving 3.2 million messages per second. That is nuts. Several days later they gathered to watch it hit one hundred percent, and it sat at 99.9999%. The cliff notes of why: the migrator hit a partition full of tombstones. Tombstones are deleted messages, basically; as I said earlier, in a log-structured merge tree you never delete in place, you insert a tombstone, so if someone wipes a huge amount of messages you get a massive pile of tombstones, and this particular range had never been compacted, so the migrator just ground to a halt there. They stopped, compacted that data, got rid of the tombstones, and flipped over. Done, in May 2022, and it has been quite well behaved since; they're happy, and it's a much more efficient setup. They went from running 177 Cassandra nodes down to 72 ScyllaDB nodes; of course the Scylla nodes are bigger, 9 TB of disk each versus 4 TB on the Cassandra nodes. Their read tail latency improved from 40-125 milliseconds down to 15 milliseconds, and message-insert latency went from 5-70 milliseconds down to a steady 5 milliseconds. From 125 down to 15: that is so cool.

Those numbers also tell me they did something meaningful with the disks themselves; they sped up their I/O, not just the database. And here's a thought: what if they moved their SSDs to zoned namespaces (ZNS), the newer NVMe model? This workload looks like a natural fit. With zoned namespaces you largely do away with device-side over-provisioning and device-side garbage collection, because the zone is the erasable unit and it is managed by the host. It would be a big rewrite, of course, but SSTables map naturally onto zones: write a table into a zone, and when compaction retires it, reset the whole zone and start a new one. I suspect that would push these numbers even lower, though it sounds like they already did some clever things with locally attached SSDs and mirroring. RocksDB and Western Digital have done experiments with ZNS, and it's an area I'm still trying to learn properly, but it seems like a natural next step if they ever consider it.

They close the post with a fun graph from a big soccer final, where the spikes in message traffic line up with the goals and the cluster just rides through it, and they live happily ever after. It's a fantastic blog; I've read it multiple times and learned something new on each pass, and as we went through it together I walked back some of my earlier criticism. Cassandra genuinely isn't right for them anymore; the garbage collection alone is close to a deal breaker. But notice how much they did beyond swapping databases. First they added the data services, which didn't exist before, put them in front of Cassandra, and saw improvement, though hot partitions remained. Then they switched to ScyllaDB and kept the request coalescing together with the consistent-hash-based routing that groups requests. And they also changed the storage layer, the "super-disk" setup with local SSDs for speed and RAID mirroring to persistent disks for durability, which is something I need to read more about. Does that imply they didn't have local SSDs under Cassandra? Because if not, that alone would make Cassandra suffer. Again, I'm not defending Cassandra, I'm just questioning: had they done all of the same work on top of Cassandra, the one remaining problem, as far as we know, would be the GC pauses. I know ScyllaDB has compaction strategies that Cassandra may not have, but how much difference does that really make? I'll leave you with that final thought: it's the one experiment they didn't run, and I'm not saying they should have. At the end of the day this is a better, cleaner architecture, and if the garbage collector is causing you trouble, by all means get rid of it. It's just interesting to see everything they did to achieve this.
We learned a lot from this blog. I absolutely enjoyed reading it and analyzing it, and I hope you did as well. What do you think? Let me know in the comment section below, and I'll see you in the next one. A quick plug: if you're interested in this stuff, check out my database course; it's around 24 hours of content covering database fundamentals and database engineering. Head to database.husseinnasser.com to learn more. Thank you so much. Once again, a fantastic, well-written article with great technical detail: shout-out to the author, Bo Ingram, and everyone on the Discord engineering team for brilliant engineering work. I don't have any complaints; I'm really happy with this one. See you in the next one. Bye.
Info
Channel: Hussein Nasser
Views: 168,456
Keywords: hussein nasser, backend engineering, discord engineering, software engineering, database engineering, apache cassandra, scylladb
Id: xynXjChKkJc
Length: 68min 32sec (4112 seconds)
Published: Sat Mar 11 2023