DS201.17 Write Path | Foundations of Apache Cassandra

Video Statistics and Information

Captions
[Music] I'm Jamie King. You've probably heard that Cassandra ingests data at a high rate of speed, and that's because of the way Cassandra stores that data when it comes into a particular node. We call that path the write path. In most database systems, the way they store that data is proprietary information; it's not known. With Apache Cassandra, it is a wide-open book, and here in this section we can see how that works and see why it screams.

Here is our lone node we saw in previous sections. In comes a write: we're going to store some user data inside of our node. Let's zoom into the node and see exactly how this happens. At the top you'll see we have RAM, and at the bottom we have hard disk space. Here comes that same write. Let's just look at the data quickly: we have a key of 1, the name of the user is Dev Awesome, and this user lives in Texas, in the city of Houston. By the way, in this example all the users live in Texas, and we're clustering by the city.

When the record comes into Apache Cassandra, Apache Cassandra actually writes that value to both memory and the hard disk. The memory data structure we call the memtable; the hard disk data structure we call the commit log. The key difference between the memtable and the commit log is that the memtable is always ordered by partition key and then by clustering column, whereas the commit log is stored sequentially: every record will just append to the end of the commit log. At that point Cassandra acknowledges to the client that the write was successful, and we're done. It's really that simple. Apache Cassandra writes the record both to memory and to the hard disk, and then acknowledges back to the client that, hey, I got your data, give me another piece.

Let's look at another insertion. This user's name is Come to DSE, short for DataStax Enterprise. They also live in Texas, as I promised, in the city of Dallas. Apache Cassandra writes the record to the end of the commit log. Again, it's just a sequential write: we just append, append, append, append, append on the hard disk. That's easy to do. And in the memtable, Apache Cassandra inserts that record where it belongs as far as clustering column order is concerned. So Dallas: D comes before H as far as alphabetical order is concerned, or lexicographical order (say that 10 times). And again we acknowledge back to the client that, hey, life is good.

Here comes another record. I'm going to have you pause the video here and think: where's this record going to end up, as far as the memtable is concerned, and the commit log? Pause the video, take a second, we're here all day. Lone Node lives in the city of Snyder, which comes after Dallas and Houston, so that actually goes to the end of our memtable. The commit log? Again, Lone Node goes to the end of the commit log. We always append to the end of the commit log.

Let's talk about the purposes of these two data structures. Since the memtable is in memory and Apache Cassandra stores it ordered by clustering column, Apache Cassandra can read that data later and return it back to the user just like you see it when you do your CQL SELECT statements. The purpose of the commit log is, hey, the node may crash, burn, go down, whatever can happen. In order to get the node to the same state it was in before it crashed, Apache Cassandra will read the commit log off disk and replay all those mutations, so that we're good to go and the node's ready to rock and roll. So the memory data structure is for reading the data, and the commit log is for restoring the state of the node if it goes down and has to come back up.

Here comes another write. This is "I got your data, thank you Apache Cassandra." They live in Austin. They go to the beginning of the memtable, and as far as the commit log is concerned, they go to the end of the commit log, always appending to the end of the commit log. Now, I know this illustration makes it look like Apache Cassandra is copying values all over memory, but internally it's just changing up some references (some pointers, if you're more familiar with that term). No big deal. The key point is, if we need to do a read, we can do a read and get our data back sorted by partition key and then clustering column. In this case the partition key is the same for all of our records, Texas, and the clustering column is the city.

In comes another write: "Always on, nom nom nom nom." They live in Dallas, so this one will go with its buddy up in Dallas, and as far as the commit log is concerned, it'll append to the end of the commit log. Kind of not exciting, but that's really cool, because Apache Cassandra can scream when data comes in, and it's that easy just to write it. Last record: Lone Star lives in El Paso. Insert them into the correct spot in the memtable, append them to the end of the commit log, boom, we're golden.

At this point, even though we have that big blue chunk of real estate at the right, let's just say the memtable is now full. There are different parameters that tune how much you can store in a memtable, and memory is a finite resource. Apache Cassandra needs to get that memtable out of memory and onto the hard disk, so what Apache Cassandra does is flush it down to the hard drive. At that point we no longer need the commit log: all of the data is durable on the disk, since Apache Cassandra wrote it down to disk. This new data structure is called an SSTable, which stands for sorted string table. That literally just means the data is sorted first by partition key and then by clustering column values. Key point: this data structure is immutable. As soon as we write the memtable down to the hard drive and create an SSTable, Apache Cassandra will not mutate, change, update, delete, whatever you can think of. That data is as is. Now you may be thinking, hold on, if the SSTable is immutable, how do I do inserts, updates, deletes, that kind of thing? Don't worry, that's coming up in another section.

Now, we recommend that you store the commit log on a separate hard drive from the one where you store your SSTables. Why? Well, notice that with the commit log we just append, append, append. We want to ingest data. If you also have that hard drive busy with reads and writes, mutations, flushes, and compaction, as we talked about in another section, then your node won't perform as well as it possibly can. So stick the commit log on a separate hard disk, especially if you're still using spinners.

For purposes of the next illustration, we're going to ignore the commit log. Just know that the commit log is still going on in the background. Let's ingest some more data. In come our records, our rows. Remember, we're still clustering by the city, so Cassandra stores all the data sorted by the city values. Oh look, our memtable is full again. Let's flush that down as well. That makes two SSTables, which together make up the entire data set of our table. In the read path module we'll explore how Cassandra actually combines all this data when you do a read. All right, time for an exercise. Go work on the write path, and we'll walk you through the solution in another video.
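The write path walked through in the captions can be sketched as a small Python toy model. This is only an illustration under stated assumptions, not how Cassandra is implemented: the `Node` class, its method names, the row layout, and every user id other than 1 are hypothetical, the commit log is a plain in-memory list standing in for a file on disk, and rows are ordered by the raw partition key value rather than by a hashed token.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of one node's write path. Hypothetical names, not Cassandra's API."""
    commit_log: list = field(default_factory=list)  # stands in for the append-only log on disk
    memtable: list = field(default_factory=list)    # in-memory rows, kept sorted
    sstables: list = field(default_factory=list)    # flushed tables, never modified afterwards

    def write(self, state, city, user_id, name):
        # Assumed row layout: partition key (state), clustering column (city),
        # then the remaining columns. Tuples compare left to right, so sorted
        # rows are ordered by state first and city second.
        row = (state, city, user_id, name)
        self.commit_log.append(row)        # sequential append, in arrival order
        bisect.insort(self.memtable, row)  # insert at the row's sorted position
        return True                        # acknowledge the write to the client

    def flush(self):
        # Freeze the sorted rows into an immutable tuple (the "SSTable").
        self.sstables.append(tuple(self.memtable))
        self.memtable.clear()
        # Simplification: the flushed rows are durable, so the whole log is dropped.
        self.commit_log.clear()

    def recover(self):
        # After a crash the memtable is gone; replay the log to rebuild it.
        self.memtable = sorted(self.commit_log)


node = Node()
node.write("TX", "Houston", 1, "Dev Awesome")
node.write("TX", "Dallas", 2, "Come to DSE")   # id 2 is an assumed value
node.write("TX", "Snyder", 3, "Lone Node")     # id 3 is an assumed value
print([row[1] for row in node.memtable])       # cities in clustering order
print([row[1] for row in node.commit_log])     # cities in arrival order
node.flush()
print(len(node.sstables), node.memtable, node.commit_log)
```

Keeping the memtable sorted at insert time is what lets a flush write the rows out in order without a separate sort, while the commit log stays cheap because it only ever appends.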
Info
Channel: DataStax Developers
Views: 1,736
Rating: 5 out of 5
Keywords: DataStax Enterprise, Data Stax, datastax, DSE6, DSE 6, 6.0, Distributed Data Show, cloud, cloud database, databases, nosql, no sql, data modeling, software development, Apache Cassandra, cassandra, spark, Apache Spark, Solr, Apache Solr, graph, Gremlin, TinkerPop, Apache TinkerPop, real-time engineering, software architecture, DBaaS, customer experience, help, academy tutorial recipes, how-to, step by step guide
Id: mDd4I-isodE
Length: 7min 6sec (426 seconds)
Published: Mon Aug 10 2020