Big Data Modeling with Cassandra

Captions
Who here has used Cassandra in a production application? Great, okay, let's try and get that number up a little bit. My name is Matt, I work for Rap Genius in Brooklyn, New York. My connection to Cassandra is that on my previous project we used Cassandra as our main data store. Every time a user signed up we needed to consume a ton of data from a bunch of external APIs, store it, and process it as quickly as we possibly could, and during that time I fell in love with Cassandra. It was super solid, reliable, a pleasure to use, and I've continued to maintain the library I wrote for Ruby and Cassandra since leaving that project. I'll tell you a little bit more about that later.

So now that you know a little bit about me: why would you want to use Cassandra? I need to justify myself here. Cassandra gives you something that you probably aren't going to be able to get with the data stores that are most familiar to you. Cassandra runs on a bunch of machines: you have a cluster, you stand up however many machines you want, or can afford, or need once you get good at figuring that out, and the data is distributed among those machines. Every piece of data generally lives on, say, three machines (that's a common default), but no one machine holds all the data. There's no master; there are simply a bunch of equal nodes hanging out, talking to each other, and storing some subset of your data on each node.

Now, something that's unfamiliar to those of us who have worked with a more traditional replication environment in MySQL or Postgres is that the masterless nature of Cassandra means you can talk to any node you want when you're reading or writing data. These machines are talking to each other, and if you say "I need these rows" and that machine doesn't have them, it knows who to ask for them, and the data comes back to you. It's completely transparent. You don't even need to be connected to every machine in the cluster, and you certainly don't need to pay attention to which machine you're talking to for any given request.
So why is this good? First, if a node goes down, and hardware does go down, you basically will not notice, because all of that node's data also lives on other machines, and the cluster will figure out: looks like this node isn't here anymore, so I'll just go somewhere else for that particular piece of data. You're really tolerant to failure, and that's something you don't have to worry about.

Another thing you don't have to worry about is scaling. Well, you have to worry about scaling a bit, but if you have a single-master data store and more and more data, you basically need bigger and bigger hardware. That's vertical scaling; classic, we've all dealt with it. Cassandra means you can very effortlessly scale horizontally: more data, just throw another node into the ring, and again Cassandra is going to deal with redistributing the existing data across the expanded cluster without you really having to worry about it at all.

Another really compelling reason to use Cassandra is that it's optimized for high-volume writes, and the main reason it's so good at that is that when you write data to Cassandra, it doesn't have to read existing data, modify it, and write it back to disk. The only thing it will ever do is either append to the end of a file or create a new file in response to a write request, and that's just a very efficient thing to do in the world of computers.

So that's fine, but actually the things I just described are true of a lot of the options you have out there: things like Riak and HBase and Voldemort. There's a whole class of distributed, masterless, write-optimized databases you could use, so why am I here talking to you about Cassandra instead of any of those others? The problem with a lot of these big data stores is that the way they model data, and the way you interact with them, can be pretty unfamiliar and pretty awkward.
It can be difficult to map your domain model to the way these databases work, and for that sort of reason Cassandra can be a better option. First, Cassandra models data in tables and rows and columns, so this should be really familiar. This could be a Postgres table: we have an id primary key, that's a thing, and we have a couple of data columns. You can dive right in and think about your data model similarly to how you think about it today if you're using a relational database.

Even better, not only do we get a data structure that's like a relational database, we get a query language that's like a relational database. Cassandra has CQL, and CQL grammatically looks pretty much exactly like SQL. You can express yourself in many of the same ways. There are differences, and we'll talk about them, but again, this is a familiar metaphor for interacting with data, and it allows you to really get going quickly.

So now I want to tell you about the downsides. I don't want to tell you about the downsides, but there are downsides; there's a reason that, like, no hands went up when I asked who's using it. Full disclosure. First, the way you define your tables is going to constrain what you can do in terms of data access. In a relational database, when you're setting up your schema, you really can think about it in the abstract: okay, I've got this data, it has a certain structure, how am I going to map it to tables and join tables and foreign keys and all that sort of stuff? Aside from maybe creating some indexes and that sort of thing, we don't have to think too hard, at least at the beginning, about exactly what we're going to do with that data. That's not true in a Cassandra schema. For instance, let's say we're making a blog, and I expect to be able to sort by different columns for different use cases. That is just not an option.
Cassandra rows have a single order, they always have that order, and the ORDER BY keyword actually only has one valid argument, which is the column by which the rows are already ordered.

Another reason you might not use Cassandra is that we don't get any of those nice data integrity constraints that you get in a relational database. A NOT NULL column, where I want to say every row needs a value for this column: not possible. Similarly, unique columns: just not possible. Foreign key constraints: there's not even a concept of foreign keys or joins. So you can't lean as much on the database, you can't really lean on it at all, to ensure that your data is in a valid state.

Another fact about Cassandra is that CQL is a much smaller grammar than SQL. That INNER JOIN there, the OFFSET: neither of those things exists in CQL. It's not a relational database, so no joins, no OFFSET, no subqueries; the list goes on. It's essentially a small subset of what's possible in SQL, with a couple of extensions as well.

And finally, transactional support. In SQL I can begin a transaction, so I'm kind of in my own little world here, and, say, delete my entire posts table; that's not going to be visible to anyone else yet, so I can just issue a rollback and go about my day like nothing happened. There are no transactions in this sense in Cassandra. You can batch together several writes and send them as a single statement, and those are guaranteed to either all succeed or all fail, and they will be somewhat isolated, in ways that probably aren't worth getting into here. But transactions that span multiple statements, where you can do application logic in between and read data that other people can't see yet: that is not a thing. Still, distributed, masterless, failure-tolerant, horizontally scalable, and optimized for writes: this is a very cool piece of technology.
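The batch semantics just described look something like this in CQL; the table and column names here are illustrative, not taken from the talk's slides:

```sql
-- A batch: every statement in it succeeds or fails as a unit,
-- but there is no rollback and no multi-statement isolation.
BEGIN BATCH
  INSERT INTO posts (blog_subdomain, id, title) VALUES ('mycat', now(), 'Naps');
  UPDATE blogs SET name = 'My Cat Blog' WHERE subdomain = 'mycat';
APPLY BATCH;
```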
So let's talk about how to do things right, in a way that will give you a very performant, very robust, always-working kind of Cassandra data model. We're going to make a blog. Here is some CQL to create a blogs table, and it should look very familiar. Probably the only thing that might jump out at you is that our primary key is not an integer and it's not auto-incrementing. We could make it an integer, but there's no auto-increment: you always explicitly specify the primary key when you create a row. You have a couple of options for how to make that work within your application, and one of them is to pick a natural key. In this case the subdomain is a natural key, so we're going to make that the primary key for our blogs table. Other than that, though, this could basically be SQL.

Here things get a little more interesting; it's not that great to have a blog without posts. There are two columns in our primary key here, and those two columns actually play very different roles: this is a compound primary key. The blog subdomain is what we call a partition key, and what that does is group together related data, in a way that's meaningful, as we're going to see, all the way down to how the data is actually stored on disk. It makes a lot of sense in a blog to group together all of the posts for that blog. Then, secondly, we have a plain old id column, and we're going to make it a UUID; a universally unique ID is your other good option for dealing with a non-auto-incrementing world. That's called the clustering column, and the clustering column defines the order of the rows within a given partition key; they are always in that order. A UUID is actually really useful if you want to order by some sort of timestamp, because a type 1 UUID encodes a time as well as some random-ish data, so when you sort by a type 1 UUID you're actually sorting by timestamp, by the creation time of that ID.
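The slides aren't reproduced here, but the two tables described above would look something like this in CQL (the exact column names are a guess):

```sql
CREATE TABLE blogs (
  subdomain text PRIMARY KEY,   -- natural key instead of an auto-incrementing id
  name text,
  description text
);

CREATE TABLE posts (
  blog_subdomain text,          -- partition key: groups all of a blog's posts
  id timeuuid,                  -- clustering column: a type 1 UUID, so rows sort by creation time
  title text,
  body text,
  PRIMARY KEY (blog_subdomain, id)
);
```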
So let's juice this up a little bit and add a secondary index. That is something you have available to you in Cassandra, and it's great. Secondary indexes are not a cure for everything; they're specifically not something you should base your main access patterns on, but in this sort of situation, where we have authors and at certain points we'd like to look up what a given author has written, we can absolutely do that. You can have more than one secondary index on a table, but you cannot have more than one column in a single secondary index.

And finally, let's add a collection column. This is a very cool, fairly new feature in CQL: we get sets, we get lists, and we get maps, and we get essentially atomic operations on those collections, much like you get in Redis. I can just say "add this element to this set"; I don't have to read the existing contents of the set and persist the whole thing back. You get basically anything you would expect for these types of collections as an atomic write operation, so that's very cool.

Now we're going to dive a little bit under the hood, which will motivate the way we've structured our data model so far. We're going to get a little bit into the weeds, and the reason I want to talk to you about this level of things, even though it's a level you never have to directly interact with, is this: that project where I used Cassandra, loved it, and it worked like a dream was actually the second Cassandra deployment we did. The first one was pretty much a mess, and the basic reason it was a mess is that we took an existing relational database schema, we were storing data in Postgres, and we just said okay, let's do the exact same thing in Cassandra. That is a bad idea.
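The secondary index and the set column described above might look like this (the names are illustrative):

```sql
-- Each secondary index covers exactly one column
CREATE INDEX posts_by_author ON posts (author);

-- A set column; elements can be added atomically, with no read before the write
ALTER TABLE posts ADD categories set<text>;
```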
Having a bit of understanding of what's actually going on at the lower level, and of how Cassandra represents this data internally and on disk, will allow you to reason very effectively about the types of structures that let you interact with your data efficiently.

The basic structure holding our low-level Cassandra data is called a column family, and it looks a lot like a Ruby hash, or a hash in general: a key-value structure where the values themselves are also hashes. These inner hashes also have the property that they maintain order by their key. That's not a coincidence of insertion order or just how I decided to display it: these inner hashes will always, always be in order.

Let's look at this in a more traditional way of representing a column family in Cassandra. On the left we've got a row key, and then we've got several columns, and each column contains a column name and a column value. You'll notice that I didn't line them up; "column" is actually a pretty terrible name for what's going on here, because there's no fixed set of columns that this structure represents. We can have one, two, four in the first row and two, seven in the second; every row can have completely different column names, and that's fine. It really is more like that hash, where each row is just a bunch of key-value pairs.

Now, we call these wide rows, and the reason is that you can put about two billion columns in every row at this level; that's about where the people who maintain Cassandra would recommend you stop. But two billion is a lot. I'm guessing most of you don't have two billion columns in any of your relational database schemas, though I'd be impressed if you did.

The most important thing to know about wide rows is that each individual wide row sticks together. That means that when a wide row is stored on disk, it lives in a contiguous part of the disk, and in fact the data within that row is stored in sort order.
So if I want to read a single row, or a slice of a single row over a range of column headers, I can do that very efficiently: that's one chunk of disk, I just have to find one thing. As a corollary, a given wide row lives on the same set of machines; if you're replicating three times, any given wide row is going to be on three machines in its entirety. Bear that in mind as we keep looking at our schema.

One small note is that under the hood, Cassandra has a concept of compound types. These are just tuples. Cassandra has a pretty normal type system, pretty much anything you would expect from any other data store, but you can also have a compound type, and those have a defined order, just like sorting an array of arrays in Ruby: sort by the first element, and if there are duplicates in the first element, sort by the second, and so on.

Let's take a brief step back up to the CQL table we created and the way we're actually dealing with our data. Here are a few values for our blogs: I've got a blog about my cat and I've got a blog about code, and that first column, the subdomain, is what separates the blogs. Now let's look at four roles that exist within the data here. We've got the partition key, which we talked about: everything for a given blog has the same partition key. We've got our clustering column, which defines the order. I put dates in here, and you'll keep seeing dates, because UUIDs are really, really long on a screen and just look jumbled, but think of these as UUIDs that represent a given date. Then we've got the names of our data columns, title and body, and we've got the values of our data columns, which are just the values you would expect. Take this in while I take a deep breath, and then we are going to look at how this is actually represented under the hood. Okay, here's what it looks like.
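As a rough sketch, the column family structure described above behaves like this Ruby hash of hashes; the date strings stand in for timeuuids, just as on the slides, and the compound column headers sort element by element like arrays:

```ruby
# A column family sketched as a Ruby hash: row key => wide row.
# Each wide row maps a compound column header, modeled here as a
# [clustering value, data column name] array, to a column value,
# and always keeps its columns sorted by that header.
column_family = {
  'mycat' => {
    ['2013-09-01', 'body']  => 'Curled up in a sunbeam today.',
    ['2013-09-01', 'title'] => 'A Nap Report',
    ['2013-09-15', 'body']  => 'I will catch the red dot eventually.',
    ['2013-09-15', 'title'] => 'The Red Dot'
  }
}

# Compound headers sort element by element, like Cassandra's compound types:
sorted_headers = column_family['mycat'].keys.sort
puts sorted_headers.first.inspect  # => ["2013-09-01", "body"]
```

This is only a mental model, not how you talk to Cassandra, but it is why a slice of one wide row is a single contiguous read.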
For every blog, we basically take all of these skinny rows and project them into a single wide row, and the headers of the columns in our wide row are a compound of the clustering column and the name of the actual data column, the thing we normally think of as a column. So we have all of the rows for the blog in sorted order, right next to each other, and within each row you've still got the columns all broken out, but they're all sticking together in one wide row.

What are the consequences of this? Well, it means that if I want to do something like the query on the slide, that's going to be really efficient. All I'm saying is: give me the end of the wide row that's at the "mycat" partition key. All Cassandra has to do is find it on the disk, read that range off the disk, and bring it back, and it's already in order; everything works just as we want. That is in fact why we chose this particular schema. Like I said, the schema you choose is driven by the way you're going to access your data, and the basic thing I want to do on a blog is access the ten most recent posts.

Another thing we can do very efficiently, and this happens to be another good use case for a blog data model, is ranges over our clustering column. In this case we're going to look for everything in September from the "mycat" blog, and as I'm sure you're all imagining, that just means: find the beginning and the end of that range within the wide row and give it back to me. Super efficient.

Using secondary indexes, like I said, should not drive your core data access, and the reason is that if I'm just getting a bunch of posts by author, they're potentially living in random places, in random wide rows, under the hood. So yes, the secondary index does efficiently let me figure out which rows I want, but when I actually go to get them, that's going to put a fair amount of strain on the database to find them.
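The two efficient access patterns above would look roughly like this in CQL; the real queries would use timeuuid bounds, so the `minTimeuuid`/`maxTimeuuid` calls here are one plausible way to write the range:

```sql
-- The ten most recent posts for one blog: one contiguous read from one wide row
SELECT * FROM posts
  WHERE blog_subdomain = 'mycat'
  ORDER BY id DESC
  LIMIT 10;

-- A range over the clustering column: everything from September
SELECT * FROM posts
  WHERE blog_subdomain = 'mycat'
    AND id > minTimeuuid('2013-09-01')
    AND id < maxTimeuuid('2013-09-30');
```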
So this is legitimate, you can do it, but if it's the main big operation you're doing all the time, the one that has to be super performant, you probably need to choose a different schema, or you might denormalize. We have a fair amount of scope to denormalize here; in the end you'll probably either build a table that acts as a secondary index, or build a table that denormalizes certain things in the particular order, or range, that you want to read them in, and sort of hand-roll it that way.

Another way to use our Cassandra database most efficiently is, if we're doing a lot of writes, to try to do that without reading. Here's an example of adding something to a set: I have no idea what's in the set right now, I just want to add something to it if it's not already there, and I also want to update the title. Because Cassandra is write-optimized, this sort of pattern, particularly for high-volume writes, will work really well in keeping Cassandra happy and efficient and performing well for you. But as soon as you start trying to read and then write, you're going to find that things are a little more sluggish. Now, of course it's fine to be reading in your read paths.

Some of you may be wondering about the blogs table, which does not have a compound primary key; it doesn't have a clustering column at all. The wide rows in the blogs table are not wide, and because of that there is no defined sort ordering. It is in an order, but it is not a meaningful order, so we can't do anything fun like ordering by subdomain, and similarly we can't do range-type queries, greater than or less than or between; also not an option. So really these top-level models, the ones that have a simple primary key, should be a single entry point into the data they own. In this case, for the most part, we're only ever going to be dealing with one blog at a time.
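The read-free write described above, a set addition plus a title update in one statement, might look like this (the uuid literal is just a placeholder):

```sql
-- No read-modify-write: Cassandra just appends these changes
UPDATE posts
  SET categories = categories + {'kittens'},
      title = 'The Red Dot, Revisited'
  WHERE blog_subdomain = 'mycat'
    AND id = 5132b130-eb36-11e3-951a-0800200c9a66;
```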
It's perfectly reasonable to load a single row out of a simple-primary-key table, and we can do a bulk load and say "give me these hundred keys"; that's also fine. But manipulating them in any meaningful way as a collection is kind of out of the question.

So I do have some tools for you to work with, if this sounds appealing and you would like to scale your data out like crazy and never have to wake up in the middle of the night because the database is down. This is a library called Cequel. It's been around for a while, but just this morning I released the pre-release of the 1.0 version, which lets you do all of the things I showed you. Actually, a lot of the CQL in this presentation is pretty new; it's the third version of the CQL language. You get an ActiveRecord-like domain model: you represent the relationships between parent and child in the ActiveRecord-like way, and that translates into a compound primary key. We can take a look at what that looks like: Blog has a simple primary key, and Post says "I belong to Blog," so its partition key is going to be the blog subdomain, and then it has its own id. You can deal with collections quite easily: they just act like the Ruby version of that collection, and under the hood, in the save statement, we persist changes to those collections the same way you made them in memory. So if you push something onto a list, we push something onto that list in Cassandra. For the most part it works just like ActiveRecord.

My goal in designing Cequel was to make sure that doing things the right way, the way we've talked about today, is easy and natural, the thing you want to do when you look at the interface, and that doing things the wrong way is awkward and doesn't feel right, in hopes that you will not only become a dedicated Cassandra lover like myself, but that you will be able to get there with a minimum of pain.
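The Cequel models described above would look something like this; this is a sketch against the 1.0-era API, so exact DSL details (for example `auto: true` on the timeuuid key) may differ from what was on the slides:

```ruby
require 'cequel'

class Blog
  include Cequel::Record
  key :subdomain, :text      # simple primary key, a natural key
  column :name, :text
  column :description, :text
  has_many :posts
end

class Post
  include Cequel::Record
  belongs_to :blog           # becomes the blog_subdomain partition key
  key :id, :timeuuid, auto: true
  column :title, :text
  column :body, :text
  set :categories, :text     # collection column, persisted the way you mutate it
end
```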
So that's all I've got. I hope some of you will go check out Cassandra and use it for your big data needs. I'm sure everyone's hungry, so please feel free to hit me up at lunch or on the internet; there's a link to Cequel itself and my email. Thank you.
Info
Channel: Genius Engineering Team
Views: 28,897
Rating: 4.860465 out of 5
Id: L5xHQwT1Xww
Length: 25min 43sec (1543 seconds)
Published: Fri Nov 22 2013