Introduction to Apache Cassandra™

Captions
David: Welcome, everybody, to our Intro to Cassandra for Developers workshop — today, or tonight; I know a lot of you are saying "good evening" because we're all around the world. We'll get started in a minute, so chill out to a nice EDM video on databases, and make sure to read the subtitles.

And hello, everybody — thank you for joining us. I'm David Jones-Gilardi, and I'm joined today by Rags Srinivas. And there — I just made a booboo, I clicked the wrong thing; you get to see the fourth wall sometimes. Did I say your name right, Rags, or did I totally mess it up?

Rags: No, I think you did a great job. But most people refer to me as Rags — "rags," not "riches." Easiest way to remember me.

David: So a little bit about who we are. We're both developer advocates from DataStax, and today we're going to learn together about Apache Cassandra. I'm an Apache Cassandra expert; I've been in the programming and IT world for about 25 years, coded in a bunch of different languages, and there was one time in the past where I held an Oracle DBA certification — but that was a long time ago, back in the '90s. Rags, how about you? Anything you want folks to know?

Rags: I already said my name is Raghav Srinivas, but I just go by Rags. I'm really not going to mention the first computer language I learned — it precedes most languages. Was it punch cards? I'm not going to go there, but I was thinking of adding a quiz on Menti: name a language that has two sevens in the name. I did a lot of developer advocacy and evangelism for Java, and I picked up this hat at one of the developer conferences — my "Indiana Srinivas" hat. Give us a thumbs-up in the chat if you want me to wear it. You probably also saw the transition in my picture; I jokingly refer to that as acquiring gray hair as a result of worrying about small things in development. Although I've been doing development for quite a few decades, I really want to get into making development a lot easier. With Kubernetes we're getting there, but I think we still have a long way to go.

David: And it's not just me and Rags. For those of you who've been with us in one of our workshops before, you've probably heard us say this: we have a whole team in the chats — both the YouTube chat and Discord, which we'll share a link for in a moment. This is meant to be interactive; we're here to answer your questions and learn together, so don't be shy.

Some housekeeping. You're probably watching us right now on our DataStax Devs YouTube channel — we're also on our other streaming platform, whose name I've just totally blanked on; someone remind me in the chat. Separate from this live stream, we have a ton of other workshops — we're always doing workshops — so if you subscribe to the channel and ring the bell, you'll get notifications for everything coming up. In the top right corner you'll see the link dtsx.io/discord, which brings you to our Discord server. What's the difference? We'll happily chat with you on YouTube, but when this stream ends — and it's a two-hour stream, by the way — that chat goes away. If you want to continue the conversation, especially with longer-form questions, the Discord server exists indefinitely, and with something like 15,000 users in the community there are other members who can help answer questions. And finally, at the bottom, there's menti.com. We'll use Menti today not only to ask you some questions in a moment, but for our swag quiz later — so if you want to win some swag and compete against everyone else here, join us on Menti.

A couple of other things. The workshop is meant to be hands-on: you can follow along with us, or just watch — totally up to you. If you have to leave — I know for some people it's pretty late, maybe 11 p.m. — that's okay; you can come back and do this later. The key is the GitHub repository we'll share as we get into the materials. It's all intended to be self-service, so you can do it on your own, and Rags will bring you through the exercises in the GitHub repo. So we'll both do some theorycrafting: I'll talk theory and help you get the Cassandra fundamentals, and Rags will walk you through all of the exercises.

We'll use Astra DB today, and I want to be very clear: everything we're going to do is free. The Astra account we'll have you create is completely free, as is all the good stuff in GitHub — all of our workshops and learning materials are always free. Astra DB is DataStax's serverless Apache-Cassandra-as-a-service platform — a lot of s's in there. What is that? It's a cloud-based service where you can spin up a database built on Apache Cassandra in a couple of minutes. We're all going to do this together, and then you'll get to use Cassandra in the exercises for real. What's really cool about Astra is that you can use it for your own projects, completely free: the free tier is pretty generous — around 40 gigabytes of storage and many millions of reads and writes — and it renews every single month, so you can keep it free for learning, prototyping, and greenfield projects.

And finally: badges. Everybody always asks, "Can I get a certification for this workshop?" We don't have participation certificates; what we do have are badges. As part of today's workshop we'll give you some homework — and the things we do today are, in fact, most of the homework. The homework is listed in the repository, along with instructions on how to submit it; finish it and you can get yourself a badge. These are nice because you can share them on LinkedIn — we have about ten of them now, so there's a ton of badges you can collect for bragging rights. The homework scenarios are also designed to go a little deeper: given the time we have and the topics to cover, we have to stay fairly high-level today — we'll go deeper where needed, and please ask your questions — so whether or not you want the badge, the scenarios are good for solidifying what you learn. They're in the repository, and you can do it all through your browser.

How are we doing on questions before we get into Menti? Let me take a look. Amol asks: "How can I get the voucher for certification?" We'll drop the voucher link at the very end — the voucher is our thank-you for coming and staying with us through the workshop. And — oh, NovaCrazy — it's Twitch! I don't know how I forgot Twitch; it just blanked. This is how you know these are live events: every once in a while you see our brains come out of our ears.

All right, first thing: go to menti.com and either use the QR code with your phone or enter the code you see at the bottom of the screen. I'm going to use this to ask you some questions to start, and then I want you to keep it open — we'll use it for the quiz later. Again: use the QR code, or use the Menti code you see there.
David: I'll type it into the chat for you: menti.com, code 8112 4105. And let me get this into Discord as well. Do me a favor and give us a thumbs-up in the Menti when you're in — I see 60 of us there already, wonderful. For the number of folks with us today, I feel like we can push that to 100. Thanks, Krithika — super excited to have you here. The numbers are being pushed up, wonderful. Let me put the code up one more time to give you another chance to get there, and then we'll get going with the Menti and into the content, because that's why you're all here.

First, a set of questions just to get to know you a little better. How much experience do you have with Apache Cassandra? Since this is an intro to Apache Cassandra course, we assume most people have never used it, or have very little time with it. Oh — we've got at least one veteran at three to five years; that's pretty solid. The code is at the top of the screen if you still need it. And yes, it looks like most of you have never used it — no surprise; it's an intro course, that's the idea — and we have a couple of veterans with us. Very cool, thank you.

Next: have you been to any of our workshops before? I'm always curious about this one. You know, it's interesting, Rags — I'm starting to notice a trend: every time we ask this question, about 20% of the folks have been here with us before, some from the day before, and the rest are always new. Which is really awesome — I'm always happy to see folks come back, hello to the returning faces, and of course happy to see the new ones.

And what is your main motivation for learning about Cassandra? A lot of folks are looking into development with Cassandra; some are looking into both dev and admin. I wonder if we have a lot of operators with us today — that's kind of your realm, Rags, the folks on the operations end?

Rags: I don't know about that, but yes — and we have some admins. We always love our admins. We love everyone, really.

David: That's very true. Okay, great — keep Menti open, because we'll come back to it later for your swag quiz. Now let's get into it.

First: what is Apache Cassandra? It is a NoSQL distributed database, and there are two words there that are particularly important: NoSQL and distributed. NoSQL means "not only SQL," by the way — not "no SQL." It's been a bit more than a decade now since NoSQL databases really started to push out from the relational database world. Where did they come from? Relational databases have served us extremely well for decades, and they still do, but at some point the requirements being asked of them — scalability requirements, latency requirements — outgrew what they could keep up with. So there was a kind of NoSQL revolution, where all these different technologies that extended past what relational databases could do started coming into the world. Relational databases are wonderful general-purpose machines — anybody who has used one is used to being able to query on just about anything — but eventually you run into constraints, and NoSQL databases came to the fore to extend past them. That's essentially where Cassandra came from, with a big focus on scalability, speed, and robustness.

The distributed piece is also key. Many NoSQL platforms are distributed, where historically relational databases are not. They can be — there are leader/follower setups, and some relational databases have gained features like read replicas over the years — but the big difference is that databases like Cassandra were built from the ground up to be distributed. We'll talk more about what that distributed piece means and where its benefits come from.

An instance of Cassandra is called a node. An individual node handles a couple of terabytes of storage, and for throughput — I'm being cheeky when I say "lots of transactions a second per core," because it depends on how the individual node is scaled — you're usually in the realm of thousands of transactions a second per core. Machines these days have multi-core processors, so generally you can expect a single node to handle many, many thousands of transactions a second. And — Rags, did you have something you wanted to say? I thought I heard something.

Rags: No, keep going.
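[Editor's note] The per-node figures above lend themselves to quick back-of-envelope math. The sketch below uses the talk's ballpark numbers; the core count and cluster size are assumptions for illustration, not benchmarks.

```python
# Back-of-envelope capacity math for a Cassandra node, using the
# ballpark figures from the talk. Core count and cluster size are
# illustrative assumptions, not measured values.
storage_per_node_tb = 2          # "a couple of terabytes" per node
tx_per_sec_per_core = 1_000      # "thousands of transactions a second per core"
cores_per_node = 16              # assumed modern multi-core machine

node_tx_per_sec = tx_per_sec_per_core * cores_per_node
print(node_tx_per_sec)           # 16000 -- "many thousands" per node

cluster_nodes = 12               # hypothetical small cluster
print(storage_per_node_tb * cluster_nodes)   # 24 TB of raw storage
```

Real sizing depends heavily on workload, hardware, and replication, but the rough shape — throughput scaling with cores, storage scaling with nodes — matches what the speakers describe.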
David: Okay. Now, I mentioned that Cassandra is a distributed database. You can technically run Cassandra on a single node — maybe in a development setting, when you're first experimenting, that's fine — but you won't reap any of the benefits of Cassandra unless you actually use its distributed nature. A Cassandra database should really be comprised of multiple nodes, and what's really key is that any node can do what any other node can. It is a peer-to-peer, leaderless system: there's no leader and follower, no special node. I can add a node or remove a node, and the system will dynamically and automatically adjust. The nodes communicate through a protocol called gossip — they talk to each other, and can determine whether another node is down, what its token ranges are, things like that. The key takeaway is that it's peer-to-peer and leaderless; this is really important to some of Cassandra's core capabilities. Nodes are logically grouped into something called a data center, drawn as a ring — you see the circle that just grouped everything together, a data center with all of my individual nodes in it. Interestingly, Cassandra can have more than one data center; I'll talk later about where that comes into play, because there's some really cool functionality you get with it.

Some high points of Cassandra. First, this is a big-data database that can handle petabytes — think how big a number that is; a petabyte is a one with 15 zeros behind it. And it's not enough just to be able to store that amount of data; you have to store it with performance at scale. It's key that even in a petabyte-scale database, when I request data, I can still get it back in milliseconds — that's something Cassandra really excels at. From an availability standpoint, Cassandra is known as the always-on database, and part of the reason comes from its distributed nature: in some configurations you can lose two-thirds of your nodes and still have an available database that can facilitate requests. From a geographical-distribution standpoint — I'll talk about this later — since you can have multiple data centers, and those data centers replicate between each other, you can spread a database globally. The read/write performance is very well known: writes in Cassandra happen essentially at wire speed, and both writes and reads usually land in very tight milliseconds — and again, this holds at any scale, whether I have three nodes in my cluster or a thousand. And the last one: it's vendor-independent. It's open source, it doesn't care what platform you install it on, and you're not locked into a particular vendor.

So let's look a little at the distribution piece and how it works, because the question we always get is: wait a minute, if I have all these nodes and data coming in, how does it know where to put things? This is where partitions come into play, and it's a little different from how you might do things in a relational database. Look at the table on the right: I have three columns — country, city, and population — and one of those columns, country, is labeled as my partition key. What does this mean? In Cassandra, data is partitioned.
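[Editor's note] A minimal sketch of partitioning the example table by its country column — a toy model using a plain dictionary, not the real storage engine, just to show that rows sharing a partition key are grouped together and read back with a keyed lookup instead of a scan. The city and population values are made up.

```python
from collections import defaultdict

# Toy model: a "table" of (country, city, population) rows,
# partitioned by country. Rows with the same partition key are
# stored together, so a read by key needs no full scan.
rows = [
    ("USA", "New York", 8_400_000),
    ("FR",  "Paris",    2_100_000),
    ("USA", "Chicago",  2_700_000),
    ("CA",  "Toronto",  2_900_000),
]

partitions = defaultdict(list)
for country, city, population in rows:
    partitions[country].append((city, population))  # group by partition key

# Reading the "USA" partition is a single keyed lookup:
print(partitions["USA"])  # [('New York', 8400000), ('Chicago', 2700000)]
```

In real Cassandra the same grouping happens inside each replica node: both "USA" rows land in one physical partition, which is why a read of one partition is fast regardless of cluster size.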
The partition is the base-level unit of access. As I put data into my tables, it's required that I have at least one column that identifies my partition key — I'm saying "I'm going to partition my data by this column," in this case country. As I write data, that data is automatically distributed around my cluster by partition. Now, notice these partition keys — USA at the top, and France and Canada — and notice that where multiple rows share a partition, like USA, they are actually stored physically together. This is done automatically; it's not something you actively manage — you define your partition key in your table, and as you write data, this happens. Why is it important? Because the partition is essentially the address of your data. Later, in a thousand-node cluster, when you go to read, I don't scan through all those nodes to find the data — I literally say, "go get me that USA partition," and (I'll show you later how) it can do that extremely fast. So just know: as you store data in your tables, that data is automatically distributed around your database by partition key.

Rags: David, if you don't mind me jumping in — I may be stealing a little of your thunder, but essentially what I do when I talk about these databases, especially key-value databases, is oversimplify and say it's a distributed hash table, whichever way you want to look at it. The nice thing about Cassandra is that the platform takes care of distributing the data in an even fashion, and of what happens if one of those copies disappears for whatever reason — I know you're getting to that. It's an oversimplification, but that simplified mental picture helped me a lot in understanding what the NoSQL database Cassandra is.

David: I think you're absolutely right, Rags — it's totally a distributed hash table, without a doubt. You nailed it right on the head; that's a simpler way to talk about it, for sure.

So we talk about this partition key and things being distributed, and the question always comes up: when I'm issuing queries — writing data, or reading it later — how does Cassandra know where that data is? There's something called a partitioner in Cassandra, and again, this is automatic. Remember the table we had, where the country column was our partition key. If you look at the left there, you'll see the partition key: when data is written, the partition key goes through the partitioner and is hashed to a token value — you see the example, token 59 or token 12. Why am I telling you this? Look at the nodes at the bottom and notice they each own a set of token ranges — each node is responsible for a set of tokens. The partition key gets hashed to a token value, and that determines where the data lives. Later, when I go to read some data — say, the USA partition again — the key gets hashed to its token value, and that token value says exactly where the data lives. Again, this is done automatically, but it helps you understand how the data gets addressed — how the address of the data is set.

Rags: Might be a good time to address the question that NovaCrazy14 is asking: why are AU and Indiana together?

David: Yes — thank you, Rags, and thank you, NovaCrazy, for asking. Let me go back to that screen. To be clear, you're not limited to a single partition per node — a node could have millions of partitions; this is just a simplified example. In the case of the USA partition, it has two rows, and because both rows have the USA partition key, they're stored physically together. AU and Indiana are different partitions — different partition keys — so that's just showing that I can have multiple partitions on a node. Given the sample data set we had on the right, that's simply how we drew it. Again: any node could have millions of these partitions; you're not limited to one partition per node or anything like that.

All right, let's look a little at replication factor. This gets into one of the key core features of Cassandra: inherent, automatic replication. It's something we'll gloss over a bit today, because we're using Astra DB and Astra DB does it for you, but if you're working with your own Cassandra clusters — and the veterans among you will be familiar with this — there's something called a keyspace, which is where you store your tables; essentially it's like a database or a schema in the relational world. On the keyspace you set something called the replication factor, which essentially says: how many replicas of my data do I want? What's really cool is that once you create your keyspace and set your replication factor, all of the replication I'm about to describe is completely automatic — Cassandra does it for you.

Remember, each node owns a range of tokens. Notice that each of the nodes in this example has a number — 0, 17, 33, and so on. What I'm saying is that each node owns a range: the top node owns 1 to 17, the next 18 to 33, and so forth. A replication factor of 1 means every partition I write is stored on a single node. Take our USA example from earlier: with a replication factor of 1, that partition would be stored on one node only. What happens with a replication factor of 2? Now I add a ring and shift it, which means two different nodes may own a particular token range — so if I store my USA partition, it's stored on two nodes. Replication factor of 3 — what do you think happens? I add a ring, shift it, three nodes can now own a particular token range, and data I store is replicated on three nodes. And to be clear, this is not limited to one partition — all partitions follow this.

Replication factor 3 is essentially the standard; it's where you should start. I totally get it if you're doing development on your local laptop — you might have a replication factor of 1 there — but for any real usage, start at three; you'll see that's where you start to get a lot of the benefits. One thing to note: the Astra DB we're using today starts you there — it'll be at a replication factor of 3, already set up for you.

Okay, so let's look at what happens when a request comes in. Using our USA example — the partition key that hashed to token 59 — how does the data get where it needs to go? As I mentioned, Cassandra is a peer-to-peer system and completely leaderless. That means a request can come into any node in our database at all, and it will get to where it needs to go. The node that handles the request is what we call the coordinator — all that means is that node, at that moment, is managing that particular request. Say the node at 17 receives data destined for token 59. It asks, "who owns this?" Remember, the nodes talk to each other — they gossip — so it knows that the node in purple owns that range, roughly 51 to 67. Not only that: we're at a replication factor of three, so we're going to make three copies, to the three replicas that own that range. Notice the purple lines — there are three nodes that contain the range handling token 59. So when the coordinator gets the request, it says, "these three nodes own that range," and forwards the data on to those three.

I saw some questions come in while I was talking, and I just said a lot, so let me pause. How are we doing on questions, Rags?

Rags: We've got a barrage of questions. Ryan answered some of them, but some are pretty interesting. One: how do we determine the replication factor? Ryan already covered it — just go with three, right?

David: Yes. Sometimes you may want more, but very rarely.

Rags: Does implementing a replication factor mean that we are violating data redundancy? I don't quite understand the question.

David: No — if the question is really about whether the data will be consistent, we're going to talk about that and how Cassandra handles it, if that's where the question is headed.
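[Editor's note] The write path just described — hash the partition key to a token, then find the replicas whose ranges cover it — can be sketched as a toy token ring. This uses MD5 over a 100-token ring for readability; real Cassandra uses the Murmur3 partitioner over a vastly larger token space (plus vnodes), but the mechanism is the same shape.

```python
import hashlib

# Toy token ring: six nodes, matching the slide's labels. A key is
# hashed to a token; the first node whose token is >= that value is
# the primary replica, and the next rf-1 nodes clockwise complete
# the replica set. (Real Cassandra: Murmur3, huge token space, vnodes.)
RING_SIZE = 100
node_tokens = sorted([0, 17, 33, 50, 67, 83])

def token_for(key):
    # Hash the partition key down to a position on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def replicas_for(key, rf=3):
    tok = token_for(key)
    # First node at or past the token owns it; wrap around if needed.
    idx = next((i for i, t in enumerate(node_tokens) if t >= tok), 0)
    return [node_tokens[(idx + i) % len(node_tokens)] for i in range(rf)]

# Any coordinator can compute this, so any node can route the request:
print(replicas_for("USA"))  # three distinct node tokens, deterministic
```

Two properties worth noticing: the mapping is deterministic (every node computes the same replica set, which is why there is no leader), and the replicas are consecutive nodes on the ring, which is why adding a "shifted ring" per extra replica is a good mental picture for RF 2 and RF 3.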
right and will all three nodes keep the same data um yes so in the case of a replication vector of three or whatever your replication factor is then for any given partition then all nodes that own that particular range will keep the same exact data exactly um matter of fact i'll talk about something a little bit later as we get into like um more repair scenarios or whatever um but yeah the cassandra has a whole bunch of mechanisms in it to ensure that data is consistent right because that's kind of important because you want to make sure if you're if you've got the data on three different nodes you want to make sure no matter which one you read from you're getting the most consistent data right totally um so yes yeah cassandra does handle it for you yeah sana salim had a question about uh is replication factor less than or equal to number of nodes and and really i mean i don't think they you know there is any relation between replication factor and number of nodes obviously you know you have to have the number of nodes at least uh greater than replication factor which is probably what you're talking about but but really um you know three is recommended so obviously you need to have at least three nodes right yeah and to be clear let's say you have a thousand node cluster you wouldn't make the replication vector a thousand right you would be correct seriously over replicating your data you would still have a replication factor of three potentially that's at least where you would start um and that's exactly how many cassandra's able to maintain that scale can you imagine the amount of network latency and stuff like that you would incur if your replica if you were replicating to every single node you always had so that's one way that you can actually have these really large clusters but your replication factor is still three maybe five depending on uh your particular need um i did see another question that i wanted to answer oh yeah will it be synchronous or 
asynchronous replication always async cassandra is an asynchronous database right um if we blocked on that yeah that would incur lots of latency and everything all right so another question that may be again relevant is can you get the data from any one of the nodes or is there a leader node there is no leader node you can get the data from any one of the nodes as a matter of fact when we get into the consistency level section a little bit later i'm going to talk more about how that happens so just be patient on that one and i will get to it but yeah it's any of the nodes that own that data you can get the data from yeah absolutely lots of quality questions today yeah definitely that's great um you know and uh divya asks a nice question that i'll just address right now what's the purpose of storing the same data in multiple nodes that's a great question not that the others aren't but this is kind of key there are a whole set of benefits you get here one is resiliency if i lose a node and i'm going to talk about this actually in just a moment here i might as well do that see my mouse is somewhere else there we go there we go if i lose a node right when i perform a write well guess what two of the other nodes still can write the data but the node that acted as my coordinator will store what's called a hint this is actually one of the self-healing mechanisms in cassandra and when that node comes back up it'll replay the hint on that node so part of the answer divya is the benefit of having multiple nodes that store the data is now i have a built-in resiliency there are some configurations where even though i have three nodes that i need to have data i could lose two of them and i can still write and read data and have an available database right that's actually one of the key features here another piece if you think about it if i have multiple nodes that own the same data from a performance standpoint then i'm spreading out the load of requests to
that data because now i have three nodes that can facilitate that request not one right but again there's no leader or anything like that any node can handle the request it's given so let's go ahead and show yeah another one had another question which is is the data ever replicated on the same node go ahead um is the data ever oh no right right right yeah i see uh joshua um no if you have like in this example right i'm not gonna see multiple partition 59s over and over and over and over on the same node if i have a partition right and i'm adding rows into it or something that will always be physically stored in the same physical partition i'll never have a repeat of the same partition on a single node right i hope that answers your question um also from a data distribution perspective i think it makes it a lot easier to kind of spread it around the cluster yes yeah and i see a couple more and then i want to move on because i think some of the other questions i'm seeing we're going to address as we move forward i want to make sure that we do get through the material um siva asks what would happen if all the replicated nodes fail honestly if you have a case where all of your nodes are failing you have a bigger problem in the case where for a particular partition token or a particular partition all of the nodes actually failed for that partition then that partition would not be available yeah now that's exactly though why cassandra is distributed and it's very robust because the chance of having that happen you know one thing you design here a lot of times you might spread nodes in multiple regions things like that you don't always put them on the same exact physical hardware because that way if somebody by accident you know flips the switch to turn off your server farm you don't just lose your whole cluster things like that um so there's all sorts of ways that you can kind of build that but yeah if you
happen to lose all three that owned that same exact partition then it just wouldn't be available um the other one here is what if the coordinator node goes down while it's managing the other nodes so again any node can handle any of these requests right now once those requests go out like if the coordinator actually went down right in the middle of the request right now any other node can actually take over coordination from that standpoint but what would probably happen from the driver perspective right if the coordinator itself actually went down you would probably get an error message or a disconnect in the driver or something like that now the cool thing is writes in cassandra are idempotent um or they should be anyway and idempotency just means that if i perform the same exact write it's the same information and because of the way that cassandra writes data and handles data you could literally perform the same write operation over and over and over and be totally safe so if something like that happened it might retry the drivers do have retry mechanisms in there but the chance of a coordinator going down exactly at the time that it is coordinating out is pretty darn low um but there are robustness mechanisms built in both the drivers and in how cassandra does it itself to manage that case so before we proceed someone just had a question about uh kind of the logistics of this uh i'm not able to see the questions posted by other participants uh most likely i feel like you're not on the youtube chat uh because that's the way you can see it there is another um kind of eventbrite link or something right uh where you can't see the chat so twitch yeah we're not answering questions in twitch if that's where they are yeah yeah okay yeah if you're in the youtube chat you should be able to hopefully see everybody's questions all right so let me go ahead and move on from there and i see that
stefano and ryan are all about you know everyone's answering questions too so hopefully we'll get to everybody okay so cap theorem now this is kind of a key concept to understand when working with distributed systems right in a distributed system if everything is working well everything is green you can maintain each of these three guarantees consistency meaning my data is consistent right availability meaning i can talk to the database and issue requests reads and writes partition tolerance is really talking about um like network partitions so imagine i have two nodes that are talking to each other over a network and if that network gets severed what happens in that case so the cap theorem essentially says that in a failure scenario you could only ever maintain two of the three guarantees you can't maintain all three in that particular case and this is any distributed system this is not just cassandra or something like that so cassandra defaults to what we call an ap system availability and partition tolerance what does that mean right that means cassandra will default to being available so let's go back to my two node example let's pretend i have two nodes and they're talking right let's say that i have a network partition meaning they can no longer talk to each other well guess what each one of those nodes can facilitate a request right each one of them can handle reads writes and everything like that but what i can't maintain is consistency why well they can't talk to each other so they can't maintain that yet right and then once the network comes back online they can talk to each other now they can maintain consistency now what's interesting is even though cassandra defaults to being an ap system you can actually configure it at the query level to be a cp system meaning that in that case if it's cp then it would say hey i can't maintain consistency i can't do this request so it's no longer available right so there's always this
trade-off and this is any of the databases but it's a key thing to kind of understand and again though cassandra defaults to being an ap available partition tolerant system which essentially means even in cases of failures it's going to be up and it will facilitate requests and then once things go back it'll kind of fix the consistency part so that gets down to this whole question that i was just talking about is it ap or cp is it availability partition tolerant or consistency partition tolerant right so this is what i was just talking about now cassandra has a set of what are called consistency levels um you can see that in the bulleted list there to the left right um really there's only two of them out of there that i think you should even care about and there's only one of them that i'm going to say is the one to start with and stick there unless you're doing something more advanced and you you know you're kind of schooled on it and those are going to be consistency level one and quorum right um there are also multi-dc so when you have multiple data centers you can have consistency levels across those data centers so really if i go back one all one means is that in the case of a write i'm just gonna wait for an acknowledgement that one node wrote you know what the request was before i move on right in the case of a read if i'm reading at a consistency level of one i'm just saying hey i'm going to read from one node that's it right even though i have three nodes that could have my data from a data center standpoint when we're talking multi-dc consistency levels notice the local variants notice there's local one there's local quorum what does that mean well if i have multiple data centers and if i'm replicating a key space across say all three of them in this case if i then would say i want consistency one then when i go to like i go to read that data it could read from any node at any of those data centers but imagine if
those data centers were geographically dispersed right and maybe my application is talking to a local data center but i have a data center somewhere across the world i probably don't want to incur the latency of having to go across the world all the time so there are these local variants where you can say i want you to maintain consistency with this local data center so that's what local one and local quorum mean another thing to kind of note is let's say i was reading at a quorum now a quorum by the way just means a majority so if i have a replication factor of three a majority of three is two right if i had a replication factor of five a majority of five in that case will be three so i'm saying uh that's a case where uh if i were to do that locally at local quorum i'm only going to worry about getting that quorum at a local data center if i don't and i just say quorum then i might be getting a majority across my data centers so it's kind of a key thing to understand here's what i'm going to tell you though the takeaway here use local quorum just start there that is the best trade-off of both performance and robustness and consistency and everything um and if you need to do something more advanced like one or each or whatever then that's something you would research on its own but for purposes here today start with a replication factor of three start with local quorum by the way when you use astra db you're already being defaulted to all of these things right because those are pretty much the industry standards of how to do things all right so let's break down this consistency level thing a little bit more because this gets interesting right um and it can sound like a lot of words until you see an example okay so in the case that i'm writing at a quorum anytime you're writing and by the way consistency level and replication factor have a relationship consistency level is always based off of your replication factor so when i'm
writing some data no matter what consistency level it's at i'm always going to write to the number of replicas that are defined in my replication factor so in this case this client's going to write some data it gets a request i want to write some data it's going to write it to three nodes no matter what right but at a consistency level of quorum i'm saying i'm going to wait for an acknowledgement from two of the nodes all three by the way will acknowledge but the coordinator is just going to wait for the two fastest ones to acknowledge and at that point it's gonna go okay now you're good to go right now i've given the okay message back to the client back to the driver back to wherever it is right in the case of a read if i'm reading again at quorum again the request comes in and even though now this is actually kind of an interesting point when you read at quorum cassandra does a comparison it actually compares and checks the data and this goes back to another question i think somebody was asking about like consistency and how do i know like if i have the data replicated on three nodes how do i know i'm getting the right data so this is what happens at a quorum when you read at a quorum it's gonna read from two nodes right because you're saying i want to read from a majority so it's going to read from two of the three and it's going to do a comparison and it's also going to do a comparison of the third node but here's the key thing it doesn't compare the whole row of data and everything that's in there it compares essentially a digest a checksum right it's a much more efficient way to do the comparison to ensure that the data is correct without having to expend all the energy to read the whole thing now what it'll do in this case is if it detects that for some reason like data is stale on a node or something it'll automatically repair that data and it will return the consistent data to the client and this is of course done asynchronously so you
don't have to like wait for the repair or anything like that um so really this combination though of writing at quorum and reading at quorum is what we call immediate consistency what this means is if i write some data i want to be able to read the data i just wrote immediately after right this is using quorum quorum we call it read write quorum quorum right this is the standard to start here and this is why i say this is this nice balance between ensuring you have consistent data even in a distributed system but also maintaining good performance because obviously when i'm waiting for like acknowledgements from more than one node or something you know every time i add a node in to wait for an acknowledgement i'm adding a little bit more latency um so again this is this nice balance all right i said a lot again what kind of questions do we have anything we need to address rags uh the last question uh which we get pretty frequently uh from harry seti uh how cassandra is different compared with mongodb oh yeah so you know mongodb is primarily a document database um it also does use a leader follower system um it does not scale quite the way that cassandra does cassandra is primarily a tabular database it does have and we're going to talk about it later you're actually going to uh we won't use it today but i will talk about it um there's something called stargate.io which is another open source project that actually opens up other apis things like rest graphql and a document api you can actually store json objects natively in cassandra as well with stargate um the big difference though is going to be the fact that mongodb does have a master based system and it just does not scale the way that cassandra does um so that's the major difference other than the fact that mongodb is primarily a document database where cassandra has more models available to it okay by the way for those of you who want to get deeper on everything we're talking about here right again we can only
scratch the surface in two hours the homework is going to do this that's exactly why we give you the homework and there's other materials that you can get access to that'll really go into a lot of depth here all right so moving on with this now remember before i mentioned that cassandra can have multiple data centers right so take a look at the left i see that i have three data centers one over in north america one in emea one over in apac now because cassandra has this inherent automatic replication imagine i wrote some data from a node over there on the west coast of north america that data will not only be replicated to my local data center but also asynchronously to the other data centers what that means is i could write some data over there in north america and then i could read it from india right away right and that's at the speed of wire that's at the speed of wire right again writes in cassandra are extremely fast so it really comes down to speed of wire um and we see this a lot we see this kind of application a lot where people will geographically distribute their cluster and have essentially a global database now on the right hand side you see kind of another way this works where again i mentioned before cassandra's totally vendor agnostic it doesn't care where you put it you can put it on any of the cloud providers you can even have it on prem you can have it on a combination of on-prem and cloud providers imagine you have your on-premise cluster but then you need to burst up for black friday and then you want to scale back down you can totally do that like a hybrid cloud scenario where you burst up to a cloud provider and then come back down or maybe you want to leverage some capability that's in google cloud like machine learning capabilities and you want to just expose the data there you can do that maybe you want vendor leverage where you don't want to be locked into a particular vendor maybe you want to protect from outages every one of the three
major cloud providers in the last three years have had major outages maybe you want to protect against that these are all the kinds of various use cases that we see that people use this for the key thing is that you can essentially do any kind of combination of these whether it's hybrid cloud between your on-prem and a cloud provider or multi-cloud between multiple cloud providers and on-prem it's up to you um so cassandra is really flexible this way and again all this replication is completely automatic there's no like etl or file copies none of that it's being managed for you when you configure it all right and then last part of this section so we can get in some hands-on right because that's always the fun part we gotta get to the hands-on is use cases like questions we always get are what kind of use cases does cassandra fit now this is not an exhaustive list this is not the whole thing but this is a really good idea of the sweet spots for cassandra anytime you have an application that's really high throughput high volume things like event streaming iot time series that kind of deal definitely a good fit for cassandra mission critical one of the things we see from so many of the companies that run cassandra is that they need a database that's never going to go down um there's one example from home depot they've been running one of their important systems on cassandra that has never had an outage in something like six or seven years right that's the kind of thing that they're looking for for something like that and that's a really good fit uh for cassandra we just talked about the distributed piece you can totally distribute a cassandra database globally using multiple data centers and everything like that you can even go so far if you notice um compliance gdpr for those of you in emea you're totally aware of gdpr really and that means that there are extra security constraints there in that region but maybe not in the americas
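the quorum arithmetic and per-data-center replication described earlier can be sketched in a few lines of python — the data center names and replication factors here are made up for illustration, they just mirror the kind of per-dc settings cassandra's replication strategy expresses:

```python
# rough sketch of per-datacenter replication factors and quorum sizes --
# the datacenter names and factors below are hypothetical
replication = {"us-east": 3, "emea": 3, "apac": 3}

def quorum(rf):
    # a quorum is a strict majority of the replicas: floor(rf / 2) + 1
    return rf // 2 + 1

def local_quorum(dc):
    # LOCAL_QUORUM only counts replicas in the coordinator's own datacenter
    return quorum(replication[dc])

def global_quorum():
    # plain QUORUM is a majority across all replicas in all datacenters
    return quorum(sum(replication.values()))

print(local_quorum("emea"))  # 2 of the 3 local replicas must acknowledge
print(global_quorum())       # 5 of the 9 replicas cluster-wide
```

this is why the local variants matter for geographically dispersed clusters — a plain quorum can force the coordinator to wait on acknowledgements from replicas on the other side of the world, while local quorum stays within the nearby datacenter.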
right so how do you do that and so cassandra allows you the ability i can have multiple data centers and i can say well for this data center in emea i want to apply extra security for gdpr and stuff and then from the cloud native standpoint again you can totally deploy it on any of the cloud providers and intra cloud and all that kind of deal so again this just gives you an idea it's not the whole list but there are definitely some sweet spots for cassandra all right so then finally astra db you're going to use this here soon um this is a fully managed database that is based on apache cassandra it's serverless so what that means is it can scale up it'll expand and contract elastically dynamically um so as your throughput needs go up it'll automatically expand if they come back down it'll automatically contract it keeps costs down is the idea we'll use this today in the free tier again the free tier is pretty darn generous there's no way that any of the stuff you're doing today will ever even go past that um and so it's completely managed right uh you just spin the thing up and go and the rest is being done for you all right so with that um we'll pass it back over to rags so we can get into the first um exercise scan the qr code we'll drop the links as well there's the github repository so all the exercises all the work the homework everything is going to be in that github repo by the way um i do see real fast um can we use joins in cassandra no there are no joins in cassandra we're going to talk about that later in the data modeling section it's a denormalized data model we'll get into that detail but there are no joins all right let me pop over to you are you ready rags yeah i'm ready yeah go for it let's get over there are you able to see my screen yes i've got that showing yep all right um just wanted to be sure uh one of the nice things about today's workshop is that you don't have to install
anything right and um you know we have not crammed too much into it so um you know i'm sure you'll be able to get through this uh i really hate when someone is talking over me when i'm in the midst of a workshop so i'll try to shut up you know after i go through it initially uh my suggestion is if you can you know kind of keep the distractions down and hands away from the keyboard when i'm kind of walking through this it could help you um and then you know you can jump in and even if you don't finish uh you can really try later since everything is recorded and you can go try it yourself okay um my suggestion again here is keep a window open for the github link you know i'm using macdown here as you can see you know create your astra db instance and so on uh and then you know we're going to get into what is called the cql shell um you know in a moment um and you know be able to cut from here and paste to here that's all i'm going to do very simple right but as you're cutting and pasting uh you know just try to take a moment to you know kind of figure out what exactly you're doing um you don't have to go through exactly what we have charted out for you you know you can do some you know kind of help and walk around kind of you know chart your own course if you will but you know um what i'm going to show you is just like i said you know just go cut and paste right so the first thing you want to do is register uh oh by the way there are three um steps here one is creating a database creating a table in the database and then doing what are referred to as crud operations okay crud stands for create read update and delete and the idea is that you know we're going to walk through each of them one after another we create we read we update and then we delete right um pretty cool um so we're gonna you know if you want to head over to astra.datastax.com i'm already there right um you can use either your
github google or whatever email link i used email and you choose start free now um and basically what that does is gives you like uh 25 dollars is that right david per month yes yes and i've been told that's more than enough as you can see here i have spent like zero dollars and i don't even know how it can be zero dollars because i definitely did some stuff um so as long as i'm not being billed as long as i'm not putting in my credit card i'm fine okay yeah you'll never be charged exactly if you decide you want to put in a credit card or something like that that's up to you but that 25 dollars that he's talking about that's 40 gigabytes of storage and you know millions of reads and writes that rolls over every month right so as a matter of fact i usually have at least a dozen databases at any given time and doing a bunch of stuff and i've never ever come near my 25 dollars exactly okay um so here's where you're going to start you know you can either say create database or create serverless database you know serverless doesn't necessarily mean there are no servers in the you know in the back right you know there are servers somewhere right but the idea is that uh you know serverless is a paradigm uh that's getting a lot more um you know interest in the developer community because the idea is that you know you think about your business problem and not think about servers you know wherever it is in the world right so let me start with you know the create serverless database right so i'm going to create a serverless database okay and very simple uh you know there's uh usually a lot of questions about like uh case um you know uppercase lowercase and so on yeah my suggestion is just go with you know the case that is provided here okay so for the database name i'm gonna pick killer video cluster right and i'm gonna pick a key space name of killer video you may be wondering what the heck is a key space name right and really it's
basically a way to organize um you know your tables right so think of it as like a container right um the nice thing about this is it doesn't matter what you know cloud you want to create it in and i probably don't even care uh you know pick whatever you know you want google cloud aws microsoft azure uh it doesn't really matter um you know i will pick microsoft azure just because i can right um and i'll go with north america and i'll say south central us you know i've tried other things too it doesn't matter right and just say create database okay so again you know if you go back to the instructions it'll give you this and it'll tell you that you know this basically uh comes up as pending right so essentially you know the um the servers in the back room are working right and trying to get this um up um and in a moment you will see that this will turn into active um okay anything you want to add david in the meantime no no yeah no we're all good yeah questions are good and everything we're all caught up all right right um so you know this is the time when we play some music right don't don't don't don't don't yeah let's do that all right let's do that exactly it's exactly at kind of the halfway time right so we'll yeah play some music all right here i'll just turn on some light music for us to code by right i mean i definitely don't think the audience wants to hear my jokes so you know i think i don't know dad jokes are always fun especially the ones that are terrible which i'm good at i'm good at bad dad jokes yeah maybe you should try that all right it's still pending but it's going to take a little bit of time it's going to come up yeah and give us a thumbs up once you get to a point um where you have done uh what rags is showing whether it's pending or initializing give us a thumbs up and let us know you're there because that is the exercise at this point right and then we can
move on to the next section yeah actually it's a good idea you know maybe you should get this going while i'm working through this side i know i told you that you know if you can pay attention while i'm um going through this it might help so what you can do is take a moment you know to create the database name and the key space name right uh and now that it's available um you know i can go ahead and select the cql console and you know basically it shows you you know through animation you know how you can do this there are now several different ways for doing this um you know you can just click here or you can say connect um and you know you can bring up the cql console okay that's right oh and sorry rags once where's my mouse okay once you get to the point where you create your database yeah give us that thumbs up and then we'll move into the um into the data modeling section before we go to the uh create a table piece yeah and even if it's not active um give us a thumbs up because you know it can turn to active in a little bit so that's right yeah pending or initializing is totally cool and we'll go ahead and move into some theory crafting while your databases are coming up and then by the time we're done it'll be good to go yeah awesome you go hey thanks sir deep i appreciate that i see some thumbs up coming in now i do see that stefano answered the question uh from uh devendar so astra db is powered by google aws or azure cloud and you have the choice to choose from and i figured even though stefano answered it i figured we should just talk about that real fast yes you can deploy astra db on any of the three major cloud providers you don't need an account on the major cloud providers or anything like that you don't need to pay or whatever it is completely being handled by astra db right it gives you flexibility is really what it comes down to in the vendor that you
want to choose for it um so all that cost and everything is just rolled into how astra works all right i see lots of thumbs up coming through pending is cool yeah pending is totally cool don't forget you're spinning up like a fully fledged three node cassandra cluster right now it takes a moment to do it it will take a couple minutes pending or initializing is totally fine yeah all right yeah i should have probably started with astra as like uh you know database as a service right you know which i completely bypassed but uh you know essentially i'm hoping that most of you are there so i'll proceed uh with this and like i said just have fun just relax if you think you're kind of falling behind you probably are not because you know there's plenty of time to catch up here okay and uh you know we have set the pace to be very relaxed okay that's right there is a lot of education okay so you can do a describe so i'm just going to go through these different steps right so you can see um you know i'm describing the key spaces okay uh desc is short for describe and you can see it shows the um key space that we just created which is the killer video okay this is something that we're going to use uh you know for other applications but you know right now um you know we're just preparing for that um you know the idea is that you know we're going to create the database and use this in an application right so now what we're going to do is we're going to use the database and like i said you know if you can stick to the um case that is given here it's a lot easier because sometimes so now you know you can see that the prompt changed and i'm using the database killer video okay um so not a whole lot there right and then now what i'm going to do is i'm going to create a users by city table okay so essentially what i'm going to do again is very simple i just cut and paste it right um but take a moment again to review what
we are trying to do here right so we are creating a table if not exists kind of defensive programming right you know you just want to do it anyway uh users by city and then we are going to provide you know what are the attributes right um you know like text for you know city uh first name last name and so on and you know based on what david was saying one of the key things is how do we partition the data right and that's where the primary key comes in here so as you can see here you know the primary key is what you know city last name first name and email okay that makes sense and i'm gonna hit return so really not a whole lot happened there right um you know except that this particular table got created now i can do the describe tables like anything else you know i can just you know use my up arrow key uh and you know kind of modify you know you can do all this the usual um you know command line stuff right and you can see here the users table basically is there okay um and that's really it you know we will get to the crud operations in a moment but you know creating the um you know the table was pretty straightforward uh if any of you are having issues we'd be happy to walk through it but you know worst case just go back and recreate it you know it doesn't take a whole lot of time okay um with that let me hand it back to david all right we'll give everyone a moment to create that table right so give us a thumbs up if you've done this you did step two along with rags and you've created the table give us a thumbs up let us know you're there and i'll wait to see some of those before we start moving forward yeah so there's a question by prashant gupta which is key space is like schema in rdbms and the answer is no right so key space is really a way to organize um you know your tables the schema is what you saw here you know which is like create the table right where i created a table with a number of different uh attributes that make
sense yes let's see i'm answering a question here in the chat um [Laughter] so if you're at any time wondering you know what commands you know you want to uh you can just do help um and if you do help uh you know the particular command it takes you directly to docs.datastax.com um and you know gives you a little bit more uh information on that so you know feel free to kind of play around with you know with some of those uh but like i said you know the ones that you absolutely need are on the github link exactly all right so let me go ahead and i will put my screen back and then i'm going to answer a couple questions that i think everybody could benefit from prashant's asking a really cool set of questions right um one is uh in a key space we can create multiple tables absolutely right a key space is a container of tables you can have n number of tables in a key space no problem um there was a really good one by uh dr ritesh singh malik um asking about pretty much what kind of throughput can cassandra handle and something we haven't really talked about is when people ask who uses cassandra right what's it good for what are some real world use cases i love this question pick up your phone and i guarantee you have at least an app more likely handfuls of apps that are using cassandra today some that everybody knows about right netflix is a great example netflix has been a long long time cassandra user they were in since the beginning they have something like 10 million operations a second they handle and in netflix imagine when you're playing pausing you're interacting with it you're getting recommendations all those millions of people on netflix around the world at any given time that's using cassandra right apple is another one that has again many millions of transactions a second um you know siri is actually powered by cassandra spotify uber home depot lowe's a bunch of the banks and credit
card uh banks and stuff like that all use it um sony playstation is another one right when you're interacting with sony playstation you are using cassandra on the back end um so i mean cassandra is used by a really solid amount of the fortune 100 companies a lot of disruptive apps are using cassandra it can scale and what's really cool about that when we talked about how in the beginning i realized now i may have glossed over this so yes it's a distributed database right but what does that even mean cassandra was built to be able to horizontally scale it can essentially scale indefinitely and it also scales linearly so if you want to double your throughput you double your nodes right and that holds whether you're talking 32 nodes or if you're talking a thousand nodes or something like that and that's exactly why companies like say netflix and apple and verizon's another one why they can scale up the way they can because compared to how i might do things with a relational database where i'm going to scale vertically meaning i'm going to increase the speed or the amount of my cpus my ram my disks things like that um cassandra will scale out horizontally meaning i just add more nodes i keep adding nodes um so that is that's another kind of a core thing that i totally glossed over but to your question uh dr ritesh um yeah from a scalability standpoint that's one of cassandra's strong points it's one of the things that it's picked for okay so david uh i think we need to highlight the conversation between uh amal gaikwad and uh stefano um and essentially uh i think in the true spirit of open source we need to acknowledge all the contributors for cassandra right and uh one of the biggest contributors is apple uh absolutely is a huge contributor yeah what you're seeing is not just you know datastax's effort but obviously we're doing a lot but uh i think the entire community is contributing as well which is why you know it's kind of one of the
leading nosql databases yeah that's right that's right walmart's even a big contributor they're a big user as well yeah no thank you for pointing that out all right so hopefully you see my screen and it looks like i got a lot of thumbs up so i believe people are kind of sticking with us creating the table great now you may have noticed when you created that table for those of you coming from the sql and relational world that looked really familiar and that's because cql what you used there is a subset of sql the syntax is very similar of course there are going to be some differences it's a different database but for the general um dml commands ddl commands that you know data manipulation and uh data definition language commands they're going to look really really similar and that's on purpose right so now what i want to do though is in this next section we're going to start getting into some of the data modeling piece um we've talked about things like partitions but you know what exactly is a partition what does that really mean so we're going to kind of like start at the very very base level build it up and then we're going to get into the art of data modeling which is how do you properly data model in cassandra which is probably one of the most important things we'll talk about today all right so let's make sure my here we go all right so the first thing i meant we're going to really start at the base here right let's just look at the basic structural elements that we're going to work with now cassandra at its core is a tabular database and in that i'm going to have things like rows and columns so again if you're coming from a relational database or if you've ever used an excel spreadsheet or something this is all going to look really familiar right so the base the smallest tiny unit that i can have to store data is a cell right a cell is essentially just an intersection between a row and a column it's a single it's going to be a single piece
of data a single intersection of data then if i step up i'll have a row right so a row is just going to be a single structured data item that's essentially multiple related cells right so in this case i have four columns that comprise my row one john doe and then wizardry now this is where it gets a little different right so the way that i mentioned before the way you store data in cassandra is via partitions so i want you to take a look at what's going on here notice i have three rows right but look at the department at the very right they're all wizardry and where that box is so when cassandra if the department here is my partition key let's say that when i created the table i said department is my partition key that means that these three rows will be physically stored in the same wizardry department right on the same wizardry partition sorry so they're physically stored together so this is kind of a key difference from how relational databases work now if i had three rows but they had different departments in this case wizardry dark magic and devrel those would be three physical partitions those would be three separate physical partitions so again going back when i have the same partition key the rows would be stored in the same physical partition if i have different partition keys then those rows would be stored in their own physical partitions and a key concept to understand here is even though i'm looking at a logical table let's say i just said select star from this table and i got this back those three rows and those three different partitions could actually be located on three separate physical nodes right they could be distributed around my cluster it's a really key concept because when i'm when i'm pulling data back uh if i do just an open select star from without a where clause i'm actually going to be pulling from nodes all around my cluster so we talked about a key space being a container of tables right let's break down this partition thing a little 
bit more so i have key spaces that contain my tables then i have my tables right that's going to be my grouping of any of the rows and columns that i have and those rows and columns are broken up by their partitions let's look at a concrete example i like concrete examples this is how i actually learn things otherwise it's a little too abstract sometimes right so i'm going to have some key space killrvideo again it's just my domain right killrvideo by the way happens to be one of our older reference apps that is like a youtube lite right so it stores videos and users and things like that so any of the tables that i'm going to have in killrvideo are going to probably be in that domain then i have some table in this case users by city now there's a convention here in cassandra we say what it is i'm storing by what i'm partitioning by so i'm storing users partitioned by city okay so i'm saying my partition key in this case will be city you see that on the left so phoenix and seattle are my partition keys so notice that in this table i actually have two partitions with three rows per partition now there's another concept here at the bottom called clustering columns so in cassandra if you want to order data or if you want to determine uniqueness in your primary key you set what's called a clustering column so what's going on there so let's take the first example of the phoenix partition there at the top and then notice the clustering columns last name and first name so the way clustering columns work is if i set say last name to a clustering column and it's text it will automatically order that alphabetically so when i store the data when i write the data to the database it'll actually be ordered in memory where it's fast and stored on disk in that order why am i pointing this out to you the point here is this cassandra is optimized for performance at scale and the write performance is actually extremely fast where you really
really have to optimize is on the read you want to ensure that read performance is still good now in a relational database if i want to order some data even indexed data i pay for the cost of the order at the read in cassandra when you order data with a clustering column you pay for the cost of the order at the write when it's in memory it's flushed out to disk in that order so later on when i read it it's already in that order so this is a nice optimization that increases my read performance that's really what it's about and what it really comes down to if you take a look at the last names there just notice they're already in alphabetical order right so if i go to read that data out it'll automatically be in alphabetical order if i set that to a clustering column and both partition keys and clustering columns are part of the primary key i'm going to talk in a lot more detail about that here in a moment and then anything else is your data right so again i'm storing user information and i'm partitioning by city that's what this is saying here all right so uh we did just create a table a second ago um so i'll just go over it again real quick again it looks just like sql so i'm going to say create table with some table name then i'm going to have a set of column definitions in there right city last name first name so on and so forth i'll have their field types you know these happen to all be text but by the way everyone always asks well what types are there it's pretty much comparable to what you might expect in a relational database um where you might be used to varchar we use text funny enough you can use varchar but it's an alias to text and there's no byte limitation right in a varchar i have to specify how many bytes in cassandra it's technically unbounded but all the different types you could think of you know various blobs and collections and ints and doubles and longs are all here um and then notice my primary key the first
value in your primary key is always your partition key and in cassandra it is required that you have at least one partition key you always have to have a partition key because without it it's not going to know how to partition the data right so you have to have that notice the parens there in the partition key it's a nice convention right you don't have to have it but if you're using a composite like if i have multiple columns in my partition key you might as well just use it that way it's very clear where your partition key is anything else after that last name first name and email are my clustering columns so in this setup i'm saying i'm going to partition by city and now i want to cluster or order by last name first name and email all right so let's go oh yes go ahead yeah go ahead one of the questions from saksham arora is in addition does the partition key eliminate the need to normalize tables is that correct and uh i'm not quite sure i understand that but you know i don't think they're related the partition key and like we said um you know data is denormalized so it's not really in the third normal form that you would expect uh you know like in sql and the idea behind that is you know in many of the use cases um you will do a lot of joins which makes the operations very expensive right so sometimes it's easier just to denormalize and treat those use cases so partition keys and normalization i don't think go together but maybe i'm missing something yeah and we will and i see stefano said the same thing we will address that here as we get into another part of the data modeling piece um you're correct though we don't normalize tables right we denormalize tables um but we will address that later so i'm going to hold on that one um i do see another question from uh sashbot say does column clustering happen on its own or do we define it you define it you tell it um which ones when you define the primary key like
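in cql syntax that convention looks like this (a sketch using the column names from the workshop example; the second form with country is a hypothetical composite just for illustration):

```sql
-- single-column partition key; the inner parens are optional here
-- but make it obvious which column partitions the data
PRIMARY KEY ((city), last_name, first_name, email)

-- composite partition key; the inner parens are required
-- to group the two partitioning columns together
PRIMARY KEY ((country, city), last_name, first_name)
```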
you see up there in the example when you have anything after your partition key those are automatically clustering columns and those are part of your primary key yeah okay so the two main things the primary key does for you is it ensures uniqueness and it can define sorting with your clustering columns right so if you take a look at these examples here notice that in the first one at the bottom there you'll see city last name first name email right again we want to ensure uniqueness so you know the combination of city last name first name and email ensures a unique row right that's the key thing and i'm going to dig into that a little bit more here in a moment and show some good and bad examples the second example is just a user id now something key in cassandra we don't tend to use ints for user ids or ids we tend to use uuids reason being is for collisions if you use ints like a lot of times in a relational database i'll use an int as an id which works fine in that particular case but in a distributed system if you go back to what i was talking about earlier if we had like two nodes and there was a network partition and they can't maintain consistency what happens if during that network partition i get a request to create a user on both nodes and i'm using an int well guess what they're just going to increment to the next one so they both go okay i'm going to increment to two well then the network comes back they talk uh oh i have a collision and in cassandra the last write wins so what's going to happen in that case is the one with the last timestamp is just going to win right the other one will just go away so to protect against that we don't use ints usually we use uuids which you can generate in code and in the drivers all over the place but that ensures that the ids will be unique even when nodes can't talk to each other right that's kind of a key thing but again the key thing of the primary key is to ensure
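a quick sketch of that in cql (hypothetical table and column names; uuid() is a built-in cql function that generates a random type-4 uuid server-side):

```sql
CREATE TABLE IF NOT EXISTS users (
    user_id uuid,
    email   text,
    name    text,
    PRIMARY KEY ((user_id))
);

-- uuid() is generated per-write, so two nodes that can't talk to
-- each other won't collide the way auto-incrementing ints can
INSERT INTO users (user_id, email, name)
VALUES (uuid(), 'jane@example.com', 'jane doe');
```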
uniqueness and you can define sorting now the partition key yes it's part of the primary key right you can argue well wait if i only have a partition key in my primary key it's the same thing but your primary key can obviously include clustering columns and such but the partition key is what you're going to use to partition your rows right so look at the example there um in the first one user id again i'll have my uuid i'm going to use that's going to ensure that each user that comes into the system in this case would actually have their own physical partition right now in the second case there you'll see where i have video id as the partition key and then comment id is my clustering column now there's a distinction between these two examples the first example is what we call a single row partition that means any partition will only ever have a single row a single user because i'm only defining a partition key user id in the second example it's a multi-row partition why because for one particular video id i could have tens or hundreds of comments right so those will actually be stored as separate rows within that same physical partition but the key thing is that is how i'm partitioning my rows now getting to the clustering columns this is what i was waiting for to answer that question um because i like these examples that kind of show you good and bad examples so again clustering columns are there for sorting and or uniqueness and in the first example there with the red x you see city last name and first name what happens if i had a city of orlando but i had two jane does right cities could have many millions of people i could totally have two people with the same name in the first case they wouldn't be unique so if i had a jane doe in orlando and then i had another jane doe in orlando who was a different person the last one would just win it would just overwrite the first one but notice that second example with the green check mark where
we added email now email is being added in there for uniqueness generally no two people are gonna have the same actual email so now i have a good unique row two people can be jane doe in the city of orlando but they have different emails now i have a unique row now on the bottom there what happens in this particular case right if i want to use a clustering column for sorting so in the first example with the video id and then comment id only well think about it if i'm looking at comments on like a youtube video or something like that i probably want them in time order because if the comments just come in randomly then i'm going to lose the feel of conversations it's not going to make any sense right so it's not being sorted in that case because the comment id is going to be something like a uuid that doesn't sort very well but if you look at the example at the bottom we've added created at now with created at in there as a clustering column that will naturally order based off the timestamp right so that'll give me a nice order to the comments that are coming in and they'll make a lot more sense so some rules you want to think of for a good partition you want to store together what you retrieve together as rags mentioned earlier in cassandra we use a denormalized data model what that means is this i want to store everything for a particular query in a table that way when i go to read i go to select and i get the answer of that query i have everything that i need i want to store together what i retrieve together and in that first example there um for any particular video let's pretend i have a thousand comments right they're all gonna be stored in that same partition so when i read the partition for that one video i get all the comments in the second example though i'm storing all the comments separately in their own partitions if i had a thousand comments i would have a thousand different partitions that i would need to go get if i want to get all the comments
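the good version of that comments table sketched as cql (hypothetical table and column names, following the store-x-by-y convention described here; the uuid in the select is just a made-up example value):

```sql
CREATE TABLE IF NOT EXISTS comments_by_video (
    video_id   uuid,
    created_at timestamp,
    comment_id uuid,
    comment    text,
    -- one partition per video; rows ordered newest-first by created_at,
    -- with comment_id breaking ties for uniqueness
    PRIMARY KEY ((video_id), created_at, comment_id)
) WITH CLUSTERING ORDER BY (created_at DESC, comment_id ASC);

-- a single partition read pulls back every comment for the video,
-- already in time order
SELECT * FROM comments_by_video
WHERE video_id = 245e8024-14bd-11e5-9743-8238357b8e32;
```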
for a particular video right i'm not storing together what i retrieve together in the first example i am i'm going to store the video along with its comments and that way i have it in one shot i also want to avoid big partitions now cassandra you know technically things are unbounded but there are practical limits um things like 2 billion cells per partition it's really good to keep well within 100 000 rows in a partition or 100 megabytes per partition right um and that's you know if you think about it from an i o standpoint i might be able to get to that partition real fast but i have what's called a really wide partition now i'm going to incur a bunch of extra work to get all that data out right so you really want to keep within these kinds of constraints so in the first example there it's the one we've been using um you know generally speaking videos don't tend to have a hundred thousand comments if they do and you have like some celebrity well that's a wonderful video good for you um maybe you want to break it up a little bit um but generally you'll be fine but in the example below it the country and user id imagine if i had a partition with a country like india and i had all the users for india i'd have like a billion users right i'm going to be well past the 100 000 row piece um so that's going to be a huge partition right so again i want to keep within the constraints that you see there all right so let's look at an example of uh what's an unbounded partition right now i mentioned earlier that cassandra is really good at iot uh use cases and it is this sensor id here thing is actually an example of an iot use case where i have some sensor and maybe i'm going to report on its state every 5 or 10 seconds or something like that but if you do the math in about a month right you're going to have a pretty good amount of data here and the trick is this one is unbounded there's nothing that's going to cap
the amount of growth in this particular partition because for that particular sensor every 10 seconds as i'm reporting data that's just going to keep growing and growing and growing you don't want to do that right so if you have a case like that you might want to do something called bucketing all it really means is i'm going to break up the partition into something that will help me cap the size so look at what we did here in this example so i'm still using sensor id i still have my reported and i'm still recording the data every 10 seconds but now i've created a composite partition notice the parentheses so both sensor id and month year are part of my partition key right so it's a composite it's multiple columns but now my sensor id partition will be capped by the month year so i will only ever store the amount of sensor readings those 10 second sensor reports for a month at any given time so i know exactly what that size will be and that's a good way to do it right that's a good way to kind of break it up and i'll never have unbounded partition growth in this way unless of course i then you know decrease my uh sensor collect time to like a second or something but even that will still be bounded by the month year right so that's again a way you can break it up so if you find that you have a case you're like oh man you know as i'm trying to think about this my partition is going to be kind of big well then how can you break it up this is an example of how you might do that okay and the last one here are hot partitions there can be a relationship between big partitions and hot partitions hot partitions just mean these are the partitions that get the lion's share of requests so imagine for a second going back to the country example imagine for a second i had a partition like the one you see on the bottom with the red x partitioned by country and user id like this so go back to india right now let's compare that to say finland right finland has significantly less
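the bucketing pattern sketched in cql (hypothetical table and column names; month_year would be computed and written by the application, e.g. '2020-06', and the uuid in the select is a made-up example value):

```sql
CREATE TABLE IF NOT EXISTS sensor_readings_by_month (
    sensor_id  uuid,
    month_year text,       -- e.g. '2020-06', set by the application
    reported   timestamp,
    value      double,
    -- composite partition key: the month_year bucket caps partition growth
    PRIMARY KEY ((sensor_id, month_year), reported)
) WITH CLUSTERING ORDER BY (reported DESC);

-- reads then target one bounded bucket at a time
SELECT reported, value FROM sensor_readings_by_month
WHERE sensor_id = 5132b130-ae79-11e4-ab27-0800200c9a66
  AND month_year = '2020-06';
```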
people than india does so there's a very good chance all things else being equal the amount of requests i'm getting you know per user and stuff like that all things being equal there's a very good chance that if i have the india partition with like a billion people and uh the finland one with you know many million that the india partition's gonna get a lot more of those requests i'm gonna have an unbalanced database right because i'm gonna have a small subset of nodes that are gonna be getting all the requests while the rest are kind of quiet that's what we mean by a hot partition right so we want to kind of you know guard against that um that one in the middle with the question mark what we're saying is this in the case of a video what happens if you have a particularly uh popular video right we can totally see that in social media you could argue whether or not you might end up with some hot partitions in that case where in the case on the top you could also make the argument maybe a particular user has a lot more but usually not at the scale that something like you know a video might have with a ton of comments or whatever but again these are just things to think about to keep in mind while you're creating your tables so you don't you know you don't start off with something like that bottom example there okay so we already did this exercise how are we on questions i want to ask yeah before we move on so david i'm going to ask you a question does uh a hot partition usually mean that you know the key selection is bad or um you know not necessarily yeah um yeah if anything it might mean that you just need to take another look at your partition key right and yeah your partition key in particular like in the case that we have here um that one's pretty obvious that's why we use it as an example right um because there is such a disparate you know it's such a big difference between the amount of users i might have in india compared to a country like
finland right if you see something like that and you're like uh-oh you know like maybe i'm getting some hot partitions you might want to think about a way to bucket it right think about how can i split those up to kind of decrease the surface area a partition has if you will um so yeah i would say yes if you have a hot partition you need to look at your keys absolutely yeah are there any others did i answer your question rags yes you did okay yeah because we uh we had a number of questions on this right um you know usually we do right so yeah rodrigo asks um a good question here can i add a primary key in a table with some data already great question no once you have defined a table and you've already defined your primary key you can't change it if you need to evolve that's called schema evolution right if you need to evolve your schema and you need to change your primary key you need to create a new table with that primary key and migrate the data over and here's why think about it the partition is literally defining the physical location of your data around your cluster and imagine again going back to my thousand node cluster scenario or something even though that really doesn't matter so much if you're using a replication factor of three but imagine you have a huge data set right imagine you have a table with a billion partitions in it you totally could and you change that primary key you change the partition you literally have to now all those token values and everything you have to change the whole thing so you literally would have to reshuffle all that data right so that's why you can't do that so yes if you need to migrate your schema then you will create a new table with that new primary key and migrate the data over okay let's see yes um oh yeah ivan garcia asked how many rows or records per partition is adequate to maintain to keep the good rules in place within a hundred thousand keep it under a hundred thousand if
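a hedged sketch of that migration path (table and column names are hypothetical; cqlsh COPY works for small data sets, and for big ones you'd use a bulk loading tool or a driver script instead):

```sql
-- an existing table's primary key can't be altered, so create a
-- new table with the key you actually need
CREATE TABLE IF NOT EXISTS comments_by_video_v2 (
    video_id   uuid,
    created_at timestamp,
    comment_id uuid,
    comment    text,
    PRIMARY KEY ((video_id), created_at, comment_id)
);

-- then migrate the data over, e.g. via cqlsh COPY for small tables
COPY comments_by_video (video_id, created_at, comment_id, comment) TO 'comments.csv';
COPY comments_by_video_v2 (video_id, created_at, comment_id, comment) FROM 'comments.csv';
```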
you find that you're going to go above that maybe bucket right find a way that you can split up that partition a little bit and use a composite alrighty so we did the create a table piece so i'm going to go ahead and move into the next section i think we're caught up on questions let's see um i do see a question real quick i'll just answer it about badges and how do you get it to reflect in your linkedin account there are a couple different ways you can do that um you should be able to add it to i want to say the featured section of your linkedin profile you can add the url directly and it'll show up you can also just put it on like your normal linkedin profile so there's a couple different ways you can add that and i'll answer this one last one for prashant and then move on uh prashant asks can i import my sql db table into cassandra so while you can technically take the same exact table and put it into something like cassandra i would argue that's a bad practice and here's why unless you have a table that is completely one-off and not joined with anything else right and such then usually you're gonna and we're gonna talk about this right now usually you're gonna have a case where there's this concept of joins and your tables are gonna be organized appropriately based off of normalization that is not how we do things in cassandra so it may not be a good fit right you really want to take a look at how you're going to denormalize and stuff depending on the kind of relationships you have in your data so with that let's talk about it let's do it so um some terms to be aware of if you're not already are normalization and denormalization um so if you've done anything in the relational world you're probably familiar with normalization to some degree right um so that's where we're going to use things like third normal form we're going to normalize our data really what we're doing is reducing the data redundancy and
increasing data integrity right so in the example you see here on the right i have two tables my employees table and my departments table in one i'm just storing my employees right each row is going to be a single instance of an employee that's it in the departments table each row will be a single unique instance of a department i'm not going to have any duplicates or anything like that and then i'll use things like foreign keys and stuff to reference between them right so notice up in the employees table for edgar codd there we also have department id of one right so both edgar and raymond are in the engineering department now there are some pros here there are some definite pros with this for one writes are really simple if you want to add a department i just add a row to my departments table done right from a data integrity standpoint the database handles that for you um why because i've got these keyed relationships and such again i have these single instances of data and i just join to them you know so the database handles that part for you the cons though are getting into that scalability piece reads can end up very slow how many of you here have had you know a query with like 20 nested joins and stuff like that and you have all these cartesian joins you have to do and everything it can really slow things down and that can lead to some extremely complex queries right so you know at some point there's diminishing returns depending on how nested all of your joins and such are now denormalization is usually a process used to flatten out data that is there to optimize read performance if any of you have ever done any data warehousing you've probably done this in your relational database where you flatten out data the big difference though is this in a denormalized data model i may have redundant copies of data right so if you look at the example here notice what we've done we've essentially flattened out the
example from before employees and departments into a single employees table that includes the department but if you remember both edgar and raymond were in the engineering department and in a denormalized data model i'm replicating that data right i have redundant data in here so here's the trade-off this is where this comes into play by the way yes this is what cassandra does this is what we do when we data model in cassandra so cassandra makes a trade-off with being able to perform at scale and using up more disk right so back when relational databases were created if you remember um it was these huge platters that were extremely expensive very slow and had like five megabytes maybe right or something like that um so there was a you know like space was at a big premium um so they really had to reduce the amount of space that was being used and that's where a lot of things you know if you look at when normalization was first invented it was during those times um so there was a you know again that was a huge commodity and plus disk was extremely slow back then and in the days when cassandra was born you have things like ssds out of all the different commodities things like cpu ram things like that disk is actually the cheapest and it continues to get cheaper and cheaper and faster so cassandra essentially makes a trade-off and says you know something i'm going to use more and faster disk to optimize my read speed at scale so we use a denormalized data model here and we'll flatten the data out now there are you know there are a set of pros and cons pros big ones your reads are faster that's a big one why because here i don't have a cartesian join i don't have the extra logic and overhead of doing a join all the data's there i read my partition i get all the data my queries are also extremely simple they're much simpler you'll find that most queries in cassandra are like a single line um because you don't have any joins you
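the flattened version sketched in cql (a hypothetical table; the employee names come from the slide example, and the department text is simply repeated on every employee row):

```sql
CREATE TABLE IF NOT EXISTS employees_by_department (
    department text,
    last_name  text,
    first_name text,
    email      text,
    PRIMARY KEY ((department), last_name, first_name)
);

-- the department name is written redundantly on each row,
-- trading disk space for a single-partition read with no join
INSERT INTO employees_by_department (department, last_name, first_name, email)
VALUES ('engineering', 'codd', 'edgar', 'ecodd@example.com');
INSERT INTO employees_by_department (department, last_name, first_name, email)
VALUES ('engineering', 'boyce', 'raymond', 'rboyce@example.com');
```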
The logic is very simple. Now, there are cons, though. In the normalized case we saw before, when I want to write, say to add to the department table, I write to one place and I'm done; but in Cassandra I'd have to update every place that data exists. So it makes for multiple writes, and integrity here is actually handled by you, in the application. So there are definitely some differences: if you have something like a department being written across multiple tables, it's up to you to manage that. Now, there are mechanisms in Cassandra, things like batches, where if you need to keep keys consistent across tables you can do that, but it is generally handled by you.

All right, let's take those two things and look at this. If you haven't listened to anything else, listen to this part, because this is the paradigm shift, and this is the way I want to help change your thinking a little bit when doing things with Cassandra. When we data model in the relational world, we start with our data, we apply our third normal forms and things like that; we model the data, and that determines and generates what our schema and our relationships are going to be. And then, as application developers, what do we do? We reference the big ERD and the schema. Back in my day, when I was doing this with Oracle and other databases, it would usually be this whole wall of all the tables and joins, and you'd go up to it and figure out: okay, I've got these tables, I need to join these, these are the keys I need, all that kind of stuff. The key thing here is that we, the application developers, figure out the query based off the schema. So you start with your data, you model the data, and then you apply your queries and figure out what they are. In Cassandra, we literally flip this on its head.
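Since batches are the mechanism mentioned for keeping denormalized tables in sync, here's a hedged sketch; the table and column names follow the workshop's comment tables, but the literal values are illustrative assumptions:

```sql
-- A logged batch writes the same comment to both denormalized tables;
-- either all statements eventually apply, or none do. Note the same
-- commentid is bound to both inserts so the two views stay consistent
-- (in a real app you would generate one timeuuid client-side).
BEGIN BATCH
  INSERT INTO comments_by_user (userid, commentid, videoid, comment)
  VALUES (11111111-1111-1111-1111-111111111111,
          e7ae5cf3-d358-11e2-b38b-0800200c9a66,
          22222222-2222-2222-2222-222222222222,
          'Great video!');
  INSERT INTO comments_by_video (videoid, commentid, userid, comment)
  VALUES (22222222-2222-2222-2222-222222222222,
          e7ae5cf3-d358-11e2-b38b-0800200c9a66,
          11111111-1111-1111-1111-111111111111,
          'Great video!');
APPLY BATCH;
```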
We start with the application workflows, we use those to generate our data model, usually a query per table, and then we apply the data. It's a little bit of a shift. Now, a lot of people will ask: wait a minute, if my application workflows generate my queries and tables, does that mean I need to know my queries up front? Are you insane? Yes, it does mean you need to know your queries up front, but I'm going to show you a process where you can start with your application workflows and literally map out what your queries are going to be. All right, how are we doing on questions, by the way, before I move on there, Rags, you doing okay? I think we're good, you can keep going. It's been answered? Okay, great. Wonderful.

All right, so let's see how to do this, because this is a lot to swallow; at first it can seem daunting, but it's actually not, the process is pretty straightforward. We want to take a combination of our conceptual data model, the same kind of process we'd use in any kind of application development, so it should look familiar, plus our application workflows; we use those to generate some pseudo queries; then we start to map that to our logical data model, and then we optimize for our physical data model. So let's break it down. First, my conceptual data model. If you've done application development before and you're used to the relationships between entities and attributes, this will probably look familiar. Now, if you remember, in KillrVideo, our YouTube-lite example, we have this relationship between users, videos, and their comments. What we're seeing here is: on the left I have users, an entity with some attributes (email, id, and other things), and then I have videos on the right, another entity with a set of attributes, and they have this relationship through comments. Users comment on videos, and videos are commented on by users, so I have a many-to-many relationship between them. Now, even if I start on a napkin, just writing out what my application is going to do, I'm literally starting to write out my application workflows and my pseudo queries. So let's take two use cases. One: a user opens a user profile, and in that profile, for that user, I'm going to find all the comments related to that target user, using their identifier, probably some id, and get the most recent first, because as we talked about before, comments probably need to be in some kind of time order. The second case: I open a video page, I'm going to open the video, and I want to see all the comments for that video, again using some identifier for the video, and list the comments in time order. So now I've started to map out my workflows, and here are my pseudo queries. In the first case, I want to find all the comments posted for a user with a known id. Now, in Cassandra we use a query-per-table approach, so I want comments by user, and remember the convention: what I'm looking for, "by" what I'm going to partition on. The pseudo query says I want to find the comments posted for a user with a known id; that's it, comments_by_user, that's the table. In the second case, I want to find comments for a video with a known id: comments_by_video, comments partitioned by video. And I can use this alone to start to generate my queries: select star from comments_by_user where the user id is some value, and the same thing for comments_by_video. So I've done nothing but look at the application workflow, and we can already start to generate our pseudo tables and our queries.
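The two pseudo queries described above come out as one-line CQL reads; the id literals here are illustrative placeholders, not values from the workshop:

```sql
-- One query per table: each read hits exactly one partition.
SELECT * FROM comments_by_user
WHERE userid = 11111111-1111-1111-1111-111111111111;

SELECT * FROM comments_by_video
WHERE videoid = 22222222-2222-2222-2222-222222222222;
```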
Now let's start to map this into an actual logical table. In the first example, comments_by_user, we already noticed that we want to find comments by some user id. What you're seeing here is called a Chebotko diagram; Artem Chebotko is actually on our team, funny enough, and he's the person who invented this approach. The K means partition key, the C with the arrow is a clustering column, and the arrow shows the direction of the ordering. So I'm saying here, in comments_by_user on the left, that I want to partition by user id, then cluster by creation date in descending order, and then also cluster by comment id. And remember, the primary key defines a unique row, so I'm saying the combination of user id, creation date, and comment id is my primary key, and then I have my data: my video id and my comment. Look on the right-hand side at comments_by_video: the only real difference is that we're partitioning by video id; it's almost the same otherwise. The question that always comes up is: wait a minute, I'm repeating the comments, I have redundant data. Yes, you are, because remember from before: I'm going to store together what I retrieve together, and we have this query-per-table approach. That means that for a particular query, I want to be able to get everything I need in a single read, because this way, if I read a single user id partition, boom, I have all the data; or if I read a single video id partition, I get all of the data in one read, with the smallest amount of I/O. This is really optimizing for performance at scale. Now, when we go to the physical model, all we're doing is applying the types and maybe doing some consolidation. So if I go back to the logical data model, we see I have my fields. Now here's a difference I want you to notice; it's a little subtle.
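The logical model described above maps to CQL along these lines; this is a sketch, since the workshop's exact DDL isn't in the transcript:

```sql
-- Partition key: userid. Clustering columns: creation_date (newest
-- first) and commentid, which together with the partition key make
-- each row unique.
CREATE TABLE IF NOT EXISTS comments_by_user (
    userid        uuid,
    creation_date timestamp,
    commentid     uuid,
    videoid       uuid,
    comment       text,
    PRIMARY KEY ((userid), creation_date, commentid)
) WITH CLUSTERING ORDER BY (creation_date DESC, commentid ASC);

-- comments_by_video is identical except that it partitions by videoid.
```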
Watch creation date and comment id: they get collapsed down to a single field, in this case comment id, but notice the type. Cassandra has this type, it's really cool, called timeuuid: it's essentially the consolidation of a timestamp and a UUID. Do you have to do it? No. Could I have a physical data model that looked just like the logical one, where creation date was a timestamp and comment id was a uuid? Yes, I could; it's just an optimization, but it's what I would do. So from my logical data model I move to my physical data model, I bring in my types, I look for any optimizations, there's one, and that's it: I have my tables, I can now create them. And so you can see how I can start with my conceptual data model and my application workflows, and use that to map out exactly what my queries are going to be, to the point where I have not only my queries but my tables.

All right, so let's go into the last exercise, and I'll give it back over to Rags. Rags, you're good? Yeah, I'm good. So for the purpose of saving some time, I already created these two tables, because I think David talked quite a bit about this. So when I do the describe tables command, you're able to see that the comments_by_video and comments_by_user tables are already in there, and essentially these satisfy the workflows. So let's jump into the CRUD operations; CRUD stands for create, read, update, and delete, for some of you who may not be aware of this. All I'm going to do is add the data here; again, very simple, I just cut and paste, that's the way to get started. So that's good. Now I can also do the comments_by_video, and there are the comments_by_user. Don't worry too much about the UUIDs we're using here.
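The timeuuid optimization described above, sketched in CQL: the two-column (timestamp + uuid) logical version collapses into one column. Column names and values are assumptions:

```sql
-- Physical optimization: creation_date (timestamp) and commentid
-- (uuid) collapse into a single timeuuid column, which encodes both
-- a point in time and a unique id.
CREATE TABLE IF NOT EXISTS comments_by_user (
    userid    uuid,
    commentid timeuuid,          -- timestamp + uuid in one value
    videoid   uuid,
    comment   text,
    PRIMARY KEY ((userid), commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);

-- now() generates a fresh timeuuid; toTimestamp() recovers the time
-- component later, so no separate creation_date column is needed.
INSERT INTO comments_by_user (userid, commentid, videoid, comment)
VALUES (uuid(), now(), uuid(), 'Nice explanation of partitions!');
SELECT toTimestamp(commentid) AS created_at, comment
FROM comments_by_user;
```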
In a typical application, you would generate those, but it doesn't matter for now. So the inserts are done; we populated both tables. Now, how do we read this? One of the simplest ways is: just get me all the comments from comments_by_user, get me all the entries, which is typically what you would do in a MySQL scenario, but you should try to avoid this in the NoSQL scenario, because remember, you're optimizing for access based on the partition key, and it doesn't make a whole lot of sense to pull all the partitions at the same time. There are definitely some use cases for that, but those are not as critical; you'd probably do it as a batch job or something like that. So even though I'm showing it here, keep it in the back of your mind that it's probably not a good idea. On the contrary, something like this is more relevant: a select for a particular user. And you can see here there are three comments for that user. So I'm going to skip the select star, because like I said, it's probably not a good idea, and you can also read all the data in different ways, which this is going to show you. And by the way, just something to add there, Rags: the reason why a select without a where clause is a bad idea, an anti-pattern, in Cassandra. For the amount of data we have here, a handful of rows, it's no big deal; it's a test, it's fine. But Cassandra is built to be a big data database; it can handle petabytes of data. Imagine you had a table with billions of rows, a huge amount of data: if you don't use the where clause and you say, ah, just give me everything, you now have to scan across all those nodes; you're going to cause a full scan, and that's a total anti-pattern. Whereas in the example he showed, where you use the where clause and you're specifically keying off the partition key, with that partition key you go exactly to the address of the data, so there's a big difference in performance from that standpoint. You'll find that not using the where clause with your partition key is an anti-pattern: don't do it. Exactly.

Again, what we've shown here is kind of reformatting the data, something similar to what you would typically do in a SQL query. The next step, step 3d, and I'm going a little fast here because we want to get to some of the nicer parts as well: here what I'm doing is inserting a comment, which is basically what we refer to as an update, or sometimes referred to as an upsert. As you can see, essentially what I'm doing is taking a comment id and updating its comment to something a little different: "oh my god that guy patrick is such a geek." And you will see why: if you remember how we created the table, CREATE TABLE IF NOT EXISTS comments_by_video, the primary key was based on both the video id and the comment id, so both are used to define a unique row, and we'll need both to update a record. Given the full primary key, it knows exactly where to go, and it goes and updates it, or upserts it, or whatever. I could also do it a little differently; again, probably not super interesting, but let's do it anyway: an UPDATE with SET comment. In this case I'm working with the columns, and since I don't like Patrick, I'm just going to make it "rags," and then I set the video id, because, again, remember what the primary key was.
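The two upsert forms described above, sketched in CQL with illustrative id literals (the workshop's actual values aren't in the transcript):

```sql
-- Upsert via INSERT: writing to an existing primary key overwrites
-- the comment column rather than failing.
INSERT INTO comments_by_video (videoid, commentid, comment)
VALUES (22222222-2222-2222-2222-222222222222,
        e7ae5cf3-d358-11e2-b38b-0800200c9a66,
        'oh my god that guy patrick is such a geek');

-- Equivalent UPDATE form: the WHERE clause must name the full
-- primary key (videoid AND commentid) so Cassandra can locate the row.
UPDATE comments_by_video
SET comment = 'rags'
WHERE videoid   = 22222222-2222-2222-2222-222222222222
  AND commentid = e7ae5cf3-d358-11e2-b38b-0800200c9a66;
```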
And I'm almost done; I missed the second part of the key, but no big deal, I can always add it here. So you can see that that particular row got updated. And again, even though I told you it's probably not a good idea, I'm going to do a select star from comments_by_video anyway, just for illustration purposes, and you'll see that this particular comment got changed. Let me take a breath here. The final thing is delete. Again, the idea is: remember what the key was; the key was both the video id and the comment id, as the case may be. So what I'm going to do here is delete from comments_by_video, and you can see that I'm setting the partition key there as well: I'm setting the video id and the comment id, so now it knows, boom, which entries to delete, and it's done; it's hashed it and deleted it. So now if I go back and select what's in the table, you'll see only three rows, as you would expect. That's the delete aspect of it. So to summarize what we did: we started with tables that mapped closely to our workflows; then we created those tables; then we started inserting the data, that's the create aspect; then we read it based on the partition key, because that maps closely to the workflow; then we did an update, just to show that it can be done; and finally a delete. And all of them were keyed using the partition key, which is really the one thing that I think David was trying to drive home. Any questions, or is this good? I think we're good on questions. Give us a thumbs up if you've gotten through step three, the last exercise there with Rags.
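The delete step described above, as a hedged CQL sketch with the same illustrative id literals:

```sql
-- Delete one comment: the WHERE clause names the full primary key,
-- so Cassandra hashes the partition key straight to the owning node.
DELETE FROM comments_by_video
WHERE videoid   = 22222222-2222-2222-2222-222222222222
  AND commentid = e7ae5cf3-d358-11e2-b38b-0800200c9a66;

-- Verify by reading back just that partition (not a full-table scan).
SELECT * FROM comments_by_video
WHERE videoid = 22222222-2222-2222-2222-222222222222;
```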
Let us know you did. Let's see: a bunch of people are still with us; we'll give everybody a moment, take a few more moments. Cool, awesome, great, nice job Ottoman, fantastic, I see a lot of thumbs up. If you're stuck, let us know as well, don't hesitate. Lots of thumbs up coming in; great, Rigo, I see, awesome. All right, so we're getting closer to time. We're going to go a tad bit over, but not a huge amount at all, maybe five minutes or so, because we're going to get to a quick swag quiz next. I almost couldn't say that. And by the way, we're just scratching the surface; this workshop is obviously an intro, to get the idea of CRUD commands, because that's what we need with every single database; any development needs those basic CRUD commands, so it gives you an idea. The homework and some of the exercises will get deeper, though, and with some of the other resources, like the Academy courses, you can get into extreme detail on everything we've talked about today if you want. I'm happy to tell you what those are, and again, that's all free. All right, Saksham asks a question, almost on cue: how is the homework different from what we did? So, there are some extra scenarios in the homework. This is part of the homework, which you're doing right now, so make sure to take a picture; at the very end you'll see it in the homework, I'll show you. But then there's an extra scenario, actually an extra data modeling one, that you'll run through that's different from this. Oh, no worries, Ravi, I can't always find the emojis either.

All right, so let's go ahead and get into our Menti quiz, because people like to win swag. That's always fun, I think it's fun. All right, let's do this; come on over, we're going to go to Menti. Remember I said keep Menti open? This is why. Give us a thumbs up, let us know you're here. I'm going to leave it here for a second and let everybody get settled into Menti; again, the code is there at the top, and if you need it, I'll give you this back so you can scan that QR code. Actually, Casstheory asks whether Cassandra can support documents: as a matter of fact, with Stargate and Astra it can. In Astra, if you click on your database and go to the Connect tab, you'll see there's a Document API in there; it does have support, you can store JSON objects natively in Cassandra, but you do that through Stargate, and that's already implemented in Astra, is what it comes down to. All right, let's see; okay, we only have five questions. Yes, and by the way, this is one of those where speed matters, so when we're asking the questions, watch your phone or wherever you're running the Menti; if you have a lag in your video, you're going to lose out, because you're going to be behind, so make sure you watch your device. Okay, nice. Alrighty, let's get into our swag quiz. I'll give everybody a moment to get their names in, because it'll say it's waiting; it'll generate a name for you, but if you want to put in a fun name, totally up to you. I just captured the exercise video; I'm trying to think of what video that is. Yeah, Casstheory, I agree: Stargate, stargate.io, is another open source project. It exposes REST, GraphQL, and Document APIs, Kafka is coming, gRPC is on the way as well, all to Cassandra. From a developer standpoint, it makes it awesome. Alrighty, cool. All right, so let's go ahead and get into our quiz; again, watch your phones or wherever you're running the Menti. What is Astra? An easy one to start with.
All right, the options: it's a development tool; it's a local in-memory version of Cassandra; it is Cassandra as a service in the cloud; or it is a gateway to the stars. Oh, no worries, Emol. And let's see: correct, the vast majority of you got that right, great job. It is Cassandra as a service in the cloud, wonderful. It's not just a development tool, even though it's really great to use for development, greenfield projects, and prototyping, because you can just spin up your database for free; and if you're building something you're hoping will go viral, that you're going to build a business on, it'll scale with you, but you can totally start for free. And it's definitely not just a local in-memory version; it is a fully-fledged Cassandra cluster that you're getting there. I don't even know if there are in-memory versions of Cassandra. Let's see; by the way, remember, don't answer in the chat, answer in the Menti. Oh wow, we have ties for speed: Keith and Var in first and second place, and Sham there in third. By the way, this is anybody's game, so even if you get a little behind, keep at it. All right, let's see what happens on the next one. Is it a little embarrassing if I'm in like 20th place? It's happened to me too. So, what type is used to store both an id and a timestamp: a timeuuid, a timestampid, an allthingsinoneid, or a timeid? Yeah, Rags, some people are really fast; I've had that happen too. All right, let's see what happens. The correct answer is a timeuuid, exactly; we were being a little tricky with timestampid, and allthingsinoneid would probably be a really fun name, but that's not what it is. All right, let's see what happened to our leaderboard; speed always comes in here, so you'll see everything change. Keith maintains the fastest again, Charlotte takes second, Major in third; Sham, you're right there; like I said, this can be anyone's game. Question three. Jeez, I'm doing worse than before; I'd better pick up my speed. What is the primary key: the same as the partition key; a column in the partition key; it uniquely identifies a row; or it's the main housekeeper? This one's a little tricky too, so make sure you really read those. It uniquely identifies a row: the primary key uniquely identifies a row. Even though, if you had just a partition key as your primary key, you could argue it's the same as the partition key, that's not the only case; it uniquely identifies the row, that's the key thing there. All right, let's see what our leaderboard looks like: Sherroth takes the lead, Akif is there in second place, Var with one of the fastest times, and I see Den and Var in third place. All right, let's go on to the next question. I see Saksham and Manju right there; this is still anybody's game, since you get a thousand points per question, so it could change real fast. All right, what is the partition key: an optional table field for optional partitioning; a designated field in your table structure to partition your data; a consecutive number applied to each new record; or the key to partitioning this reality from the next? Let's see what happens. All right, the correct answer: a designated field in your table structure to partition your data. A partition key is never optional; you always have to have at least one partition key in your table. If you have multiple, in a composite, yes, that part is optional, but you always have to have at least one. The suspense is building. All right, let's see our leaderboard: oh, Akif maintains the fastest speed, Sham takes second, Den jumps up to third, Var is right there, Manju, you guys are right there; this is a tight game. All right, last one, for all the marbles: in the data modeling framework, we start modeling with: copy-paste from Stack Overflow; with the physical data model; with the logical data model; or with the conceptual data model and application workflow? We just talked about this; let's see what happens. Of course, there's always copy-paste from Stack Overflow, let's be honest. All right, the correct answer here is: with the conceptual data model and the application workflow. Those who answered with the logical data model were just a step off: you always start with the conceptual data model and the workflow, and you use that to generate your logical data model. All right, let's see who won; my brain just decided not to do that. Well, let's see: oh, Var edges out, so Akif takes the first position, great job; Sham maintains second; Var jumps up and pops into third. Now, if you're one of our top three winners, take a screenshot right now, take a screenshot of your phone, and you're going to send that in. Normally you'd send that to Jack, but poor Jack is out sick right now, so I'm going to give you my email address; just send it to me. There you go, I just popped my email in the chat; let me get it across. Oh hey, Stefano, you mind changing that to mine? Yeah, because if you send it to Jack right now there's going to be a little bit of a lag, just because poor Jack is sick; if you send it to Jack he will eventually get it, don't worry, but send it to me right now, david.jilari datastax.com, and we'll make sure that you get your swag. And as we wind down, we just have a couple more things to close off with, but I'm going to leave this up here for your feedback, good or bad, whatever; if you want things for us to improve, put it here, and then I'm just going to go ahead and finish up with a couple of things. Again, if you're first, second, or third, make sure to take a screenshot if you didn't already and send it off to us. All right, so, the homework. The homework is in the repo: at the very top of the repo there's a homework section, and I guess I'll show it here real fast so you can see this. Okay, so if I go here to our repo: homework, right here at the very top; hopefully you can see this.
That'll give you all the steps. You've all been here with us, so you want to complete the stuff here; if you did this with us, you've done this part. Here's the extra part of the homework: there are these two short courses; this links you to a site called Katacoda, it's actually embedded, and it's all free, we'll never charge for anything like that. Finish those two and just take a screenshot of the completion, that's it. Then use this link to submit the homework: if you click it, it will open a GitHub issue right here; there's a template for you, just add the images there, and that's all you need to do. And there's even more, by the way, if you really want to get into it. Notice this first link especially, the datastax.com/learn one: there are Cassandra fundamentals on both of these, and in this data modeling one there are actually real use case data models. The Cassandra fundamentals piece is a whole course; it gets into the details of everything we talked about, so if you want to geek out on that stuff, please go take a look. Also, there is DataStax Academy, academy.datastax.com; there are literally weeks' worth of material in there, and again, all this stuff is free, so you can get in there and get your Cassandra on that way.

Now, this is one way to get your certification voucher. As a thank you for coming and being here with us today, these are ways to get two free attempts; they're 145 bucks apiece normally, two free attempts at your Cassandra or your K8ssandra certification. There are three different paths: developer and administrator, for Cassandra and K8ssandra. And what's really cool is that you have two vouchers, so if you use one for one exam and pass it, you can use the other for one of the other exams, totally free; or, say you don't pass the first time, you get a second free attempt. So you can either go to the link there, dtsx.io slash workshop voucher, or scan the QR code, and we'll drop them in the chats as well, to get your voucher. Again, there are multiple paths; these are all in Academy, where you can take the courses and go through all the material, and these will prepare you for all the certifications. Everything you need to pass your certifications, if you choose to do that, is in academy.datastax.com.

Okay, so we have some other workshops coming up. Next Monday we have a new one; this is actually going to be really cool, I'm kind of excited about this one: using Apache Cassandra data with GPUs, and we're going to be partnering with NVIDIA, so this is totally rad right here, definitely something worth checking out. This is also on our YouTube channel, and it's in our Eventbrite, so if you registered with Eventbrite you should see it there, and if you go to the YouTube channel you'll also see it there. Do tell your friends and family about this, because GPUs have really captured the imagination, especially with AI and machine learning. Absolutely, look forward to that as well. And then we're going to do our introduction to NoSQL databases: where this one focused on Cassandra, we're going to widen the berth and talk more generally about NoSQL databases, so if you want a primer on that, come see us next week at the workshops. Also, we have a hackathon going on: again, scan that QR code or go to buildamoderndataapp.com. There are some pretty big prizes with this one, so go take a look; all the information you need is there at the site. If you want to get involved in the hackathon, by the way, I don't care how small you think the app is, get it out there; there's a chance to win some pretty big prizes, so go take a look at that. Last one here: let's see, what is today? We're not quite to the 15th, so you actually have three more days left for this one, and it's really cool: it's a chance to win one of three $500 gift cards. By the way, if you're in a different part of the world that maybe doesn't have Amazon gift cards and you want to use, say, Alibaba or something, I believe we can do that as well. What you do is this lottery: there are some things in there, you post on social media, you do some homework, earn a badge, things like that, and you get points; then you'll be entered in the lottery and randomly chosen to win one of these gift cards. So if you want free money, go take a look; and Andrew, you've got a couple more days for this one. All right, and like I said, we have weekly workshops; that sounds funny, because every week we have multiple workshops going on. Subscribe to our YouTube channel to see what's coming every week, and we hope to see more of you. If you want to continue the conversation: once the stream ends today, the YouTube chat is going to stop, but we're always there on Discord; we have a team across the world that chats with you, and we have like 15,000 people in the community right now, so come join our Discord channel; you can see the link, dtsx.io/discord. By the way, all these links are in the video and such, so if you missed them, I believe the team is dropping them; yes, I see stuff coming through, thank you everybody. And with that, subscribe to the channel if you want to hear more from us, ring the bell so you'll get notifications, and we have a ton of content always coming out. And with that, thank you everybody, thank you for being
with us today, we really appreciate it, and we'll see you later. Thanks, Rags. Yeah, and just one thing; I don't know if you mentioned that Katacoda is a great way of learning this stuff, especially since you can cut and paste, and you don't even have to do that, because you can just hit a link and you're there. The whole idea is to up-level your skills, and no matter where you are, you can up-level, so take advantage of the opportunity. And thank you, everybody; I know we're a little bit late, and that's probably because of me, but thanks for coming, and please join us again. Absolutely, and thank you everybody so much; thanks for joining us, and we hope to see you again in a future workshop. With that, take care, bye. And as always, don't forget to click that subscribe button and ring that bell to get notifications for all of our future upcoming workshops. Imagine a being gifted with powers from the goddess Cassandra, who grew those powers until she could multiply them, move with limitless speed, and unmask hidden knowledge. With those powers she was able to fully understand the connectedness of the world, and what she saw was a world in need of understanding. From that day forward she sought to bestow her powers on all who came into contact with her, empowering them to achieve wonder.
Info
Channel: DataStax Developers
Views: 1,763
Rating: 5 out of 5
Keywords:
Id: 1494eJLRKiU
Channel Id: undefined
Length: 134min 4sec (8044 seconds)
Published: Thu Aug 12 2021