Introduction to Apache Cassandra™

Captions
living in the future, and welcome everybody to this Thursday session workshop of Intro to Cassandra for Developers. I'm David Jones-Gilardi, and I am joined today by the amazingly intelligent Bettina. Hello Bettina! Hello everybody, thanks very much for joining us, it's going to be awesome, looking forward to teaching you all about the intro to Cassandra. Exactly, and since Bettina is the smarter one here out of the two of us, I'm essentially going to put all the hard questions to her and let her take all the hard questions, and I'm just going to punt on all of them. I just want you to know that. Yeah, that's good, thank you. Well, that's great, and we have lots of other people in the background helping with questions, so that's true, and speaking of that, go ahead and introduce them. For those who may have joined us before, you should recognize some of these faces with their lightsabers. If you haven't: other than myself and Bettina, who are going to be bringing you through the materials, answering questions, and having fun interaction, we have a whole team of us behind the scenes in the Discord chat and in the YouTube chat. So ask away, that is part of why we are here, to have that interaction with you, and don't hold back your questions. And if you have really complex questions you can always ask them on community.datastax.com, where we assist you every day, not just now but every day. That's right, and as a matter of fact, if someone from the team would drop the Discord link in the YouTube chat, that would be wonderful. That way, as Bettina just said, if you ever want to follow up with us later, you can always find us on LinkedIn or something like that, we're all out there, but the Discord chat is a really good place to do that. And speaking of that, those are the very tools, so let's just lay out the land of what we're going to do today. First thing: this is an Intro to Cassandra for Developers, so we've really focused on the developer experience. Up on the top left-hand side you should see, since you're probably watching us on either a YouTube stream or a Twitch stream (either one of those is exactly the same), that we will be using DataStax Astra today. We're going to explain a little bit more what that is; it's essentially Cassandra as a service up in the cloud, so it makes it really easy for you to spin up a Cassandra database in a couple of minutes. As we just mentioned, for any questions or anything like that we're in Discord and in the YouTube chat. We do prefer Discord; there's more we can do there with your responses and chats and kind of longer-form responses, but either one is fine. And then finally we're going to use menti.com to do a survey and take a quiz, and later on, if you are one of the top three winners of that quiz, then you get some swag. All right, so with that, go ahead and get your Menti ready. Yeah, get your phone ready for the Menti, Menti's fun. Also, for a follow-up on learning, and we'll talk a little bit more about this at the very end, we do have community.datastax.com.
Sometimes some of the questions you might have might be, say, longer form, or they take a lot more thought and time to respond to, or they're questions that are just really good for other people in the community to see. That's what community.datastax.com is for, so you may see us at times, depending on the question you ask, say hey, go ask this up on community. That is really the place for those kinds of longer-form questions. Plus we have people from all around the world in the Cassandra community, including the very people who commit the code for Cassandra, who hang out on community, and you might actually get one of them answering your questions. I should also mention academy.datastax.com. What we're going to do today is just going to scratch the surface; this is definitely beginner-level material if you're new to Cassandra, or even if you've been in Cassandra a little bit but you really want to see, well, how do I just get going? How do I get going to develop my apps with Cassandra? That's what we're starting with here today, but there's so much more that we're not going to cover, and those types of things you can follow up with at academy.datastax.com. Again, it's all free; there are video courses, weeks' worth of material honestly, if you really want to dig into Cassandra or other subjects more than what we do now. Okay, so Astra: like I mentioned, DataStax Astra is essentially managed Cassandra without the ops. This is a really nice way, if you want to just spin up a Cassandra database, to not have to worry about managing it or maintaining it or anything; it is fully managed for you, and this is what we're going to use today. This is also a really great tool for those who maybe have lower-resource machines; maybe you don't have a laptop with enough RAM or CPU to run an actual database or a full Cassandra cluster. This is the answer: you can just run it up in the cloud, and again, this is the system that we're going to use today. And in case you haven't mentioned it, David, this is a hands-on workshop as well. Yes, we will give you time to do exercises, and we will give you time to register with Astra and get online and use Cassandra as a service in the cloud, and you do not need a credit card or anything to sign up. There's a free tier, and you can take full advantage of it. That's right, and that free tier is free indefinitely; if you create a database in the free tier, it is free forever. Not only that, especially when you're developing applications and you just want to experiment, you can delete and create and delete and create and do whatever you want with the free tier. You can even have, and I'll talk about what a keyspace is later, multiple keyspaces and everything, so it's a really nice tool when you want to spin up a database in the free tier and kind of experiment. Okay, so with that, first thing, let's go to Menti. We're going to ask you some questions here in Menti, and you can do it one of two ways: either go to menti.com and enter the code there, eight four two two eight three zero, or on most phones, if you just turn on your camera and scan that QR code, it'll bring you right to it. So I'll give you a moment to go ahead and get into Menti. And there we go, I'll do it too, I never come first. All right, and I'm going to drop the Menti link in the Discord as well. Boom, there it is
all right let me switch over to that other screen and if you would once you get in there give me a thumbs up you see that thumbs up down there in the corner let us know that you are there in mentee and again we're going to use this not just for the questions we're going to ask you right now but then later on today keep this open because then we're going to do the fun quiz and like i said if you end up in the top 3 you end up with some swag all right we'll give a moment for folks to get in and i don't see anybody saying like hey put that back up i don't have the code or anything hello spoon miles how you doing you see spoon mouse and youtube that's a fun name by the way i like that you know yeah use that for your mentee so we know who you are and i'll totally call you out yeah all righty okay so let's go ahead and these this first set are going to be a set of just survey questions we're just we're looking to find out a little bit more about you um and and this is this is this is information that we can use not only to kind of tailor what we're doing today um but to also evolve our workshops in the future let's see the first one is how much experience do you have with apache cassandra hey there cashorn wow there's a lot of people who's never used it that's amazing so welcome that's perfect that's right and again you know as we mentioned in the beginning you know this particular workshop is designed for folks that are coming into cassandra brand new for those of you who you know have like one to three years i see we have a couple you know veterans or we have two more than five years right oh wait there it goes yeah you better answer questions yeah you better ask the fun questions right exactly um you know we are gonna we are gonna definitely look into data modeling and and getting into like say creating tables crud operations things like that if you have a lot of experience in cassandra and you may have done some of those things some of this might be you know a little basic um but then we kind of rely on you as bettina just said to ask some of the fun questions so all right good to see this thank you all right have you ever been to any one of our workshops that's so interesting to see let's see we have some regulars that's awesome oh and pure i see your comment in youtube you're saying i'm just installing meant meter on android you can just go to menti.com in the web you could use the web app you don't have to use the app app if you don't want to totally up to you but now that i know you're doing that i'll try to slow down just a little bit all right so we have some we have a good amount of folks this time who have actually come back to uh to see us hello and uh for those who are brand new to one of our workshops welcome it's really nice to see you here uh we always enjoy your feedback and your thoughts so don't be shy in the chat this one here what is your main motivation for learning about cassandra watching how these answers come in that's really interesting and i always find this interesting for some reason this pie it it always seems to kind of make this split with those who are interested in both dev and admin and those who are interested in developing uh developing for cassandra now something to be very clear about we're not really gonna going to get into any admin like topics today you could always ask questions though feel free to ask your questions and if we have time and depending on where things are we're happy to answer them but this just so you know i want to i want to ensure 
that we set that expectation for the admin folks by the way for those admin folks we do have a ton of content up on data stacks academy and such if you really want to get into like the operations parts of the database all right every time i move my mouse i lose my clicker there we go okay and then how did you hear about this workshop i do realize i i forgot to update you you should have updated it yeah i forgot to if yeah i didn't put in like instagram and other things facebook oops oh and i see pierre says just to play with it to understand the nosql world yeah well then good because then something like astro with the free tier is really good for that uh because then you can play with it all you want right you don't have to worry about it it doesn't cost you anything oh where's my hat i missed the question oh i'm so sorry please hold please we'll take care of this there we go is that better did i get the hat back out that's david that's david as he should be there we go oh that's funny okay so it looks like most people are coming through a vent right throw through a friend oh that's really interesting that's really you know for those of you who answered through a friend i would love to know more about those stories um so if you feel so inclined and you want to say something about that come reach you know come reach out to me on linkedin or through a discord just dm me or something like that i'm very curious in that all right excellent thank you so much and last one for the survey what do you want to do with your new knowledge found it on facebook don't have instagram you know this is another one of those bettina that seem to have a very similar spread to other ones that we've done um where we see this kind of focus on to get a better understanding build an app with cassandra get a better job yeah yeah that's true and we'll see well it's interesting all righty excellent thank you so much for answering that for now keep your mentee open we're going to come back to this uh when we do the quiz and if you keep it open then you'll be all ready to go okay wonderful thank you everybody for your participation there now what we're going to do first thing we're going to jump right in to the first exercise um and this is going we're going to actually go to this repo here let me drop these links dropping the links so this repo um is the github repo you see this intro to cassandra for developers uh again by the way this this stuff is here for you this is open you can submit a pull request if you ever find anything that you think is wrong or if you have something you want to add or you can fork it do whatever you want with it totally up to you what we're going to do though actually let me let me add at this point because we get this question all the time this is also going to stay right this is not yes here for the workshop so both the session is recorded you can later see it on the youtube channel and review it and the repo stays up so you can totally do these exercises in your own leisure right so it's um it stays yeah thank you for pointing that out yeah these are meant to be self-service it's meant to be for you to do you know on your own time but we're gonna go through this obviously with you now okay so here's what we're gonna do as part of this first exercise our goal right now is to create your astra instance so if you've not used astra before you go to the link that we dropped there in the chats just go to this first section create your astra instance and follow the instructions and what i'm 
going to do is give you a few minutes to go ahead and do this, but real quick I just want to set you up with a primer of what you should expect. This should only take a couple of minutes. You'll go through the sign-up; the sign-up is free, there's no credit card or anything like that, we're not going to pull that one on you. When you go to create your database, if you've not used Astra before, you're going to create your first database. For today we ask that you please follow the instructions that we have here with these values, meaning the database name being killrvideocluster, the keyspace name killrvideo, and so on and so forth. It makes it much easier for us as a group if everybody's using the same values, because that way, if anyone does run into any kind of hiccup, we know what kind of username and password you've used and such. For your own databases, if you go to do this yourself, you can use whatever you want; it just so happens that for what we're doing today we do ask you to use the ones we're using here. Also, you will be given a choice in the free tier of what region you want to use, so if you are over in, say, the EMEA or APAC region it's probably better to use Europe West, or if you're in the Americas then use US East. Once you've filled out that information and you launch your database, you're going to get a screen that looks like this: you're going to see this green bar at the top, a little spinner is going to say hey, give it a moment, and it's going to create your database. This could take anywhere from a minute to maybe ten minutes, it depends. So I'm going to give all of you some time, I'll take five minutes real quick, go ahead and do that, and give us a thumbs up in the chat when you're done, so we see whether you find it hard or whether it's an easy exercise or whether anything hangs. Let us know, exactly. Definitely give us a thumbs up once you get to the green bar and your database is starting to launch. [Music] I love it, Eric is always bombing all the links; it works though, I can see them. Good, I see some thumbs up already, wonderful, thank you for doing that. Yeah, a lot of times it only takes a minute, it's not that bad. Okay, good, I'm seeing lots of thumbs up. [Music] Spoonmouth says it took me more time to find this than setting up the DB; that's great, exactly. All right, I'm seeing lots of thumbs up. Oh, you like the music, Laura? Great, I appreciate that you appreciate that, because I curate a list of nice chill electronic music that has a nice little vibe to it, so I'm hoping that works for everybody. All right, we've just got a couple of minutes; I see some more thumbs up coming in. By the way, for all those of you who are giving those thumbs up in the YouTube channel, you can say in the YouTube chat all you want, but I definitely implore you to go to Discord, because the really nice thing about that is if you ever have any follow-up, or if you want to be able to talk to us later, you can hit us up on Discord at any time. It's a nice place if you want to keep that conversation going. Oh yeah, Jeff, so if you had a previous free tier... you know something, I'm glad you pointed that out. So Jeff Schmidt over on YouTube asked how to
terminate the previous free tier DB first. True, if you had a previously existing database in the free tier, you're only allowed one in the free tier. However, let me show you something in Astra today, and I guess I could have mentioned this, but it's okay: if you go to Actions, you should see this Add Keyspace. It used to be that, even though you have the single database, you could only have one keyspace; now you can add more keyspaces. That's really nice, especially if you're doing something where you're experimenting and you want to try something else; rather than having to drop the whole thing or keep adding tables into that one keyspace, you can just add one in. So it's a nice little feature that makes this so much easier. Okay, and also, we say to use the password and username that we suggested, but those exercises really also work with your own password and username, etc.; it's just that you might find it a bit harder with the copy and pasting. So in the future you probably don't need to delete your existing database, you can just continue working with it. Thanks for doing it, hope it's up and running again. [Music] Oh, so Pevan is asking in YouTube why it is asking for a database name as well as a keyspace. Because, and we're going to talk about this in a moment actually, the database is the actual Cassandra database that you're working with; within that database we're creating a keyspace that's going to hold our tables. A keyspace in Cassandra is equivalent to a database or a schema in a relational database. So you have the actual database itself, the thing, and then the keyspace is the object we need to contain our tables; that's why you have both. All right, I'm going to go ahead and stop this timer. Yeah, that's good. I saw a good amount of thumbs up coming in and everything, so I feel like we're in a good spot. All right, wonderful. So with that, if any of you are still at a point where that green bar is sitting there, it might just take a couple of minutes, it's no big deal; we're going to continue on, and by the time we come back you should be good to go. Okay, so what we're going to cover today: we're going to first start off with a little discussion on tables and partitions and break down what these things are. Then we're going to get into the art of data modeling in Cassandra, especially for those of you, or those of us, who have come from the relational world; we're used to doing things a certain way, and this is really a discussion on the slight switch you need to make when you're thinking in terms of doing things the Cassandra way, and why. And then we'll wrap up with some other resources and materials that we can send all of you to. Okay, so to start... oh, by the way, I forgot, I didn't turn the music back down, let me turn that back down real quick, didn't want it to be too loud. There we go, all right, come back here. So to start off, the first thing we want to know about is what's called a cell. Most of us at this point have probably worked with Excel spreadsheets or a tabular format or something like that, so we're familiar with what a cell is: it's the intersection of a row and a column. I've got some row and I've got some column, and it's that intersection that's going to be an individual cell, and the cell is the most basic unit of where we're going to store data. Then, at the row level, a row is going to be a set of
related cells right so in this row i've got four cells that comprise a single single row object with each of my individual cells again related then we have partitions right so this is something that in cassandra is something we're going to really focus on a little bit more are these partitions so the base unit of access the base unit that you're going to store like a block of something in cassandra is called a partition so if you notice here i have uh on the right hand side i have department right let's say department is my partition key i'm just going to call it that for now we'll talk about what that is later and in this case all of those departments are wizardry right so that's all the same partition so partition is really going to be a set of rows that are contained in a partition and that's what we're seeing here we see that we have three rows right and those are all being stored in the wizardry department so that's something to keep in mind the way that we're handling the partitioning there and then a table is now going to be a group of those partitions and the rows so notice here on the right-hand side department again our partition key that i don't only have wizardry but i also have dark magic endeavor right those are three separate partitions because that's my partition key and each one of those could have one or more rows right and then my table is going to contain those partitions so if we kind of step back a little bit cell is the very very basic kind of unit you know that's the intersection of a row and column then rows are comprised of related cells partitions are going to be comprised of essentially related rows right because they're going to be related to the partition that they're stored in and then a table is going to contain our partitions now this is a question that came up a moment ago right um you know somebody was asking about why do we have a database in the key space and this is what i was referring to so notice here that we have this overall structure we talked about a table a second ago well tables need to be stored somewhere and that's at the key space level so the top top level is that key space level right so i'm gonna have some key space within that key space i'm gonna have n number of tables and then within those tables i will contain all of my partitions my partition keys rows and all of that so let's look at a concrete example i like concrete examples here we go so we see i have a killer video key space that's just the name of the key space that i'm going to use for my app then in this case i'm using table users by city now there's a convention here that when i say users by city i'm saying that i'm going to partition users by the city right so if you take a look at the example we have in the partitions you see city is now going to be my partition key i'm partitioning by city and the values are phoenix and seattle so in this table i have two partitions one for phoenix one for seattle and in each case you have a set of supporting rows bettina do you have something you want to add before i move on and now that's all good to me so okay i thought i heard a bettina i thought i heard the sound of bettina wanting to say something so let's see so by the way it makes sense right um i was just looking at some of the questions here [Music] yes okay so eric got that one great okay so again i have these two partitions in this particular table in users by city now let's take a look at something else that's going on here if you look down at the bottom there you're going 
to see a partition key column. So I talked about how you have partitions in a table, but then in any table you have to identify what column or columns are going to be your actual partition key. In this case we said our partition key is going to be city, and those values, Phoenix and Seattle, are the actual partition keys. Then if you move to the last name and first name, notice the little arrows at the bottom: it says clustering columns. Clustering columns are used for ordering or for uniqueness. So notice what happens here, and this is actually something really neat about the way Cassandra works: when you're ordering something, we're paying the cost on the write, and Cassandra is optimized for fast reads later on, so when you go to read, we don't pay the cost of the ordering. What this means is that my data is actually going to be stored in that ordered format. Take last name: notice that the last names are already naturally ordered alphabetically. This happens naturally if you use a clustering column, so if you do have something like a name, something alphanumeric, something that is time-based or whatever, and you want a natural order to it, that's a good time to use a clustering column. As I also mentioned, clustering columns are used for uniqueness; we're going to see examples of this a little bit later. Also notice that we have two clustering columns in this case, and the clustering column order goes from one to the next: last name is my first order, I'm going to order by last name, then first name, and then anything else I might have as a clustering column. Anything outside my clustering columns and my partition key are just my data columns; those are the payload. Usually this is the information we're trying to get out of the table, and we're going to search by something in my partition key and my clustering columns. Okay, so again: we have our partitions, those hold the rows; within our partitions we're going to have our partition key, zero or more clustering columns, and then our data columns. There we go. And then, simply enough, how do we translate what we just saw into actually creating a table? If you're familiar with SQL and you've done things with relational databases, this should look very familiar. For those of you who might be curious, CQL is a subset of SQL; the syntax is not exactly the same, but it's darn close, and if you've been doing things in SQL, CQL is going to feel very familiar to you. So if we want to create a table in CQL, you'll see there we have a CREATE TABLE command. If you're doing things in best-practice mode, fully qualifying is always good, so you notice I have the keyspace, killrvideo, dot the table name. Now, funny enough, I say it's the best practice, but you'll notice that we did not do that in the actual GitHub repo instructions, and there's a reason for that: I was kind of forcing the use of something we're going to do here in a moment. But generally speaking it's a good practice to fully qualify; that way, when you come back to things, people don't have to guess which keyspace you're in, they can just look at it and see. So we're going to say CREATE TABLE, the name of our keyspace dot the table, and then we will set whatever columns we want and their types; we have a bunch of text fields in this case.
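For reference, here is a minimal sketch of the two ways of referencing a table being described here, assuming the killrvideo keyspace used in the workshop (the exact names are in the repo instructions):

-- Option 1: set the keyspace context once, then use unqualified table names
USE killrvideo;
SELECT * FROM users_by_city;

-- Option 2: fully qualify the table name (the stated best practice), no USE needed
SELECT * FROM killrvideo.users_by_city;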
Now, the key thing that is really important in Cassandra is that primary key. The primary key's job is to essentially denote uniqueness for an individual row, and in Cassandra the partition key is always required: you cannot create a table without at least one column in your partition key, that's the only thing that's required, and that's city in this case. The first value in your primary key is always your partition key, no matter what, and anything that comes after your partition key is a clustering column. Notice the parens: those are optional in the case of a single-column partition key, but it's a good practice to have them, and if you ever use a composite partition key, where you have multiple columns in your partition key, you're going to need them. Generally speaking it's just good to have them there; that way you know the thing in parens is your partition key and anything else is a clustering column. So what I'm saying in this case... oh, go ahead, Bettina. Yeah, so I think if you look at this primary key and think that it should express uniqueness, you see that you really want to have the email in there, because this will make it unique; otherwise every David Smith in the same city would be the same record. Yes, and there were a few questions yesterday about why not only use email to express uniqueness there, and it's just that what you want to get out of your primary key kind of defines what you want to get out of the database in one read, in one swoop. I want to get the city and the last name and the first name and the email, so that's what I want to get out. Yes, and we have some examples, too, of what Bettina was just talking about: examples of where not using email might trip you up. We're not saying you have to use email in an absolute sense; we're saying in this particular case, if you didn't use something that gave uniqueness, that can trip you up, and it's something you should know about, just to be aware of. Okay, so let's look at some good examples and some bad examples of your primary key. Again, your primary key's job is to ensure uniqueness for a row, and it can define sorting as well; there are actually some really nice sorting optimizations you get in Cassandra when you use your clustering columns. So if we take the first example, the one we've been talking about, you see city is our partition key, so it means we're going to partition by city, just like the example with Seattle and Phoenix, and then we're going to order by last name, first name, and email. Now, email is really there for uniqueness, and like I said before, clustering columns can be used for ordering and uniqueness. So if you think about it, and this goes into Bettina's point, if we took off email and we only had city, last name, and first name, how many cities do you think are large enough that maybe they have multiple Adam Smiths or something like that? What would happen in that case? Well, if I had one Adam Smith in my city, and I didn't have email in there, and then I inserted another Adam Smith with nothing else to make it unique, I would essentially overwrite the first Adam Smith. That's where you want to ensure that you have uniqueness in a row, and that's why that's a good example. So otherwise you would have to have a city where all the Adam Smiths live at the same place, yeah, exactly.
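The users-by-city table being described looks roughly like this in CQL; the exact statement is in the workshop repo, and the column set here is assumed from the slide:

CREATE TABLE IF NOT EXISTS killrvideo.users_by_city (
    city       text,   -- partition key: rows are grouped and located by city
    last_name  text,   -- clustering column: first sort order
    first_name text,   -- clustering column: second sort order
    email      text,   -- clustering column: mainly there for uniqueness
    address    text,   -- regular data column: the payload
    PRIMARY KEY ((city), last_name, first_name, email)
);

The inner parentheses mark the partition key; everything after them in the PRIMARY KEY is a clustering column, in the order the rows will be sorted within the partition.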
And then, in the second example here, with user ID: well, user ID by its nature is going to be unique per user, so that's actually a good example. And as we just explained, in the bad example on the bottom, notice we took off email; this is one where we have a high chance of having collisions, where you're going to have the same city, same last name, and same first name for different people. Again, that's why email comes in there, to add in that uniqueness. All right, I'm just checking, do we have any questions that I need to worry about at this moment? I see Eric is all over it; I think Eric and Cedrick are answering them, wonderful, okay, so I'll move on. Okay, so then, looking at the partition key and what we want to do there: our partition key is there to partition our rows. So again, in the good examples case, we see the first one with a user ID; that's going to be a nice natural way, because those IDs are going to be unique per user, so I know it's going to be partitioned pretty well. In the second case here, what we're doing is we're saying that we want a video ID along with a comment ID, so a unique video with a set of its comments. What's really cool about this kind of case is that the comment ID is actually being used here as a clustering column; that means for any particular video, for that one ID, I can actually store all of its comments, so I can perform one read and pull that whole partition out with all the comments for that particular video. Now, if you look at the bad example on the bottom, or why we're saying it's bad, with sensor ID and a log timestamp: this is something that you could run into in an IoT situation, where I'm recording some information from a sensor, say every 10 seconds or something like that. Now, on its own that doesn't sound so bad, and there are actually some good use cases for that, but in this case we are not limiting that data. I'll talk more about that later, but what I mean by this is: if I have some sensors recording data every 10 seconds or something, and I don't ever put any constraint on what I'm doing, those partitions are just going to grow and grow and grow. So that's why we're saying this is a bad example; we'll talk more about this in a couple of slides. There was a question about whether we could have used address as a clustering key, and you could, in order to create uniqueness perhaps, but I think email would perhaps be better there, because in case you have a father and son of the same name, for example, living in the same house, you would still have some overlap. That's right, and it depends on your use case, on the kind of data that you want to store, which informs which ones are the clustering keys. And to that point, Bettina, in my actual personal case: my name is David Michael Gilardi, and my father's name is David Michael Gilardi; we don't live together now, but we did at one time, and in that case, if I used the address, we'd have the same address with the same name. So to that point, again, it's what Bettina's saying, it depends on your case, but if you really do want to ensure uniqueness for a row, that's where email comes in. Or if you're a government entity, for example, storing this data, you probably would have a social security ID or something like this, just to make sure that you're definitely unique; yes, that would be another one. Anyway, again, to be clear, we're not saying it has to be email or that this is the only way; we're saying this is an example of adding uniqueness, and a social security number is a wonderful example too.
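As a hedged sketch of how you might bound a sensor partition like the bad example above (this is illustrative only, not part of the workshop exercises; the table and column names here are made up), you can combine a time bucket in the partition key with a default TTL so old rows expire:

CREATE TABLE IF NOT EXISTS killrvideo.readings_by_sensor_month (
    sensor_id    uuid,
    month_bucket text,       -- e.g. '2021-02': caps how much any one partition can grow
    logged_at    timestamp,
    reading      double,
    PRIMARY KEY ((sensor_id, month_bucket), logged_at)
) WITH CLUSTERING ORDER BY (logged_at DESC)
  AND default_time_to_live = 604800;  -- rows automatically expire after 7 days (value in seconds)

A TTL can also be set per write instead, with INSERT ... USING TTL 604800.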
Okay, all right, so, going on to clustering columns. In the first example here, we've been talking about this one, with last name, first name versus last name, first name, email, and I think we've already explained it: in the first case, with only last name and first name, it's simply not unique, that's all, so we can have collisions, whereas in the second case, adding the email in gives it that uniqueness, and now we're good. Now, if we take a look at the bottom here, you'll notice in the first case our comment ID in this example is just a UUID, it's just an ID, so it's not really going to sort; if you're familiar with UUIDs and what those look like, they don't really sort very well. In the bottom case, what we're doing is adding a created-at field, which is time-based. If you think about it from a comment standpoint, if you're going to view comments, like in the YouTube chat for example, it would be really odd if the comments came up at random times; you want them sorted by time, and that's what we're doing here. So this is really just an example of saying that clustering columns, yes, you can use them for uniqueness, but if you want your data to be sorted, then you can add in a time component like this with created_at. Okay, so there are some basic rules when thinking about how to partition, and you see them in the bullets. First one there: store together what you retrieve together. One of the big differences that you're going to find in Apache Cassandra and the way you do things is, as I mentioned, it's optimized for reads at scale. Cassandra databases can actually get huge, and they can still continue to perform at scale even with thousands of nodes; it's actually really awesome. But in order to do this, it denormalizes; we're going to talk a little bit more about this later, but we use a denormalized data model. What that really translates into is: unlike in the relational world, where I'm going to use probably third normal form, I'm going to have fact tables, I'm going to do joins, and I'm going to have some set of data in this table and some set of data in that table and probably join those together, in the Cassandra world, since it's denormalized, for a particular query I'm going to have a single table with everything in it that I need. This is what we're talking about with store together what you retrieve together: when you ask a question through a query, you want to ensure that whatever table you're pulling the data out of, you're getting all the information that you want. The other thing you want to do is avoid big partitions; we've actually kind of alluded to this, that was that IoT one I was talking about with no constraints. You want to avoid partitions that are either extremely large, and we'll talk about what those values are, or that keep growing indefinitely. You also want to avoid hot partitions. Cassandra is a distributed database, and you actually get a ton of benefit from this, but it means that your data may exist on different nodes throughout the system, and a hot partition is one that is getting a huge amount, or the lion's share, of, say, the reads that are coming in, and that unbalances the system. And again, we're going to give you examples
of what these are. Okay, so here's the first example: store together what you retrieve together. In this first case here, you see that nice green check mark; we're saying that per some unique video, we're not only going to store the time but also the actual comment ID, and you're just seeing the primary key here, but in this table it's the video and all of its comments. The point being: because the comments here are being stored as rows in my video partition, I can do one single read, read that partition out, and I get all of my comments in a single query; I don't need to do a join. Yes, and they're already sorted for you, right? That's part of the optimization: when I use created_at as a clustering column here, I'm saying order by created_at, and I'm probably going to default that to descending, so I do one read, I don't even have to do an ORDER BY, and I don't pay for that cost on the read because it's already been done on the write. That's optimization in Cassandra. So again, I can give that one video ID and boom, I get all the comments out for that one video ID. In the second case here, the one that's kind of separated off, that's a case where if I wanted to get the comments for a particular video, I'd have to go get the video, then I'd have to go get the comments, and plus, in that second case it's one comment per ID, so in order to read all of the comments for a video I'd actually have to iterate through a whole set of partitions. So again: store together what you retrieve together. Now, from a big partition standpoint, this gets into the fact that, even though in Cassandra there are technically no limitations, as in: can you store more than a hundred thousand rows in a partition? Yes. Can you have a partition that is bigger than 100 megabytes? Yes, it'll let you do that. But in a practical sense it's better to keep within these values. So if you look at the example with the red X, the second one there, the one with country and user ID, let's explain why this would be a big partition. Imagine a country like India compared to a country like Iceland, and imagine how many people you have in India whose users you're going to store; it could be, what is it, 1.3 billion right now or something, a huge population. So what you could have, in a case like the example we're showing, is that for that one country, India, I can end up with over a billion rows potentially, but for Iceland, not so much. You could end up with this case where you have this really big partition, and that could go well past that kind of hundred-thousand-row value we're talking about, so you want to avoid doing things like that. And what this also leads to, I was going to jump to hot partitions, so let me do that, is hot partitions, where, imagine in my data model I had a case where one country in that bottom example had so many of the values compared to other partitions; what that means is that partition is probably going to be getting hit over and over and over, and that unbalances things. It makes it so a node, or individual nodes, are really taking the load of all the work while the other ones aren't really handling much; it's not a good distribution. So that's one thing you want to keep in mind: you want to set up your partitions and your primary key so that in every
case you're distributing data well all right let me go back a little bit here okay um another thing with big partitions i kind of alluded to this earlier is i mentioned here like this is a good iot case actually it would be kind of normal to have a sensor id and have all of the reported information at particular time intervals for a sensor but here there's really no constraint this is just going to continue to grow so what i might want to do is add something like a ttl called a time to live right i might want to add something in here that is automatically going to control over like every week you know every seven days of that data is like cut off or something otherwise my partition is just going to grow and grow and grow and grow so this is this way you see this question mark here again depends on how you implement something like this it's really more something to be aware of and i just mentioned that okay all right so with that let's go ahead and go into our second exercise which is going to go to the second section here so if you already did the astro part at this point all your astra instances should be in place um so we're just going to go to create a table you can just scroll down and i'm going to let everybody do this one thing i will point out real fast you should see in the instructions when you're here in astra if you go to the cql console that's where you're going to be working out of it does say that here in the instructions i'll let everybody do this this would be kind of a first experience of getting in to a cqlsh prompt and creating a table with astra and with that i'll give everyone about 10 years and give us a thumbs up when you're done yes please and and shout if you have any problems [Music] i think david we have two eric's who are competing for questions oh i say that's great chad yeah both of our eric's are all over it yeah yeah yeah that's good thank you eric thank you for answering you know this i think we're just gonna call him eric's from now on it's so much easier it's quite a free to beat eric ramirez actually but a dynamic and sometimes i feel like eric ramirez actually has a cybernetic implant and he hides the wire that he has hooked up to the computer where he actually answers instead of typing he just thinks the thought and again don't forget for those of you who do finish and you create the table give us a thumbs up let us know that you are good [Music] and erica's on top of the emojis he just thought those emojis and they happened [Music] [Music] cool thank you bibic [Music] oh cool good one awesome oh and on discord also if you get some thumbs up that's wonderful put some music on david oh there we go i was gonna say it should be looping on its own there you go so for those who joined us a little bit late so in addition to the database uh instance we want you also to create the table in this exercise here and exercise two yes exercise two thank you yes go to make sure you finish with exercise two and then we'll go over it real quick uh once everyone's kind of there i'll show you what i'm gonna do while we're waiting [Music] [Music] all right seeing lots of thumbs up [Music] wonderful so so since i'm seeing a lot of thumbs up i'll go ahead and answer this one i see uh sherkanth and youtube asked a really good question very relevant to what we were talking about creating the primary key with city last name first name and email does it mean we can use only city in the filter clause and the results will be in sorted order of last name first name email so you 
have yes and no you are not limited to only filtering on the partition key you could also filter on your clustering columns but if you didn't consider the first part of your question if you didn't filter at all and in your where clause you said where city equals phoenix you are absolutely correct it will automatically sort on last name first name email but you can if you wanted to also filter on last name first name and email you can do that now the key though is that when you're using clustering columns if you do that you do have to like if i wanted to sort on email or filter on email i would have to include last name first name and email if i want to do a filter on first name i would have to include last names you always have to include the preceding clustering columns but you can filter on them there we go and eric is putting some queries in there yep yep thank you eric all right do you have anybody i'm just gonna ask this and then we can move on do we have anybody who is still working on this exercise or needs more time i should say [Music] okay all right so what i'm going to do is go ahead stop the timer let me just go do a quick run through [Music] okay all right so let's take a look here real fast just to go over what we did um so we go to the cqlsh console we log in uh with the username and password that we provided okay now if i describe the key spaces that i have here we'll see that i have more than just killer video so i'm going to use killer video to switch the context essentially and tell the database i want to be in the context of killer video by the way this is why i did not fully qualify your create table statements i did this on purpose um even though it's a best practice i would always in my code i always fully qualify um but here i i kind of was being a little sneaky um because what'll happen is if you're not fully qualified and you do not set the context and use use then you'll get an error it'll let you know hey you need to pick a key space tell me where you want to go so that's kind of why that's like that now i don't have any tables in here so if i follow the instructions um down we create our table and this is the exact thing that we saw on the slide right there's there's no real difference there now if i describe my tables we'll see that i now have this new table and that's it for this one this was like super simple just getting you create the table boom i've got something now that i can start working with and we're going to go into that into the next section all righty so anything before i get going on this next piece bettina that we need to answer oh actually next time when you do a demo on the on the console i need to make a picture let me let me just do that thank you for pointing that out yeah i should know better and what about the readme um perhaps i think if you have them both next to each other you probably want to make the readme bigger as well somehow um here we go how about that is that better see we got going here give us a quick shout out in youtube whether it's uh still too small or whether it's legible yeah let us know if it's yeah yeah much better okay good cool all right i'll go with that i'll go with that all right so this next section is actually kind of one of my favorites um because it really we we've kind of glossed over some things honestly right we say like okay you need to partition here's good here's bad but this gets into kind of the why right why do we care about this and i mentioned in the very beginning that especially for those of 
you who are coming from the relational world and we're used to doing things certainly some of us for decades right we're used to doing things certain ways there there is a difference in how you data model and cassandra but it's not that hard you just need to understand essentially kind of like one concept and then it's like oh okay so that's what we're going to explain here so if we start with the way that we data model in the relational world right is usually we start with some set of data then we apply like normal forms and everything most of us are doing a third normal form or whatever we apply these normal forms to our data and then that ends up that's how we generate our data model in a lot of cases too this is being done by an architect or dba or something like that where they're they're generating back you know if we go back a little bit back in my oracle days and everything um you know it was it was kind of funny we used to have those huge printers that like i don't have enough horizontal space to show my hands but um these huge printers right that would print out this big old sheet you'd have these huge uh entity relationship diagrams on there and that was like the holy grail right if you're developing an app if you want to perform some query what would you do you would go take a look at the erd figure out where all the joins were and everything like that and then you would write your queries based off of that so that's kind of the way we've been doing it for a while we have data we model the data based off of normal forms then from the application standpoint then we figure out what our queries are going to be based off that data model now i'm going to jump ahead real quick because i want to do this a little bit differently and then i'm going to go back and explain something in cassandra we literally flip that on its head right so we start with our application workflows we actually start with what are the questions i need to ask from my app what are the the ux what you know what are the types of queries that i need there we use that to generate our data model and then the data comes after and so some of you might be asking yourself like wait a minute did you just tell me i need to know my query is up front and i'm going to say yes and you're going to say you're crazy and i'm going to explain this right it's the the moment you even start like writing down like sketching out a ux on a napkin you've already started to generate your application workflows it's it's not that hard there's a process for it and you'll find this is actually very natural so let's break down a little bit though actually we did allude to this the whole time yes when you sort of said it depends on your what you want to get out right it depends what you want to know from your database whether you're just fine with just using email for uniqueness or whether you want to have the last name and the first name right we kind of alluded to that that in the construction it depends on your use case that means it depends on your application and the application dictates the queries which will then yes dictate your data model yes so let's explain a little bit why this is right why are we flipping this on its head why do we have this different process so if you go back to the relational database world and normalization right what is normalization there for it's to reduce data redundancy is for data integrity if you think about when relational databases were created when this whole concept of normal forms and stuff when that 
was all created, that was back decades ago, when the technology was significantly different than it is today. Disk was very expensive; if you remember, five megabytes was this huge platter, and they were slow, so systems at the time had to be optimized for that. You didn't have a bunch of space that you could use to put a bunch of redundant copies of data, and disks weren't very fast, so things had to be optimized for the systems of the time. But since then things have changed a lot. If you look at the types of resources that are available today, whether it's memory or CPU or disk, disk is by far the cheapest; technology-wise it's very fast now, especially with SSDs, and you can get a ton of it for cheap compared to the other commodities. Cassandra takes advantage of this. So if you look at the example here, in the relational sense, if I had, let's say, a set of employees and I had departments, it would be very normal for me to have an employees table, and what would that table contain? It would contain just information about the employee. Then I'd probably have a separate departments table, and that would contain information about departments. They would each have their unique IDs, and there would probably be some foreign key relationship between them. And in my query, what would I have? If I wanted to find out the department that Edgar Codd was in, I would perform a join between those two tables, and I would have a SELECT statement that would actually include multiple tables, that kind of deal. The key thing here, though, and this is the key difference: in my departments table I'd only ever have one listing for engineering, and I'd only have one listing for math, so I can make that connection, and I know that my departments table is not going to have redundant data or anything like that. In the Cassandra sense, we denormalize, so this is different. This is going back to what we were talking about: you want to store things together with what you're going to retrieve together. So notice what we do here, look at this employees table: notice at the top that my department is also included. So what happens if I had more than one person other than Edgar Codd in engineering? I would repeat that data. I'm going to store the redundant data in Cassandra, and that's okay; we're going to denormalize and flatten. Here's the big difference, and this is why this is done: because here I can get a single read, a single I/O, a single seek; I can go to that partition, pull the data back, and I get the whole thing in one shot. In a relational database, what ends up happening is, because of joins, Cartesian joins, I have all these different tables, and I have a lot more reads that have to happen to get back the same amount of data. And at some point, just from a scalability standpoint, and this is actually why Cassandra was born in the first place, with relational databases, even if you vertically scale them a lot, you're going to start to run into some ceilings, some scalability issues. Cassandra was built to be able to scale out and maintain your throughput performance at scale, no matter what that scale is, and one of the ways it achieves this is actually through denormalization. By the way, for my data warehouse folks, if I have any data warehouse folks out there, you've probably done something like this before in a relational database, where you flatten the data out. Why? Because denormalization is there as a read optimization, for read performance.
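To make the employees and departments comparison concrete, here is a hypothetical sketch (the table and column names are illustrative, not from the workshop) of the denormalized, query-first version in CQL:

-- Relational (normalized): employees + departments tables plus a JOIN at read time.
-- Cassandra (denormalized): one table shaped for the question "who is in this department?",
-- with the department repeated on every row so a single partition read answers it.
CREATE TABLE IF NOT EXISTS killrvideo.employees_by_department (
    department  text,
    employee_id uuid,
    last_name   text,
    first_name  text,
    PRIMARY KEY ((department), employee_id)
);

SELECT last_name, first_name
  FROM killrvideo.employees_by_department
 WHERE department = 'engineering';   -- one partition, one seek, no join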
Denormalization is there as a read optimization — it's for read performance. So that's it: in Cassandra we denormalize, and it's okay to have redundant data in your tables. Bettina, I feel like you've got one there. — Yeah, absolutely. I think it's a really good point that you don't have to be so conscious anymore about the space that's being used; we live in a different age now, so it's much more affordable to denormalize and have redundancy. But we might also live in an age where those servers go down more frequently, or you have to restart them, so the other benefit of this redundancy and denormalization is that you have more copies that can service your reads and your requests — you get better resilience. — Yes, absolutely. And to the point about simple queries: how many here — I want to see some thumbs up on this one — have written a 20-line SQL query with all those joins? Back in the day it was a badge of honor to optimize some of that stuff. In Cassandra you just flat out don't have that. You're going to find it is a query-per-table design, and the select statements are extremely simple because of this denormalized pattern. Oh, there are some thumbs up — lots of thumbs up — and I have a feeling some of you have ones longer than 20 lines; I've seen some really hairy SQL. So we've all experienced this, and that's another one of the differences with this pattern: your queries are going to be much, much simpler. Okay, so that gives you the why — what we're after. What we want to look at now is the process: how do I take those application workflows — and I promise you will be able to — and use them to generate queries, which you then use to generate your data model, your tables, and everything? We're going to go through this flow right now. The first thing we do is start with the conceptual data model; everyone is most likely familiar with this to some degree, whether you're doing things in relational databases or coding or whatever. We use that, along with our application workflows, to generate our queries; then we step through these steps to map our conceptual data model to a logical data model, and then optimize that into a physical data model that we use to actually create our tables. So let's do it. So far we've been talking about users and comments and videos. This actually comes from a real app — one of the reference apps we built years ago here on the advocate team, called KillrVideo. It's like YouTube lite: you have a concept of users, a concept of videos, and a relationship between them through comments — users can comment on videos, and users have comments themselves. A conceptual data model like this probably looks familiar to many people who have coded or done anything with databases; it's no different, it's the same type of thing. But really
what I want you to take away from this spot is the relationship between users, videos, and comments. We take that, and now we start to ask ourselves some questions about the flow of our application — this is the application workflow part. Take a look at the first use case: a user opens a video page. What happens? The video comes up on screen, but I also probably want to view all the comments for that particular video, most recent first. I talked about how comments in the YouTube or Discord chat make sense in time order — you don't want them in some random order. In the second use case, maybe a user opens their profile, and they want to see all the comments for that particular user, again most recent first. So we have these two use cases: getting comments when I look at a video, and getting comments when I open a user profile. Then we can literally take these flows — it's a query-per-table design, because we use this denormalized data pattern — and ask, for one particular query, what is the application asking me? In the first case, it's asking me to find comments posted by a user with a known id. If you remember, I said there was a convention we were using: the table name says what the table is by what we're partitioning on. So the first table name is comments_by_user. In the second case, I want to find the comments for a video with a known id; again, most recent first in both cases, so I want comments partitioned by video — those are our two table names. Let's go to the next step: here are the queries, and I'm not kidding, they're uber simple. I haven't even generated my tables yet and I already know what my queries are going to look like. I take these comments_by_user and comments_by_video tables — these kind of pseudo-tables, if you will — and say, I want to get all the information from a comment: select star from comments_by_user where the user id is some known uuid, and the same exact thing for comments_by_video. It's that simple, and since there are no joins in Cassandra, queries don't really get much more complex than this — yes, you could have other things in your where clause, but this is pretty much it. From that I can start to generate my DDL — I skipped a slide there — and start to see what my actual data model is going to look like. Let's take the two examples. By the way, the notation here is called a Chebotko diagram — those K's and C's. A K means partition key, a C means clustering column, and the arrow denotes the sort direction. In the first case, on the left-hand side, comments_by_user says: partition by user id, then order by creation date, with comment id there for uniqueness. Those fields — my partition key plus my clustering columns — make up my primary key, which denotes a unique row. The payload in this case is the video id, so I can link back to the video, plus the actual comment. In the second case, on the right-hand side, comments_by_video looks very similar; the only real difference is that I'm partitioning by video id instead of user id.
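For reference, here's roughly what those two access-pattern queries look like in CQL. The exact column names (userid, videoid) are assumptions based on the naming convention used in the workshop, and the uuid literals are placeholders:

```
-- "User opens their profile": all of that user's comments, newest first
SELECT * FROM comments_by_user  WHERE userid  = 11111111-1111-1111-1111-111111111111;

-- "User opens a video page": all comments on that video, newest first
SELECT * FROM comments_by_video WHERE videoid = 22222222-2222-2222-2222-222222222222;
```

Note there's no join and no ORDER BY: the ordering falls out of the clustering columns defined on the tables themselves.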
Now, here's the next part — this is what I want you to see. When we go from logical to physical, we make an optimization. It's subtle, so I'll tell you what it is: notice we've collapsed a column. Instead of having creation date and comment id — I'll flip back — I just have comment id, and if you notice, its type is a timeuuid. So I should explain a couple of things. Number one: in Cassandra, when we create ids, we use UUIDs. Coming from the relational world you probably used an int or something like that and automatically incremented it. That doesn't work well in a distributed database: imagine multiple nodes each get a request to create an id at the same exact time, and there's a network partition so they can't talk. They'll both increment the same way, and when they come back together later they'll collide. That's not a good idea, so we use UUIDs. They're 128-bit values, extremely large, and the chance of a collision is extremely low — Bettina, I don't even know of a case where I've heard of a UUID colliding between nodes. Have you? — I have not really come across that case, but obviously there's sometimes a nagging doubt, so in some cases — for example, for graph databases — we've actually recommended generating them explicitly. Other than that, I've not come across collisions, especially not for timeuuids, because there's that additional timestamp in there. — Right. So what it really comes down to is that the practice in Cassandra is to prefer UUIDs for ids; that way you reduce the chance of an id collision by orders of magnitude. Now, the comment id case is what I want to explain: notice it's a timeuuid. What we did is combine creation date and comment id into a single field, comment id, that encapsulates both the time and the UUID in a single type — that's what a timeuuid is: a timestamp and a UUID. Could you implement the table the way it was in the logical data model, with creation date separate from comment id? Yes, you could; it's no big deal. This is just an optimization. I point it out because sometimes people go from one diagram to the next and wonder what happened to the field — all we've done is encapsulate the comment id and the creation date into a single timeuuid field that contains both. From here we can create our tables. Again, we say create table if not exists — the "if not exists" is just a check that says if the table already exists, don't try to create over it — here's our name, comments_by_user, here are the fields and their types, and the primary key: again, we're saying "by user," so I partition by user id and then order by comment id.
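A minimal sketch of what those physical tables could look like in CQL — the column names and the explicit descending clustering order are assumptions consistent with the description above, not necessarily the exact DDL from the workshop repo:

```
CREATE TABLE IF NOT EXISTS comments_by_user (
    userid     uuid,       -- partition key: one partition per user
    commentid  timeuuid,   -- clustering column: encodes creation time + uniqueness
    videoid    uuid,       -- payload: link back to the video
    comment    text,       -- payload: the comment itself
    PRIMARY KEY ((userid), commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);   -- newest comments first

CREATE TABLE IF NOT EXISTS comments_by_video (
    videoid    uuid,
    commentid  timeuuid,
    userid     uuid,
    comment    text,
    PRIMARY KEY ((videoid), commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
```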
What's really cool is that because comment id is a timeuuid and carries a timestamp, when I order by comment id it will in fact order the rows by time. So with one read out of this table, per user id, I automatically get all the comments back already in descending time order. In the bottom example, if I pass one video id into my where clause, I automatically get back all the comments for that video, also in time order. This is what we're talking about: a single read pulls the partition, and the data has already been ordered, which means I can read it very fast later on. Okay, with that — a couple of questions? — Yeah, please. I was watching YouTube because there was a different tangent on the questions, and one of them goes back to denormalization: how do you ensure you keep the data in sync if you denormalize? Perhaps you want to say a few things about how best to do that. — Sure, there are a couple of different ways to keep data in sync. My first answer may sound like "wait, that's not a way," which is: don't. What I mean is that Cassandra is optimized to be an asynchronous platform; you get your best throughput and your best performance when individual queries going to individual tables don't rely on anything else. So if you can reduce the dependency of "if I update this one table I have to update these other hundred tables," that's a good practice. But there is a mechanism called a batch. I want to be very clear about this: a batch in Cassandra is not the same thing as a batch in a relational database. Batches in Cassandra are not for batching up large sets of data, and they're not transactions — nothing like that. Using one is very simple: you have a begin-batch and an apply-batch kind of thing in your statement, and you put all the related queries inside. What it says is this: suppose I have three tables that all carry similar key data — maybe they're all related to videos; as a matter of fact we actually do this in KillrVideo — and I have a requirement that if I update one table with video data, I also need to update the other two so they carry that related video data. I can put those statements in a batch. The batch is essentially a strong retry mechanism: if the whole thing fails immediately, none of the statements are applied, but if any one of them succeeds, it guarantees that the rest get written. So that mechanism exists, and it's called a batch. In the follow-up materials I believe there's a Katacoda scenario that gets into batches, and there's a bunch of follow-up material we can give you if you want. To answer the question: if you do have a requirement to ensure multiple denormalized tables are updated together, that's what a batch does. Was there another one? — Perfect, thank you.
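Here is a small illustration of the logged-batch idea just described — a sketch rather than the exact statements from the workshop, using the two comment tables from earlier with placeholder values:

```
BEGIN BATCH
  INSERT INTO comments_by_user  (userid, commentid, videoid, comment)
  VALUES (11111111-1111-1111-1111-111111111111, now(),
          22222222-2222-2222-2222-222222222222, 'Great video!');

  INSERT INTO comments_by_video (videoid, commentid, userid, comment)
  VALUES (22222222-2222-2222-2222-222222222222, now(),
          11111111-1111-1111-1111-111111111111, 'Great video!');
APPLY BATCH;
```

One caveat worth knowing: each now() call typically produces its own timeuuid, so in a real app you would usually generate the timeuuid once on the client and pass the same value into every statement in the batch.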
Yeah, there were a couple of questions about how to store big blobs — photos, for example, or the videos themselves — but I think Eric has answered that by basically saying: store pointers, and don't store the blobs themselves in the database. — Yes. Oh, and by the way, I see one from Deblina in Discord: can you explain the meaning of the downward arrow operator? Sorry if I didn't explain that — it just denotes the direction of the ordering. In the example here, creation date is in descending order and comment id is in ascending order; that's all the arrow tells you, which order the clustering column is in. And yes, I heard the thing about blobs: the recommended way is to store a pointer to the actual thing. I do know of at least one company that did something pretty slick — though it takes more advanced knowledge to get right: they took their video blobs, split them into partition-sized chunks, stored those around the cluster, and had a mechanism to splice them back together on the fly, in real time, as people watched the content. That's one of the few times I've seen someone do that instead of just storing pointers. Okay — anything else I need to know about, Bettina? — I think it's fairly under control. There was just one: can we use timeuuids instead of TTLs? But that might be out of context; we can probably address it after the next exercise. — A very quick response to that: a TTL tells the table in the database how long you want data to live. A timeuuid doesn't expire anything — if you do nothing else, that data just keeps growing; there's no control there. The TTL simply says that after this amount of time, the data gets cut off. They're two different things — there's a quick sketch of this just below — and if that doesn't really answer what you're asking, please post a follow-up and we'll get to it. All right, with that — oh, I didn't mean to come out to full screen there, my bad. We're going to go to the exercise now, and I heard you'd like the cqlsh console a little bit bigger still, so I'll start a little bigger. I'm going to give you 10 minutes to go into exercise step three, where you'll execute the CRUD operations, so let's get that going and get the timer up — and make sure the music comes back; Bettina can't hear my music, sorry, here we go. Oh, and Pavan asked how Netflix is handling data — we actually have a Netflix scenario you can go through; I believe it's in DataStax Desktop, and I don't remember if we have a Katacoda scenario for it, but give me a moment and I'll find you a link while everyone's doing the exercise. And again, for those of you who finish step three with the CRUD examples, please give us a thumbs up and let us know you're good.
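To make the TTL point concrete, here is a minimal sketch; the 86400-second value and the table are just illustrative:

```
-- Write a comment that the database will automatically expire after 24 hours
INSERT INTO comments_by_user (userid, commentid, videoid, comment)
VALUES (11111111-1111-1111-1111-111111111111, now(),
        22222222-2222-2222-2222-222222222222, 'This disappears tomorrow')
USING TTL 86400;

-- Check how many seconds of life a non-key column has left
SELECT TTL(comment) FROM comments_by_user
 WHERE userid = 11111111-1111-1111-1111-111111111111;
```

A timeuuid, by contrast, only records when a row was created; it never causes anything to be removed.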
I'm going to look — somebody asked a fun question that I think I have something for, the Netflix one; I just need to remember where it is. Oh, we've got some thumbs up, that's perfect. I think the one I'm thinking of might only be in Desktop. So Pavan, in YouTube: that link I gave you is to DataStax Desktop. For those who are interested: if you're used to Docker, DataStax Desktop will let you spin up Cassandra or DataStax Enterprise in Docker containers — you just click and go, and it provisions and downloads everything for you. It also has a whole set of really cool examples, and one of those is a Netflix use case. There are language examples in, I want to say, Java, Python, C#, and Node, and any of them will use those Netflix examples, so if you're curious about one of their data models, that's in there. I know we have something else somewhere — I thought it was in a GitHub repo — maybe one of the team knows. I'm only seeing one thumbs up so far, so how's this exercise going for people? Is it hard? By the way, a little out of context, but if anyone's curious about the music I'm using, I'm going to post it, because I've seen some comments on it. It's something one of the team, Alex, pointed out that we can use for copyright-free music — we don't want to go against copyright. I'm posting the links in both chats; I'm filtering on calm, inspirational, electronic music — I search for synthwave and then a mood of calm and inspirational. Good, I'm seeing a bunch of thumbs up. And the good thing is, if you don't get through all of the exercises with us, it's fine: the material is all going to stay up, and Astra is going to be there, so you don't have to worry about that. I see Eric has been all over the questions — cool, more thumbs up, wonderful. There's a great question in YouTube about read and write locks. When you're performing writes and reads in Cassandra, we don't lock the data — that's one of the really interesting things about how Cassandra does it. Cassandra is optimized for extremely fast writes — on the order of microseconds to low milliseconds — and the only way it can achieve that is by not locking data. What happens is this, and we'll touch on it again in a moment: when you insert, update, or delete, everything in Cassandra is essentially an insert into an append-only file system. When you write data, boom, it just appends it onto the end — the key is to be as fast as possible — and this is happening on the individual nodes as you write. So there is no lock from that standpoint.
When you read, it's the same thing: it reads from an individual partition and pulls that data back. Now, if you're talking about something we're not really getting into today called a lightweight transaction, there are some differences there, but in general we don't have the kinds of write and read locks that you do in a relational database. "Append and later competition" — oh, go ahead, Bettina. — No, go ahead; I was just saying we have a few thumbs up, but go ahead and answer that question. — So, "append and later competition" — I love how that's put, but it's not actually a competition. I'll stop the timer for a second, because I think we're getting to the point where we can move on. Cassandra is a distributed system, and we're glossing over a lot here, but by default we replicate to three nodes — that's something you set in your replication factor — and that means there are three copies of that data (there's a small keyspace sketch just after this section). So it's not a competition: when the client performs a write, it writes to those three nodes, and that happens asynchronously to each of them, into that append-only file system I mentioned. Now, what if I want to read that data right away — how do I know it's consistent? I think that's what was meant by competition. — Actually, I think they meant compaction; they corrected themselves. — Oh, did they? Oh — compaction! Okay, I totally misunderstood where that was going. You're absolutely right: it is append, and then later compaction. Compaction is a process that happens automatically, and it essentially coalesces all of those appended files. That was funny — no worries, we all typo. All right, it looks like most people got through the exercise, wonderful. Oh, I almost forgot — I've got to go through it myself, so let's do this real fast, and remember to keep the cqlsh console a little bigger. It's already blown up; please let me know if you can read it, because if I go much bigger it's going to be too big. The first thing we do is create the tables we were talking about before — I'm just going to paste those in — and then I'll describe my tables, so we should see three tables now, including the comments_by_video and comments_by_user tables I just added.
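Since the replication factor came up a moment ago, here is a minimal sketch of where that setting lives. The keyspace name is made up, and SimpleStrategy is only appropriate for a single-datacenter dev setup (production clusters typically use NetworkTopologyStrategy, and Astra manages keyspaces for you):

```
-- Three copies of every partition in this keyspace
CREATE KEYSPACE IF NOT EXISTS killrvideo
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE killrvideo;
```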
Next, in the commands we execute, we're creating a set of insert statements — this is our first CRUD operation, the C, create. I broke it down a little at the beginning so you can see what these are: they're the exact fields we talked about in the slides. We have a uuid for the user id, a timeuuid for the comment id — the combined uuid and timestamp — a uuid for the video id, and text for the comment. One thing I do want to point out is that I'm using now() — I provide a link that points you to the docs on this. now() is a way to generate a timeuuid, and for this particular case that's exactly what I want: as the data comes in, I want the timestamp, and I'm letting the database do that for me. — Yeah, I meant to say that earlier: with all the emphasis we put on UUIDs, we actually have the tools in the database and in the drivers to create them for you; you don't have to generate them on your own. — Right. Now, you'll notice — and I'm glad you pointed that out, Bettina — that we're using contrived values. We do that only because, from an exercise standpoint, if everyone generated their own unique uuids and then needed to select by them later, you'd all have to go hunting for your data; it's much easier to make it the same for everybody. In a real-world scenario I would most likely generate the uuid for my user dynamically, and you can do that simply with the uuid() function — I talk about it right there. So we're using contrived values only because it's easier when we're doing exercises together. Okay, second part: I've inserted data into both tables, and now we read. Let me pull this back just a little — again, if it's too small let me know; I just find it harder to read when everything is scrunched. You'll notice I had a little coughing fit here, and I talk about this: when you see a select star from a table in Cassandra with no where clause, that's the key thing — there's no partitioning going on. With the data we're using right now, I've only got a handful of values, so it's not a big deal. But imagine a Cassandra database handling petabytes of data at scale: if you had a table with a petabyte of data and ran that command, you'd essentially scan every single node and pull everything back. The same thing is true in a relational database — once your data gets big enough, you don't just select star over the whole table; you need some constraint. So even though we're using a select star with no where clause here, know that we're only doing that because we have a handful of values. The right way to do it is what you see next: I can do the select star just fine, but now I'm partitioning — give me the values in this one partition. The difference between these two queries is actually kind of intense: the first one scans every single node in my cluster for that table, no matter how big it is; the second one goes directly to that one partition on a node and pulls it back right away. It's a huge difference in what they're doing. Anyway, with that disclaimer I'll move on, because you'll see me use select star like this later on and I want to make sure people understand that.
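Roughly what those statements look like in cqlsh — the literal values are the kind of contrived placeholders used in the exercise, not the real repo contents:

```
-- Create: now() generates the timeuuid (time + uuid) server-side
INSERT INTO comments_by_video (videoid, commentid, userid, comment)
VALUES (22222222-2222-2222-2222-222222222222, now(),
        11111111-1111-1111-1111-111111111111, 'Best. Music. Ever.');

-- Read, the expensive way: no WHERE clause, so every node gets scanned
SELECT * FROM comments_by_video;

-- Read, the right way: restricted to one partition, a single targeted read
SELECT * FROM comments_by_video
 WHERE videoid = 22222222-2222-2222-2222-222222222222;
```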
So the select is how we read the data, and we see the data we inserted before. Now, moving to the update part: first we insert a new value — notice the comment in this case is "oh my god, that guy Patrick is such a geek." Actually, I didn't need to do that — sorry, poor Patrick — I already had this data in here and wasn't supposed to copy that one, but it won't hurt anything; it should be right here, and yep, there it is. Later on we update it: notice what we're doing — update the table, set the comment to the new value, where the video id and the comment id match. Imagine the UX, what's going on in your app at this point: say I'm editing a comment. At the moment the user clicks the edit button, my application already has the video id and the comment id being referenced, and I can just pass those into the query. And now we see "oh my god, Patrick is on fleek" — I'm probably too old to say that, honestly, but anyway. The last one is our delete: we delete the data, and when I select it back out, the row with that same video id and comment id — the Patrick one — is now gone. Now, something I do want to mention about delete, and I do talk about it here in the text: when you delete in Cassandra, it's better to opt to delete the largest amount of data in a single delete. What do I mean by that? In Cassandra I can delete a cell, a row, a range of rows, a partition, or even the whole table's worth of data. So if I want to delete everything in a table — imagine a million rows — could I delete it cell by cell? Sure, but now I have a million deletes. If I just want everything in the table gone, truncate the table. If I just want to remove one partition, then instead of deleting all the individual cells in that partition, delete the partition — you have that control, and then it's one delete for the whole partition rather than a million individual ones. That's what I mean: when you delete, perform the fewest deletes for the most data possible in your particular model.
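For reference, the update and delete steps look roughly like this in CQL — the literal values are placeholders (the commentid shown is just a syntactically valid timeuuid):

```
-- Update: the app already knows which partition (videoid) and row (commentid) it is editing
UPDATE comments_by_video
   SET comment = 'OMG, Patrick is on fleek'
 WHERE videoid   = 22222222-2222-2222-2222-222222222222
   AND commentid = 33333333-3333-1333-8333-333333333333;

-- Delete that single row (a narrower option would be DELETE comment FROM ..., a single cell)
DELETE FROM comments_by_video
 WHERE videoid   = 22222222-2222-2222-2222-222222222222
   AND commentid = 33333333-3333-1333-8333-333333333333;
```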
All right, and with that — yeah, go ahead. — There was a question in the console from Abhishek about how to determine the time it took to execute a query, so perhaps you want to show what a tracing output looks like. — Yeah. I see Eric already mentioned that you can turn tracing on, but let's have a look at what it shows. This was a small amount of data, but you can really break it down and get all sorts of information, and you can do it on whatever statement — all I did was set tracing to on. — You might want to zoom in, because I can't read that. — How about this? I'll make it bigger. — Yeah, tracing gives you a lot of information. — It does, it gives you a ton of information, and it's a fun thing to look at, especially when your Cassandra database is spread across multiple nodes — you can see which nodes come into play, if you want to geek out a little. But yeah, you just turn tracing on. Okay, Deblina asks in Discord how to delete based on a partition — whether the example deletes from a table based on a partition. In this case we are in fact doing that; let me give you an example (there's a short sketch of these commands below, too). Let me turn tracing off first — and don't forget your semicolon. Here in comments_by_user I have this partition right here, so if I wanted to simply delete by that partition, I would say delete from my table where user id equals — and I grab this id, because for this one user id I have multiple rows. Whoops, how did that happen? I ended up with something else in my buffer, apparently; let me do that again — that's better. Notice I have multiple rows in this partition — the same exact user id I'm partitioning by — because this one user has multiple comments. If I wanted, I could have deleted each of those rows individually, but then I'd have three delete markers. If I just delete like this, I've deleted at the partition level: one delete marker, and it removed all of my rows. Hopefully that's what you were looking for — and I see Eric put it in the chat as well, wonderful. Okay, anything else, Bettina, before we move on? — As far as I can keep track of the chat, I think we're good; there may be more questions about slow queries and so on, but we're good for now. — All right, we're going back to Menti now for our fun quiz — this is where you can get swag, so you've got to join. I'll leave this up for a moment; we'll use menti.com. Thank you, Deblina — and always feel free to follow up on that stuff if you have other questions. Here's the Menti, and you can scan the QR code — hopefully you all stayed in it. Give me a thumbs up to let me know you're there, and I'll get to the next screen where you can add your names, so everyone get your names in. Oh, Deblina is also asking: to delete a keyspace, do we have to delete or drop all the tables first? No, you can just drop the keyspace, and if you do, all the data and all those tables are gone — poof — so you don't have to delete everything in it first. All right, I'm seeing a bunch of names show up. Now, something key about the quiz: you may have lag on the video feed, so watch your phone, wherever you're running Menti — that's where to look when the questions come up, because that ensures you're actually getting the proper time to answer and respond.
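Circling back to the tracing and partition-delete demo from a moment ago, the commands look roughly like this; the uuid and the keyspace name are placeholders:

```
-- Show per-step timing and which replicas served the request (cqlsh command)
TRACING ON;
SELECT * FROM comments_by_user
 WHERE userid = 11111111-1111-1111-1111-111111111111;
TRACING OFF;

-- One delete marker for the whole partition, instead of one per row
DELETE FROM comments_by_user
 WHERE userid = 11111111-1111-1111-1111-111111111111;

-- And, per the keyspace question: dropping a keyspace removes everything inside it
DROP KEYSPACE IF EXISTS killrvideo;
```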
Because time matters — your speed to answer does in fact matter. All right, with that, let's do it. Menti, there we go — I hit the key and nothing happened — okay. What is a primary key? Again, look at your phone. Is it the same as a partition key, does it uniquely identify a row, is it a column in the partition key, or is it the main house key? Let's see — that's right, a primary key uniquely identifies a row. I'm always happy when a majority of folks get the correct answer here. For those who answered "same as partition key": the primary key contains your partition key, but it can also have zero or more clustering columns — so the primary key is the thing made up of your partition key plus your clustering columns (there's a tiny sketch of this distinction after this quiz section). — Yeah, and there are definitely table examples where the primary key happens to be the same as the partition key. — Yes, that's right, though logically it's a little bit different. All right, let's see what our leaderboard looks like — this is always the best part of these. Sai, Utpal, and Sarthik are our top three for now, but we're not done yet; there are more questions to go, and we've learned it's always anybody's game. Next one — our staff are always the slowest, right? What is the partition key? Hey, we just talked about this. Is it a consecutive number applied to each new record, a designated field in your table structure used to partition your data, an optional table field for optional partitioning, or another word for garage key? I have to ask, Bettina — is "garage key" a thing for a partition in Europe, perhaps, where you might have partitioned your stuff? — Oh, I see, I wondered — it's just not something we have in the Americas that I know of. Yes: it is a designated field in your table structure to partition your data, exactly. So what does the leaderboard look like now? Again, time does matter, and you'll see that applied in a moment. Wow — the top three held, and look how close this is — and lahore maker is working their way up. I always pretend I don't have favorites, but there's always a name I recognize from the chat that I try to encourage — go for it. Nope, no favorites. All right: what is Astra? A local in-memory version of Cassandra, Cassandra as a service in the cloud, a difficult tool, or a gateway to the stars? I always feel like I'm on a spaceship with that last one. Yes — it is Cassandra as a service in the cloud; that's what you're using today to spin up and manage your Cassandra database. — And hopefully what you'll use in the future to try things out and solidify your understanding with some experiments. — Exactly. And honestly, what we focus on in these workshops is the free tier, because from a development standpoint, when I'm spinning up my app, I don't want to mess with anything else or have to pay for things right now. Astra is more than that, though: there are heavier-hitting, enterprise-grade tiers if you ever want them; we just don't focus on those here.
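Since the first two quiz questions hinge on this distinction, here is a tiny annotated sketch (the table is hypothetical):

```
CREATE TABLE IF NOT EXISTS ratings_by_user (
    userid   uuid,   -- partition key: decides which nodes own the data
    videoid  uuid,   -- clustering column: orders rows within the partition
    rating   int,
    PRIMARY KEY ((userid), videoid)   -- primary key = partition key + clustering columns;
);                                    -- it is what uniquely identifies a row
```

When a table has no clustering columns, the primary key and the partition key happen to be the same thing, which is the edge case mentioned above.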
And if you're curious whether Astra can support production workloads — yes, it can; the free tier is simply what we focus on here. All right, what did our leaderboard do? Wow, is this going to stay close again? Sai, Utpal, and Sarthik are holding the first three positions, with only tiny adjustments — let's see what happens next. Question four of five: what flow is used to create data models in Cassandra — data models to application models, application to data, application models to data, or "create data models"? That's right, the correct answer is application to data: you start with your application workflows, use those to generate your data model, and the data comes after. That's a really key thing — if you understand starting from the application workflows and using them to generate your data model, you're in a good place. Where's the leaderboard? I thought I saw a difference on that one — no, still the same. All right, last one — it could still be anyone's game. What type is used to store both an id and a timestamp: a time id, a timestamp id, a timeuuid, or an "all the things in one" type? Yep, it's a timeuuid — a clear winner. I think we've had pretty clear winners on every question, so hopefully that means we're doing this the right way and folks are understanding what we mean. Let's see what happens — oh, I think we have some change — wow. Congratulations to Sai, Utpal, and Sarthik: you are the three winners. Make sure you reach out to Jack Fryer — Jack, you're back today, right? There he is — whether you're in Discord or in YouTube, reach out to Jack to get your swag. There's Jack's email, jack fryer at datastax.com, and I'll drop that in Discord as well — ah, I beat you to it, Jack. Thank you for playing the Menti, and again, if you were in the top three, please email Jack to get your swag. Okay, as we wrap up: what's next? I mentioned earlier that there are other resources beyond this one GitHub repo. One of those is datastax.com/dev. It's really built for developers and that kind of experience, to help you find a path — what kind of knowledge are you looking to gain — and to explore, and there are no paywalls on any of it. What's actually neat is that in a lot of the experiences — say I want to learn more about data modeling and I click "learn more" (this to me is totally rad, I love this feature) — as it explains things and you go down the page, eventually you reach a point where you can launch a scenario and do some hands-on work. That's powered by Katacoda, which I mentioned earlier. We
also provide some links directly to the Katacoda scenarios. What's really neat about Katacoda is that when you follow these paths, it spins up a Cassandra instance for you right there in the cloud. I mentioned earlier that if you don't have enough resources on your local machine to run the database, these are wonderful for that: they get you going right out of the gate, there's nothing to install locally, you explore and do everything in the cloud, and it's all free. So that's a really neat feature of datastax.com/dev; it's worth scrolling through to see if anything fits what you need. We also provided — if you go directly to Katacoda, and by the way, if you do it that way you will need to sign up, though again it's free with no paywall — a data modeling course we've created there, and you'll see the scenarios. They're use-case-based, and some of the questions coming in the chats definitely come down to particular use cases, so if you want to explore different ways to model data for those kinds of use cases, go take a look; they'll help you understand it. One last thing: if you go to katacoda.com/datastax, there's a whole set of materials, and each one is an individual course. If you want to learn more about, say, Kafka, or doing things with Kubernetes, or really getting into the admin side of Cassandra and the fundamentals under the hood — it's there, and there's even more on data modeling too. Between that and what's on dev, there's a ton of material for you. — I guess that partly answers a question from lahore maker on YouTube about learning resources focused on data engineering and architecture — how to build the pipelines that get the data into the database, and the integrations — so the Kafka resource you just showed is probably the best one. — Yeah, Kafka is a good one, and I believe there's a DSBulk one in here as well — yep, "bulk loading large data sets," have a look at that one. And I see Eric has given some additional links too. — Yes, and take us to community if you still have questions. — Right, and if you ever have follow-up questions — lahore maker or anybody else — you're not limited to what we're doing right here. That's actually one of the reasons we encourage folks to join the Discord chat: you can always reach out to any of us, on LinkedIn, Discord, wherever, and say "hey, I'm looking for this, do you know if it exists?" A lot of the time we're pretty aware of what's out there and can point you to it. And I'd be totally remiss if I didn't mention DataStax Academy. DataStax Academy is a great resource, especially if you're looking to — where's my, I've got it up here somewhere — sorry, I have a
different login for it; it's in my Safari — I know, I use Safari for this one thing, don't make fun of me. There we go — okay, I've already logged into it here. DataStax Academy is really nice, especially for those of you who are more on the admin path, and again it's all free; you just sign up, there's no cost. These paths get really deep if you want to get into the Cassandra fundamentals or really into data modeling. As a matter of fact, this developer path — if you click on it you'll see it has a whole set of — looks like I need to log back in, because I haven't been in here in a while; extra L, there we go, let's see if I get this password right — good. If you really want to go in depth: I mentioned very early on that there are literally weeks' worth of videos and material, and I'm not kidding. This developer path — you see DS201 and DS220 — really gets into the core of Apache Cassandra, and DS220 is all about data modeling. If you enroll in those, you'll find all sorts of videos, quizzes you can take, all sorts of things; they're really great resources on top of everything else if you want to geek out. — You have to change your email address, by the way; people are commenting on your Hotmail address. — I've had that for probably twenty years — hey, at least it's not AOL. All right, so with that: again, datastax.com/dev is that first place I talked about, and we have community — always feel free to ask questions there. The neat thing about community is that it's not just us: there are folks from the Cassandra community, inside and outside of DataStax, a lot more people who can answer, and it has more global coverage. You can always come to us individually, but we may also send you to community, because I can't answer 24/7 and I certainly don't know everything. So community is a great place to go, especially for longer-form questions. As I mentioned, we're all over social media — LinkedIn, Twitter, Twitch — and I need to update that link at the bottom, but just in case, we'll go back to the root. And here's the GitHub repo; I'll pop that out one more time. By the way, if you go the DataStax Academy route and look for workshops, you'll find we have all sorts of other materials out there, all built to be self-service, so there's plenty more. And with that, I think we have a couple of minutes — about six minutes left — so we could try to answer some questions if we have any. — There was one question about getting a certificate: we do not issue certificates for participating in these kinds of sessions, so there is no certificate for anybody who was looking for one. But take the learning with you — I think that's your best certificate — and try it out. — Yeah. Oh, here's a question from Pavan: like dev, is there any channel for administrators? You
know what I would say? I realize it's called /dev, but I would start there anyway — there are materials there that work for administrators too. A definite one for you is in Academy: DS201 and DS210 are really great paths for administrators. There's also a whole section in the Katacoda scenarios — actually, the Cassandra intro I went to is a lot of the DS201 material in Katacoda form, so you could do it either way. The one thing about the Academy courses for DS201 and DS210 is that they use VMs, and VMs mean you need the hardware to run things; but they'll walk you through everything step by step — you'll get into cassandra.yaml, you'll do all sorts of experiments — so if you want to go down that path, that's probably the right one for you. If you want to do the data modeling exercises, though, you can totally do them in Astra. — Yes, so if you go for the Academy courses on data modeling, don't go to the trouble of installing the VM; just do them in Astra — it's much easier, it's already set up for you, and it's really about how to create the tables and the subtle differences between partition keys and clustering keys that we discussed today, just in a lot more detail. You don't have to use the VMs, and we know they're sometimes painful. — That's a very good point. Eric also mentioned in the chat: check out the playlists on YouTube. If you go to our DataStax Devs playlists, you'll notice a huge set of videos — DS210, the admin one, is there, by the way. When I said there was a lot of content, I meant it: there's an amazing amount of information, so you can always go to the YouTube channel and watch those as well. — And there was one question about practicals for Kafka; that's also covered in Katacoda, so you can practice with Kafka and Cassandra there. Check out those scenarios — they cover a lot more than what we've shown today: testing, integration testing, data generation, all those kinds of things. You'll find interesting scenarios in Katacoda. All right, with that, I think we're good. Thank you, everybody, for coming with us today, and thank you, Bettina, for doing this together. If anyone ever needs to follow up, reach out to us on Discord, on LinkedIn, wherever — whether you have questions or you're looking for links to resources, just let us know and we'll try to help you out. — And give it a like on YouTube, right? — Yes, please. And if you subscribe to the YouTube channel, you'll see all the upcoming sessions and events, if you're interested in more. — Absolutely. We'll have more workshops coming, so keep an eye and an ear out, and we'll see you again. Thank you, everybody, for joining us today. Take care — bye — it was fun.
[Outro] Gifted with powers from the goddess Cassandra, who grew those powers until she could multiply them, move with limitless speed, and unmask hidden knowledge, with those powers she was able to fully understand the connectedness of the world. What she saw was a world in need of understanding, and from that day forward she
sought to bestow her powers on all who came into contact with her, empowering them to achieve wondrous feats.
Info
Channel: DataStax Developers
Views: 5,205
Rating: 4.92 out of 5
Id: pVLN6FsUeyo
Length: 118min 56sec (7136 seconds)
Published: Thu Sep 03 2020