#1 An Introduction to Apache Cassandra™

Captions
Hi and welcome, everyone, to the Cassandra Workshop Series. Hi David! Hello, Cedric, how are you doing, man? I'm very great. We have a lot of people today, I'm so excited. We are starting a journey for eight weeks, man — that's going to be awesome.

This is eight weeks that you get to hang out with us — not that we're particularly cool or anything like that — but eight weeks where we get to take you on a journey from having potentially very little knowledge of what Cassandra is, through learning about it, data modeling, application coding, admin, benchmarking, all sorts of stuff. So this is going to be really exciting, and we're pretty jazzed to get going today.

Yes, and you know what, it won't be only about Cassandra. It will be real apps, real stuff, so we'll talk a lot about cloud, Kubernetes, APIs, microservices. That will be fun. Okay, let's get rolling. All right, let's do it.

So first thing, I just want to set the context for today's event. We've been doing these workshops for some time, and we're used to a certain number of folks. This time we had a wonderful response: right now we're at around 8,000 registrations. The picture here that you see is approximately 10,000 people — I realize that's 2,000 more, I'm sure you can figure that in your minds — but just to give some context on the sheer number of you who have come to learn with us for this series. It's really awesome, so thank you all for coming. And with that, let's get into it.

So here's our crew. You may recognize some of us from our emojis, or not. We have a whole team of us here at DataStax, not just Cedric and I, supporting these events. At the top there you see me — I'm David Jones-Gilardi, the one with the hat — and Cedrick Lunven is my co-host today. Hello again, Cedric. Yeah, hello! Let me put us a bit smaller, like that. Okay, are we smaller now? That's right.

Separate from Cedric and myself, who will be hosting, presenting the material, and bringing you through all the fun quizzes and exercises today, we have a whole team of folks underneath. Joining us today we have Bettina Swynnerton, and we have Jack Fryer, our community manager — if you have any questions about the workshop or any of the follow-up stuff, Jack is your guy. Alex Volochnev is another one of our advocates, and then we have all sorts of other folks from DataStax who will be in the various chats supporting us today and answering your questions.

Part of the reason for us doing this is for you to be able to interact with us and ask questions — feel free! Since we're getting into the beginning of Cassandra, we will make comparisons today to relational databases and point out some differences, because so many of us come from that world. If you have questions about relational databases, ask them — ask how this compares. Do not feel worried about asking a question. We're happy to answer.

So, what we're going to cover today: this is week one of an eight-week series. The first thing we're going to do is get you bootstrapped. A lot of you on the chats — we've been watching them — are asking what materials you need, whether there are prerequisites, and so on. We're going to answer all of that here in bootstrapping. This is the session where we spend a little time getting everybody set up: we're going to get a database going, get all the materials, get you your resources, and explain how all of this works. The cool thing is, once we do it today, we won't have to do it again in the following weeks.

Then we're going to get into the what, why, and when of Apache Cassandra. I also saw some questions coming in on Discord and YouTube asking, hey, is this okay for beginners? What happens if I don't know anything about Cassandra? You're totally fine — this is exactly what it's designed for. We designed these specifically so that if you're not familiar with Apache Cassandra, you leave being dangerous with Apache Cassandra. So we're going to get into some of the Cassandra fundamentals and explain some of the really cool features. For those of you who have attended some of our previous workshops, where we did a Cassandra primer, we actually have a little bit more for you today: in this series we're able to focus more on each subject, so today we're going to focus fully on the fundamentals of Cassandra, and we get to dig a little deeper into consistency levels and the read and write paths — stuff we weren't able to cover before.

Then we're going to very briefly touch on data modeling. The reason is that to paint the whole picture and connect some of the dots we'll be talking about today, you do need to understand at least a little of what's going on in the data modeling piece. However, next week we'll spend the whole session on nothing but data modeling — the deep dive is next week, so I'm only going to barely touch it today. And then we'll leave you with the what's-next: some resources, and there is homework, which I'll talk about in a moment.

Just to give you the skinny: these live sessions will be two hours each week. There are two sessions per week — I'll talk about that again in a moment — one for North America and Latin America, which is the one we're doing right now, and on Thursdays another one for EMEA and APAC, to better fit people's time zones. You only need to attend one per week; just pick one. And then you'll have about two hours' worth of homework — again, more detail on that shortly.

Yes, and if you miss one of the lives, or both, it's not a big deal. Everything is recorded and available on YouTube, so you can go back, watch the session, and get the materials to do the exercises and homework. Yep, awesome.

Another thing I want to mention: there are a ton of you on the chats, and we are going to do our best to keep up. If you ask a question and don't get an answer right away, we are watching — if we don't catch it, maybe repost it. It's okay, we will definitely get to it.

Okay, so I want to back up a little and talk about the series as a whole, because we've seen a lot of questions coming in about it. Here's the plan. Right now we're in week one, and notice that there are two parts.
Part one is how to build applications with Cassandra. We're going to start with what we're doing today, the fundamentals of Cassandra, then get into data modeling, and then we're going to spend weeks three and four really getting into application development. Cedric, this is kind of your fun area — anything you want to say about it?

Yes. So really, the series is about building stuff. Of course we will start by explaining what Cassandra is, but the reason is really that we want you to build something on top of it, and to show why Cassandra fits next-generation development: microservices, APIs, Kubernetes, all the cloud-native style of development. We want to explain why it fits well and, along the way, answer any questions you have about Cassandra. The two weeks, week three and week four, are really dedicated to building our app. The first will be implementing the create, read, update, and delete operations using either Java, Node.js, Python, or C#, and then we'll build the REST API on top of it: if it's Java we'll use Spring Boot or Quarkus, if it's Python we'll go with Django, if it's C# we'll use the C# stack, and with Node.js we'll use Express. So it's all about building stuff together. It will be fun.

Then in part two we start getting more into deploying and monitoring your clusters and your applications. You'll see that in week five — some folks were asking before the session whether there would be any administration episodes: yes, this is where we start getting into more of the admin functions, though it's not all about administration and running your clusters. Week five is where we'll really focus on how to operate your Cassandra clusters.

Once we get into week six, we change gears a little and get into performance benchmarking your data models. This is actually really important. Funny enough, we did a workshop on this a couple of weeks back, and I asked a question in there — these are all anonymous, by the way, and we'll do some of this today — about how many people actually benchmark their data models, and as I suspected, it wasn't many. What it really comes down to is this: once you're powered by a database like Cassandra, just like with any database honestly, you need to make sure you're benchmarking your data models, because later on, if your app starts to grow, goes viral, and needs to scale out, your data model has to scale along with you. So we'll spend week six going through that process and what it looks like.

Week seven is where we look at how to test your deployments and how to troubleshoot them. And then week eight — I don't mean to throw week seven under the bus, but I get excited about week eight because it's just so much fun. If you know Kubernetes, if you're really getting into the modern orchestration patterns, Kubernetes is obviously where it's at right now, and some months ago DataStax released the Cassandra operator for Kubernetes, all free, which lets you launch, deploy, and manage your Cassandra clusters fully on Kubernetes. It's awesome. So we're really going to cover all the things in this whole series. Cedric, do you have anything to add?

Oh yeah — and the app we will have built together, we will deploy that as well on Kubernetes, of course. I hope this answers a lot of what you're all curious about.

So, how does it work? First thing: each week we will have two sessions, one for North America and Latin America, one for EMEA and APAC. Those are the same exact sessions — there's no difference other than maybe the people presenting; the material is the same in everything. It's really just to better fit the time zones of people in different parts of the world. You attend one of the two-hour sessions each week. We're going to have courses for you to look at, and, as I mentioned before, the Discord — might as well go ahead and pop the Discord link. Let me do this real quick... oops, that was the YouTube link, my bad, sorry. Let me grab the Discord here... let's try that again. I'm going to pop the Discord link here. There we go, wonderful.

If you are on the YouTube chat, that's perfectly fine; however, the YouTube chat is a little limited. We can do a lot more in Discord, so if you want to interact with the team, we really prefer that you go to Discord. By the way, I noticed today that I didn't turn off my Discord notifications, so please be nice if you DM me, because it will show up right there. Can I try? Can I try? Go ahead. All right — while Cedric trolls me live...

So yeah, separate from Discord and YouTube, which is where we'll interact with the chat, we're going to do quizzes and surveys with Menti.com. I haven't turned it on yet — I saw a couple of you saying, hey, the Menti's not working — it's because I have not actually started it. We'll get there in a second. Oh, there it is. Hey, Cedric! Oh, so it's working. It is definitely working. Yeah, I just told everybody, right? Isn't this the thing you're not supposed to do? [Laughter] There you go — see, I knew when I said something it was going to happen. We could have fun with it.

Then, for the runtime for the database: a lot of you were asking about prerequisites and whether you need to download anything. I'm going to get to the GitHub repo in a moment, but for the first four weeks we are going to use Astra — I'll talk more about what Astra is in a moment. (Hi there — hi Shailesh, I will accept the friend request when I have a moment.) Let me take over; I'm going to have to stop looking up there so I stop getting distracted laughing. We are going to be using Astra, which is DataStax's managed Apache Cassandra-in-the-cloud database — it's like a click-button Cassandra database that you can just launch and go, made and built for developers. We're going to talk more about that, and we're going to use it for the first four weeks.

Then, all of the materials, all of the slides, all the exercises — everything you need is in the GitHub repository, and we'll be sharing that with you as you need it. Yes, the GitHub repo is in the description below the video on YouTube; you have all the links there, Astra and the materials. Thanks for reminding me — we'll drop the links as well.
Hi Neo, hi there — I will also accept that friend request when it comes. So, all of the materials that you need are in the YouTube show notes, but we will also provide them as we go.

And then finally, for the coding starting in week three, we're going to use GitPod. If you're not familiar with GitPod — I'm sure Cedric has something he wants to say about it, he's right over here — GitPod is a really awesome tool that lets you take a GitHub repo, once you've hooked it up, and just launch essentially a full IDE in the cloud, right off that repo. We're going to be using that in the code examples. Cedric, anything else you want to say about that?

So it's like Eclipse or Visual Studio Code, you know, in the cloud. Just click and boom, it's working, everything already installed — Java, Python, Node. It's just great, because that way you don't have to install anything to do the exercises live. The two-hour session is not only us talking; you will also do exercises during that two-hour slot. Yes.

Now, by the way, I'm seeing a question that just came in — someone on the YouTube and Discord chats is asking about creating a server. Let me make sure: when you go to the link, it should be bit.ly/cassandra-workshop. Do you have a better link, Cedric? You want to drop the Bitly? This link should bring you right into "The Fellowship of the (Cassandra) Rings" server; you shouldn't have to create a new server or anything like that. Yeah, that's right. And I'm pretty sure the folks who are there will help you find the proper link. I did see some questions about the GitHub link — we are going to drop that here soon. And bit.ly/cassandra-workshop is written on the slide, and again, the slides are also in the GitHub repo. That's right — all the links we have are embedded in the slides, and those are also in the repo.

Okay, so the next part, number two, is homework. Here's how we've set this up: we do one two-hour live session, the one we're doing now, per week, and then each week we give you about two hours' worth of homework. Now, the homework is especially important for those of you who want to go on to take your Cassandra certification exams — it's built off the very courses and quizzes that the actual exam questions are based on. So if you are planning on that — and we really encourage it — then as part of this series you will eventually get a voucher for a free certification exam; I think those usually cost around 375 dollars. We're giving that to you as part of coming and spending time with us. That will be something we do later in the series, but when you go to take those exams, you're going to need the homework. You're absolutely going to need it.

Now, the courses, the way they're designed, you can sit and watch the videos and take the quizzes, and you'll essentially have all the information you need for the exams; that will take you about two hours. There are also exercises in those courses, though, and if you really want to get your hands dirty and really learn, do them — they're worth it. They'll usually have you download a VM, go in, and really start messing with Cassandra clusters, getting in there really deep. That's going to take a lot longer than two hours, but it's totally up to you; it's not required. We really suggest you take a look at it, but it's not required. So again: about two hours for the live session, and then two hours for your homework.

Okay, a couple more things before we get our databases going. At the top left there you see community.datastax.com. There are a lot of you here and there are a lot of questions, and some of the questions we get are longer-form — it may take us a moment to respond, not just because we're trying to keep up during the live session, but because they need a more comprehensive answer. This is exactly what community.datastax.com is for. It's like Stack Overflow for Cassandra. There are a ton of Cassandra experts, both internal to DataStax and external users — power users who have been doing this stuff for years — who are out on community answering those questions. So you might notice that sometimes we answer your question in the chat with a link to community. What that means is that we've gone and posted that question up on community, in a special group we've created for the series, so it can be answered in that longer-form format. If you see something like that, we're not dismissing you and telling you to go read a link; we're setting it up so we can have a longer conversation and follow up after the event. So we really encourage everybody to take a look at community and ask your longer-form questions there — or ask in the chats and we may respond there as well.

Yes, and I do have a question here about whether you get any certification or validation for taking the eight weeks of courses. We are thinking about rewarding you for being with us for the full series — I don't know yet whether it will be a certificate, a swag box, or badges; we'll figure out the form it takes — but yes, of course we will reward you for completing the full eight weeks. And after the eight weeks, you will have all the knowledge to pass both certifications: the developer and the admin certification. Right.

Okay, so the training I was talking about, the courses, is at academy.datastax.com. By the way, don't think we expect you to memorize all these links and URLs we're sending you — we'll provide them not only in the materials; they're in the GitHub repo, they're in the show notes, and you're going to get an email follow-up. You're going to get the links; you don't have to memorize them. At academy.datastax.com, even apart from the coursework we're giving you as homework, there are a ton of other courses and content out there — all free, by the way, DataStax provides this stuff for free — so if you really want to get your Cassandra on and learn even deeper on all sorts of things, it's a great place to go.

Then finally, the validation form. This is like the social contract we're building right now between all of you and us. Every week we're going to give you a validation form. All it is is a couple of questions — maybe three or four this week — to answer, that show us you've gone through the materials and the homework.
If you do this and answer them all correctly — if you get them all right throughout the whole series and submit them — then at the end we're going to pick a random ten out of all the winners, and you're going to get some cool swag. So this is a nice way to let us know that you're getting through the materials and understanding things, and at the same time there's a little incentive in there for you. Cedric, anything else before I move on?

No — and don't worry about the questions, they're not tricky at all. They just validate that you did the exercise, what was in the write-up of the exercise. Very easy questions. Okay.

All right. And by the way, I'm just watching the chats — there are so many messages. Yeah, there are a lot of messages coming through right now; I have a hard time reading all of them. I do see a lot of questions regarding the materials and such, so as we go, hopefully this will become clear: like I said, all the materials and everything you need will be provided to you.

Okay, and then the last thing: attend one of the sessions, do the two hours of homework, and then relax and chill out — you just learned a bunch of cool stuff for the week. Cool, time to relax.

All right, moving on to our first exercise. It's not really so much of an exercise; it's more just getting set up with the bootstrap. What we're going to do here is first go to the GitHub repo and clone it, so let's go ahead and do that. The first thing I want you to do — and I will drop this link for you all, and again it is in the YouTube show notes — here is the GitHub repo... here is the GitHub "leeko" — something weird came out of my mouth there; GitHub repo, not GitHub leeko. Oops.

Yes, so as usual, everything is on GitHub. You should either download the whole repo as a zip or clone the repo. Each week we will add new stuff in there — we are making the content better and better week after week, so it will evolve over time. From week two through week eight we will still amend the content during the weeks, so you will want to pull the new stuff every week. That's right — so I think it's the better choice this time around to go ahead and do a clone.

So here's what we'll do, and I'm going to do this with you real fast. Let's see, where am I — I'm in my base directory. Okay, here's what we're going to do: I'm going to make a directory. However you want to do this is totally up to you. If you're not familiar with GitHub, that's okay: if you click this Clone button here, as Cedric just mentioned, you can either download the zip or clone the repo. If you're not familiar with GitHub, you've not done this before, and you don't have git installed locally, just download the zip — you'll have all the materials you need right there in the zip. If you are familiar with GitHub, have used it before, and are using the git executable, then you can just copy the repo reference here. In my case I'm making a directory to put this in, and then I'm going to say git clone — I should probably make this bigger so people can actually read it. Yeah, sorry. Okay, hopefully that's easier to read now.

I'm just going to put the link to the GitHub repository again — I got that from clicking on Clone here; if you just click that little button there, it will copy it — and then run git clone with the name of the repository. And that's it; that is literally your prerequisite. Funny enough, the only thing we're really using it for today is pulling the notebooks. You'll notice that once this pulls down — okay, go into the directory — you'll see that we have folders for all the weeks. For today we're going into week one. I'm just doing this now, and I'm using tab completion, by the way; obviously there's no way I could type nearly that fast, so if you happen to have a machine with tab completion, it'll be your friend.

You'll notice there's a set of folders in here. The README is the readme we're going to see in a moment in the GitHub repo. Images — don't even worry about it, it's just the images we work with. Notebooks — we will use some of the notebooks today. And then slides — the whole deck, everything we're doing here, is in PDF format in the slides directory. By the way, this is all available on GitHub, so if you are cloning, like I said, just do a git clone of this repository; again, you can click on that button and copy the URL. Now, here's what I would like you to do: please give me a thumbs-up here if you've done that. We just want to make sure folks are at that stage and have the repository.

Yes, and there's also a question on the chat about this: you don't have to install anything during the eight weeks. You can do all the exercises using only your browser — we will provide everything: the IDE, the runtime to execute Docker containers, everything. You don't have to install anything. Now, to do the homework, and to get the Cassandra certification at some point — you cannot be a Cassandra expert without installing Cassandra at some point — but not during the workshop, not during the live sessions at all. That's right.

And I am seeing some questions come in: do you have to do this with us live? No, you don't. You can do it at your own pace with the recorded session, which will be available right after; you can just watch us during the live session if you like — totally up to you. And if you're not able to clone — here, let me go back to the correct screen — you can download it as a zip. So if you're not able to clone, or you haven't used git before, go to the repository — I'll post the link another time, right there in both chats, there you go — and you should see a page like this; go to the Clone button.

Oh, minimum system requirements — yes, I saw that. There are no minimum system requirements. Everything we're going to do today is done online, in the cloud, for that very reason. All you need is a browser — well, not IE — but everything we're going to do is on cloud-based machines, so you don't need any minimum requirements for today.

Okay, and to finish this off: again, if you hit the Clone button, you'll see the zip option if you haven't used git before; if you have, just copy the repo URL there, and then the command will be git clone — I said "get," sorry, git — and the name of that repo. Yeah, that was funny. And yes, we will use Astra — that's correct — we're going to use Astra today.
Okay, so with that, I saw a ton of thumbs-up; it looks like we have a lot of folks who are good with the GitHub repo. What I'm going to do next is just click on week one. Notice what I did there: I went to week one — this is what we're going to work out of today. The way we set these up, you'll see that when you start you have the link to the slide deck, and then we have anchors set up, so I can just click on one and go right to a section. We just did this, right? We just did this section here.

Yeah, and that's a point I want to make: each week will work the same. We're taking a little more time than usual because it's the first one, but you will find all the materials and all the steps you need to go through in a single README like that. So if you're already familiar with the concepts we're presenting, you can go on your own and start moving through the exercises at your own pace as well. That's right, that's absolutely right.

Okay — oh yeah, if you're using Ubuntu, you should be totally fine, you don't have any IE in there, exactly. See, now I'm able to watch all the messages come through.

Okay, so for the next part: once you have either downloaded the zip or cloned the repo, you're good to go from that standpoint. The next piece is creating our Astra instance. Again, in the GitHub repo, click on "Create your Astra instance." I'm going to do this along with you, and for those of you following along, you should do this right now as well. You'll notice it points to astra.datastax.com, so I'm going to click on that. Oh, by the way, Markdown doesn't easily let us set links to open in a new tab, so you will need to Ctrl-click or middle-click to open it in another tab. Slight annoyance, yeah.

Now, at this point, if you haven't registered for Astra before — again, we're going to use a free tier; there is no paywall here, nothing like that — just follow the instructions on the page. I'm going to walk through creating the database myself, but I've already registered, so I'm not going to go through that step; it should only take a moment to get your registration going, and we'll give you a moment to get it up and running.

Yes, and for the fields you need to provide, please provide the proper keyspace: killervideo — that's k-i-l-l-e-r-v-i-d-e-o. Why? Because in the exercises later on we expect you to use this keyspace name. You can of course provide whichever name you like, but then you will have to change it in the exercise notebooks.

I'm just answering a question here — somebody is asking, hey, wouldn't this be easier to do in Docker? We've actually done a ton of these workshops using Docker before, and here's what we run into, and why we do this on cloud resources: a lot of times folks' personal laptops, or whatever machines they're working on, just don't have the power. A base Cassandra workload isn't that bad, but once you start ramping things up, especially when we get to the Kubernetes session later, you need a pretty powerful laptop with lots of RAM, and not everybody has that. So we found it was a lot easier to do this through cloud resources — that way you can literally just come in with a browser and you're good to go. That's what that's about.

Yeah, and not only that, I would add that when you do Docker with Cassandra, the image is about 600 megabytes, and if everybody starts to docker pull the image right now, it will take a while to have everything downloaded and ready on your laptop if your network is not that good — and with YouTube we're probably already using most of your bandwidth anyway.

Oh — can you use your Academy credentials to log in to Astra? No, it's not single sign-on or anything like that; it is a separate registration for now, so you will have to register again.

I did see some messages coming through — it looks like, with the sheer number of folks hitting the sign-up page, it might be timing out a little. Let's just give it a moment. Really, what I'm looking for is for you to register and get to this page here, where you're waiting to create a new database. I'll make sure we're reaching out to the Astra team. Yeah, I can write to them — are you talking to them? Yeah. Okay, we'll give it a moment to catch up. So again, if you're having trouble with that, retry as much as you can, and we'll see.

Okay, I'm still seeing — oh, the Astra URL: it is right here in the repo, but you can also just go to astra.datastax.com. Thank you — I just saw Tasha, or Atisha maybe, who posted it; thank you for doing that.

Yes, the database name and keyspace — we haven't gotten to those yet. They're here in the instructions: once you follow this down you will find step 1b, and once you get there we give you exactly what those values need to be. For this session we really do ask that you use the values we have here, because the notebooks, the code later on, everything is based off these particular values. When you're playing around with Astra on your own, with your own databases, it can be whatever you want, but for today, please use the values we have here. So you'll see the database name killervideocluster, the keyspace killervideo, and so on and so forth.

Let me go ahead and start this on mine, since I'm already here. Hopefully you're going to see a page like this. I do recognize that some of you are getting the "temporarily unavailable" message — we'll get that fixed up. I'm going to move on for now, and we'll make sure to follow up with you on getting those running. For those who are at this page: make sure you select the free tier. This is the same information that is also here in the instructions.

Oh — Sheen, hi, I see that. Your friend doesn't have to register to show up; you could just have them show up. Jack, if you could post me the link to the Eventbrite, I'll get that over.

Okay, so the first thing, again: pick the free tier — that is very important; we're not trying to charge you for anything, and it is free forever, by the way. Now, I'm here on the east coast of the US, so I'm going to pick us-east, but if you happen to be somewhere closer to Europe, please pick that. Now, here's where the database name, keyspace name, and such come into play. You'll notice I said the database name is killervideocluster — I'm going to put that here. Then the keyspace name is killervideo — yep, put that here.
The username is KVUser — notice the camel case: capital K, capital V, capital U — and then the password. Again, these values are all here in the GitHub repo. Once you've done that and you're set up here — yeah, Rachel, we are seeing that the sign-up service is getting a little overloaded, and we have reached out to the team for that, so just sit tight and we'll get you going.

All right, so once you're here — and hopefully, given what I'm seeing, this works — it says Launch Database. This is going to launch your database; it might take a couple of minutes. Yeah, thanks Rachel. Okay, let's see — somebody was just asking me for the link so their friend can register; I'm going to drop that in the Discord there. Thank you, Jack. Good, okay, so some of them are coming through.

Yeah, you know, I see a neat question from KryptosChain wondering why it's called a database when the field says cluster name. So: even though Apache Cassandra is a clustered, distributed system, it is still a database. The cluster part of it is really more of an implementation detail — we're actually going to talk about that here in a moment. So think about this as your database. Even with all of those nodes and the clustered way it exists, you still have a single access point that acts like a database. It's still a database; it's just a clustered, distributed database. I'll talk more about that in a moment.

All right, so we're going to let this launch. We're going to move on, everyone, and Cedric is already talking to the team to make sure we can get that going. I think with so many of us on here, that was just a lot all at once — we kind of wondered if that would happen with this many people loading it up at once — so we'll just give it some time. Yeah, we will go through a few slides, so the service will have time to recover a bit and absorb the load. Yes. All righty.

All right, so: what is Astra? The first thing I mentioned is that it's Cassandra as a service. DataStax is essentially a long-time expert in Cassandra — we're the main company behind Apache Cassandra, all the drivers, everything — and we've been running Cassandra clusters and databases for years. So this is a fully managed Cassandra-in-the-cloud service: you fill out the details you saw me fill out, press launch, and that's it — you have a fully supported Cassandra cluster running for you. It was really built for developers, so you'll find that a lot of the available tools are there to make it very quick to spin up a database, get going, and securely connect to it. It really eliminates the need for you to install it yourself and manage it yourself, all that kind of deal. There are REST APIs and GraphQL APIs right out of the box, for those who want to use something like that. It is in fact powered by the Kubernetes operator for Cassandra, so when we get to that session in week eight, you'll find we eat our own dog food here — the very same operator we use for Astra is what we'll be using for Kubernetes. And on the lock-in piece: right now it is available on AWS or GCP, with no lock-in to any particular cloud provider, and more will be coming. The free tier we're using is a 10-gigabyte tier that will be available indefinitely, with no credit card, no paywall, nothing like that.

If you haven't used Cassandra before, you'll find that the language should look very familiar for those of you coming from a relational background or familiar with SQL: CQL, the Cassandra Query Language, looks and feels just like it, so it should be very familiar. That is the language we use to do things.

One of the dev tools paired with Astra is Studio — DataStax Studio. It is automatically configured and hooked up to your Astra instance, so you don't have to do anything more there. Once we get these databases up and running, we'll start working with the Studio notebooks.

And then lastly, pretty much all the major drivers, for all the major languages, are supported — regardless of whether you're using C#, Java, Node, or whatever, they all keep roughly the same parity, and they all work with Astra just fine. What's also cool about those same drivers is that you can use them not only for Astra but also for open-source Apache Cassandra and for DataStax Enterprise. So if you have a single application and, in one case, you're just starting off in a testing environment and want to spin something up on the Astra free tier, you can do that; then, if you later want to hook up to your own Cassandra cluster, it's a one-line code change with the same exact driver. It's made that way to be as seamless as possible, so you don't have to change a bunch of code when you switch between them.

Okay, let's see — how are we doing, Cedric? I still see messages. Yeah, so I think it's the authentication service that is struggling at the moment — I don't think it's provisioning, you know, twenty thousand VMs — but we are working on it. Okay. So what I'm going to do is just continue to let them work on that piece.

No, I don't think there's a Golang driver from us — Cedric, do you know that one off the top of your head? So there's something called gocql, which exists; it's not maintained by DataStax, but yes, there is a Go driver out there — it's called gocql, it's just not maintained by us. Yeah, okay.

Okay, so if you've gotten to this screen, that is wonderful. If you've gotten to the point where the database has come up, great; for now it looks like the authentication service is having a little trouble, so we're going to let it catch up. Why don't we get into the material — the cool thing is that even if we have to spin up the Astra databases a little bit later, you can always do that after the fact, after today's live session.

Okay, so this next section is really getting into the what, why, and when of Apache Cassandra. Let's talk about what Cassandra actually is. First off, Cassandra is a NoSQL distributed database. Notice the NoSQL — it's not relational; it came as part of the NoSQL revolution. And it's a distributed database, which means it is built of more than just a single instance.
If I was talking about MySQL or Oracle, yes, I may have something like a leader-follower setup where I can have read replicas and such, but those databases are not built to be distributed at their core. Cassandra is, in fact, built to be distributed at its core.

Cassandra is built of what we call nodes — each installation of Cassandra is a node, and each of those nodes has a set of capabilities. Something I want to point out here: you'll notice that, somewhat tongue-in-cheek, for the throughput I put "lots" of transactions a second per core. What I really mean is that capacity and throughput depend entirely on the hardware you have for a particular node, the version of Cassandra you're using, all sorts of things. Just know that the throughput is many thousands of transactions a second per core — roughly three thousand at the low end, maybe up to around twelve thousand, depending on how your node is sized and which version you're using. That's why I put "lots" in there. Capacity is about, say, two to four terabytes per node.

Now, what's interesting about this: again, I said it's a distributed database, which means I have multiple nodes that together form my database. Somebody asked that database question earlier — even though it's a cluster, why do you keep saying database? Because right here, this is forming my database: all those nodes together form a single database. It's just an implementation detail that they happen to be distributed nodes.

These nodes communicate through a protocol called gossip. A little later on I'll start talking about things like tokens and token ranges, and what happens if a node goes down, and there are all sorts of other ways these nodes communicate. It's a peer-to-peer, leaderless system. From the leaderless standpoint, that means any node can do what any other node does — there is no special node; any operation one node performs, another node can perform — and you're going to see how that becomes really important to the power of Cassandra. Gossip is just one of the protocols used to communicate a lot of that information, things like token ranges and certain state, and it happens automatically; you don't have to manage it manually. And finally, nodes are logically connected through what's called a data center — the ring. So here, this would be an example where I have a ring of nodes that forms my Cassandra database.

Okay, so one of the really cool features of Cassandra, in my mind, is its ability to scale linearly. What does that mean? It means that if I want to double my capacity, or double my throughput, I double my nodes. And this scales essentially indefinitely. The funny thing is, the chart here only shows up to about 32 nodes — I know of Cassandra clusters out there that are thousands of nodes, and this still holds true. So, looking at the x-axis at four nodes — and by the way, don't look at these numbers and go "oh, four nodes equals fifty thousand operations a second"; again, it depends on how things are sized, the version, all sorts of things — what I really want you to pay attention to is that at four nodes, in this example, I have fifty thousand operations a second; at eight it's just about a hundred thousand; sixteen is just about a doubling of that; 32, and so on. And this holds as you increase the size of your cluster, so you can essentially scale your database up indefinitely as you need to.

And this is done by horizontal scaling. So many of us who have worked with databases and systems in the past are used to vertical scaling. The joke we had back when I used to work on Oracle was that you can scale up until you run out of money — you can add faster CPUs, more RAM, faster disks, whatever, but at some point you can only go so far. Cassandra solves this by allowing you to scale out horizontally on cheaper commodity hardware: you just add more nodes to the system. Again, the numbers you see here, like three nodes giving a hundred thousand transactions a second — don't get caught up on that; it's an illustration. What I want you to take away is that if I had three nodes doing a hundred thousand transactions a second and I wanted to double that throughput to 200,000, I would add another three nodes for a total of six, and so on. Cedric, do you have anything to add to this?

Maybe a couple of famous brands, famous companies using Cassandra. Apple, for example, made a public statement that they run about 175,000 nodes of Cassandra — that's crazy — with some clusters of twelve petabytes online. Cassandra is a real-time database, so you can have petabytes of data available online. With that much volume there will be some trade-offs, but yeah. Exactly, exactly.

And by the way, I just saw a couple of questions come in. One was: does each node represent a different machine? It can, absolutely — they could also be different VMs. It really depends on the hardware and the setup, but yes, each of those nodes is an instance of Cassandra, whether that's a physical machine or a bigger single machine running multiple VMs or multiple instances. It's just an instance of Cassandra. Another question I saw: is it nodes per data center or nodes per cluster? In Cassandra — we're going to get into this in a moment — you have the concept of data centers, and you can have more than one. So if I had two data centers, all of those nodes together would form my cluster. The cluster is that single database entity, and within it I can have multiple data centers. So honestly it could be either; it just depends on your setup.

Okay, so one of the questions we get all the time is: you've got this distributed database, and I put data into the system — how does it know where to go? Is it just sending the data to all the nodes, or what's happening there? Here's what happens, with a very simplistic example. If you look at the table on the right-hand side, you'll see that I have country, city, and population. We'll get into what partition key means later; just know that in Cassandra the base unit of access is a partition — that's how the data gets split up and sent around my database. As data comes into the system, it automatically gets distributed around my database; this is not something I have to do manually. And notice there are cases where the partition key, like USA at the top right, repeats — that's the same partition for that data, so it goes to the same place. But overall the data is naturally spread around my database, and you're going to see in a moment where this distributed nature really shines.
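(For reference, here's a minimal CQL sketch of a table like the one on that slide. The table name, column types, and sample values are assumptions for illustration; only the country/city/population columns, the killervideo keyspace from the workshop setup, and the idea of country as the partition key come from the session.)

```cql
-- Hypothetical sketch of the slide's table; name, types, and values are assumed.
-- The partition key (country) is what gets hashed into a token, and the token
-- decides which nodes own each row.
CREATE TABLE IF NOT EXISTS killervideo.cities_by_country (
    country    text,
    city       text,
    population int,
    PRIMARY KEY ((country), city)   -- country = partition key, city = clustering column
);

-- Both rows share the partition key 'USA', so they hash to the same token
-- and land on the same replica nodes.
INSERT INTO killervideo.cities_by_country (country, city, population)
VALUES ('USA', 'New York', 8000000);
INSERT INTO killervideo.cities_by_country (country, city, population)
VALUES ('USA', 'Chicago', 2700000);

-- You can even ask Cassandra for the token it computed from the partition key:
SELECT token(country), country, city FROM killervideo.cities_by_country;
```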
nature comes in where it you know where it really kind of shines okay i love oh nice cedric with the with replication here white again click again oh no there we go okay cedric added something to her slightly yeah yeah so i just replicated myself you know yeah just bump set right here so your data is replicated as well so one of the other really cool core tenets of cassandra its ability to replicate right and this replication is automatic so now i've got this ring with these nodes in it that form my database and we just talked about how data gets distributed around that ring but how does this actually work right here's what happens so notice the numbers that you see here in this ring we're going from 0 to 100. now by the way these values we just use 0 200 to make the illustration easy to explain the real token ranges are an astronomically large number you're not going to run out of those um we just again it would be it would be absurd for me to actually put the real values up there because you'd see these long numbers you know that take up the whole page so point being you're not going to run out of these tokens we're just using 0 200 for this example here's what i want you to understand though if you take a look at each of these nodes you're going to see they have a number what i'm saying is is each of these nodes owns a range of tokens so if i go from you know i got 0 to 17 um 18 to 33 you know 33 to 50 you know 49 so on and so forth they're going to each own a range so what happens is that when some data comes in let's say i have data going to partition token 59. now by the way if you're asking how do i know what where did partition token 59 come from don't worry we'll get there just understand that for now some days being written i have a partition token value of 59 how does the system know where to put it what happens is is that any node can handle a request remember i mentioned this is a leaderless system so any node gets that it just happens to be the one on the top left and that when it handles the request at that time that's what we call the coordinator node it just means for that moment for that request it is now taking on the operation to manage this particular right so the coordinator says okay this is for partition token 59 where does that need to go well guess what remember down here you see this purple guy from like 51 to 67 well 59 fits in that range the coordinator node knows that because again they communicate these nodes are communicating all the time so they know what ranges each one has and the coordinator sends that down to the node that has that particular range now this is what we call replication factor of one i only replicate it to one node what happens in a replication factor of two check out what just happened there i add a ring and i shift it now if we take that 67 there as an example that purple color notice now i have two of the rings that have that purple color i have two nodes that cover that range so what happens when some request to write data comes in well here we go we have our partition token 59 now this time it just happens to be that node up there in the top right that one that's labeled 17 that now becomes my coordinator because it's handling the request it knows that okay i've got a replication factor of two two of those nodes have this data i'm now going to forward that data onto two nodes now two of the nodes have the data what happens in replication factor three you guessed it i add a ring i shift it and now three nodes own that range notice the 
purple color again in the rings you'll see three nozzles in that range again if i have some data comes in quadrature goes i have three nodes that own that range i'm going to forward that on to those three nodes and the data goes into three places so now there's something really cool to point out here there's some really neat benefits you get from this some of you might be thinking do i have three copies of my data yes you do so from the total system do you do you have more space that you need to use in disk yes you do but disk is cheap honestly compared to all the other commodities right but here's the benefit that you get right now i have three nodes at any time can facilitate a request for that data so i automatically have a natural load balancing that can happen and the drivers actually do a lot of this for you so there's a performance benefit there for one but for two what happens if a node goes down how many of you have been working with aws where an instance gets reset for a security update or you have on-prem stuff a hard drive fails whatever things are going to fail period they always do cassandra was purpose-built to be able to handle these types of cases so in a case where if i had a node go down just like this and i had a request to write some data well guess what i can still write my data to two of those nodes but i'm still not lost yet because at that coordinator the one that acts like my coordinator it'll store what's called a hint and it'll wait for that node to come back up and when that node comes back up it'll replay that data on that node to ensure that all of them are in sync right so this is some of the self-healing mechanism that is already baked in to cassandra as part of just how it works and there are even some configurations that you could lose two of those nodes and you could still be available up able to read and write data and then when those nodes came back on they'll automatically replay yeah before i move on center you got anything else you want to add oh man the chat is on fire i do have so many questions so i will try to make some questions so first the coordinator node is per request you know it's not an eden master node when we call the coordinator it's per request the client come and can ask any of the node to take the request so this node becomes the coordinate or not for my request and how does a client will pick one of the node it can pick any of of it you know the data is evenly distributed it can pick any of the node and this node will become the coordinator node then the coordinator node will you know send the data to the proper replica he is sending you know the the data to multiple replicas in parallel but the replication is async and we will see in a minute how does it work how much time do we wait our client wait until we can say okay the data has been written in enough nodes it's called the consistency level we are going into it in a few minutes and is that some of the questions you want to get to there you have more yeah maybe about the replication factor so replication factor is uh hold at the key space level and you can totally change the replication factor you can alter key space with replication and increase or decrease the replication factor if you do so then uh data will be uh you know replicated using the go sipping protocol peer peer-to-peer you know exchanging the data and moving the data when it needs to go i laughed there because cool off in the discord chat said rf0 means no data lol [Laughter] that's great you know what but 
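(For reference, this is roughly what declaring and changing a replication factor looks like in CQL on a self-managed cluster; on Astra the keyspace and its replication are set up for you when you create the database, so this is just a sketch. The data center name 'dc1' and the numbers are assumptions for illustration.)

```cql
-- Replication factor is declared per keyspace ('dc1' is an assumed data
-- center name on a self-managed cluster).
CREATE KEYSPACE IF NOT EXISTS killervideo
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- You can raise or lower it later; the cluster then re-replicates existing
-- data (typically you follow this with a repair so all replicas catch up).
ALTER KEYSPACE killervideo
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 5};
```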
you know, there actually is a way to set up cassandra fully in memory. it should be pointed out, by the way, that rf 3, as cedric just explained the replication factor, is what we use in our examples, and a replication factor of 3 is the standard. honestly there's no real good reason to go lower than a replication factor of three, and if you're doing anything else, like trying to go higher to four or five, you should really know what you're doing before you go there. you could have a thousand-node cluster, use a replication factor of three, and be perfectly fine. so replication factor three is in fact the standard; if you're curious which one to use, use three. yeah, it's very important. and i do have some more questions about the replication factor. so it's per keyspace; the recommended replication factor is three, but there is no default: when you create a keyspace you specifically provide the replication factor, you have to provide it. i just saw a really neat question come in from alex c in the discord. we'll get to this later in the application development pieces, but i just want to address it: the drivers also keep track of your cluster's topology, and yes, they are token aware; as a matter of fact the drivers by default have a lot of the load balancing and token-aware policies baked right into them, so you don't have to manage that directly. i just saw that and wanted to answer it. all right, so yes, rf 3, i'm seeing a couple of these, yes, replication factor three is suggested. the replication factor is essentially the setting, when you create your keyspace, that tells cassandra how to automatically replicate, which i'm going to talk about right here as a matter of fact. notice the image on the left-hand side: you see a map of the world and three data center rings that are connected. another awesome feature of cassandra is that it doesn't just replicate within a single data center; it can replicate across an entire cluster over multiple data centers, and it does this in an active-active fashion, automatically. so check this out: on the left-hand side i have a ring over the americas, one in emea, and one over in apac. these are three separate data centers in a single database; this is one database. what's really cool about the replication and what it does for you is that once this is set, if i wrote some data in the americas, that data would automatically replicate out to my other rings and automatically be available in those other regions. i could write some data in the americas and immediately read it from china or wherever, and vice versa, and this happens automatically for you at wire speed. there's obviously physics that comes into play; there is going to be latency if you're going from, say, australia to the united states. and this is exactly why this is so powerful: you want to place your data, your database, as close to your users as possible. imagine i had a single data center with some database just sitting on the west coast of the united states.
if i had a whole ton of users in australia, they're not going to get the greatest experience, because they're always going to be incurring that latency; again, just physics limitations. so what cassandra allows you to do is place data centers throughout the world, in different regions, putting your data where your users are; your application instances talk to those, and you let cassandra handle the hard work of replication, getting data where it needs to go, onto the right nodes. oh, do you have something to say? yeah, again a lot of questions about the replication factor. okay, so in a cassandra cluster you can indeed have multiple rings, as we see here, and when you define the keyspace you define the replication factor for each ring: you can have a dedicated replication factor in each ring. depending on whether you can easily afford to lose a node or not, you could increase your replication factor to three, five, seven, but what's the point of having your data present seven times? the reason we replicate the data is mostly so that if you lose any node, it is still there. also, which nodes do we pick, how do i know on which nodes a row gets replicated? a table can hold an astronomically large number of rows (the token space is on the order of 2^64 values), and those are evenly distributed around the nodes. for a ring, each node is in charge of a token range, a portion of your table, and when the coordinator gets a request it hashes the partition key with a hashing function; based on that value (it's an integer, a long) it knows on which nodes this data should be stored. it's a hashing mechanism based on tokens, and it works per table. i think i've covered five or six questions there. yeah, and i just saw a question come in: what do the numbers on the nodes mean? so i'll just address this again real fast, let me pull back here. all these numbers are just an illustration; please do not take away from this that these are absolute values that nodes get or something. these are just the token ranges that each of these nodes has been assigned, that's all they represent. and by the way, in a real cluster with real token ranges, not this kind of contrived example, they might vary quite a bit compared to what you see here. it just says, here is the slice of the pie that i am going to take data for; that's all those numbers mean. okay, so moving over to the right-hand side of this one. no no, let me disappear again, let's do it properly, i was in front of it. so on the right-hand side, hybrid cloud and multi-cloud: this is another really cool feature. you can distribute cassandra all over the world, but you can also distribute it on whatever installation, implementation, or hardware you want; it doesn't care. it could be on-prem, it could be any of the cloud providers. you could have a single database, and that's what we're illustrating here, that spans azure, aws, google cloud, and your on-premise system, all in a single database. just think about what that means for a moment. think about a case where you're in the e-commerce world and you have to deal with things like black friday.
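as an aside, cedric's point that each ring gets its own replication factor is expressed directly in the keyspace definition; a minimal sketch with illustrative data center names:

```
-- one keyspace, one replication factor per data center
CREATE KEYSPACE IF NOT EXISTS killrvideo
  WITH replication = {
    'class'       : 'NetworkTopologyStrategy',
    'americas_dc' : 3,
    'emea_dc'     : 3,
    'apac_dc'     : 3
  };
```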
for something like black friday you need to burst up for a time: you've got your on-premise hardware, but you need extra capacity for a while and then want to pull it back down, and that's a great use for cloud. with a single database you could just create another data center in the cloud provider of your choice, attached to that same database you have on premise, and boom, now you've got your burst, and you can remove it when you're done. maybe there's some functionality in google cloud that's specific to what they offer and you need your data there to leverage it. maybe you're trying to reduce vendor lock-in; we actually see this a lot, more and more folks creating clusters that span multiple cloud providers so they don't get strong-armed by any one particular vendor. the point is, cassandra doesn't care where you put it; you can put it anywhere, and you can do that with a single database. it's really pretty powerful. okay, so i want to move into consistency now. there is a relationship between your replication factor and consistency, so let's take a look at what that is. let's see how many folks are familiar with the cap theorem. should i even ask and have people say stuff in the chats? the chat's already blown up and i'm afraid it's going to go crazy. if you haven't heard of the cap theorem before, essentially it says this: in a distributed system, in a failure scenario, you can only ever guarantee two of the three items you see there: consistency, availability, and partition tolerance. this is not a cassandra thing, this is any distributed system. so let's break down what this means a little. imagine i have two nodes and i have a network partition: something happens, the network is severed (it's always the network, right?), and now these two nodes cannot communicate. well, in cassandra both of those nodes can still facilitate requests, but they cannot maintain consistency, because there is a network partition. that's what we mean by only being able to maintain two of the three guarantees in a failure scenario; by the way, if there's no failure and everything is humming along nicely, you can maintain all three, no problem. now, cassandra defaults to what's called an ap system, an availability and partition-tolerant system. that means cassandra by default is going to be available even in outages; a lot of times you'll hear it called the always-on database, and this is why. it defaults to ensuring that if something goes wrong it's available, it can facilitate requests, and your app doesn't go down. you can configure it, if you want, to behave as a cp system, where you essentially put the focus on consistency, and what's really cool is that you can configure this per query. it isn't a database-wide setting where you have to be cp or ap; it's per query, if you need it. but again, cassandra defaults to availability and partition tolerance. so where does consistency level come in? remember our replication factor from before, which you see on the left-hand side. replication factor of three is the standard, and that's what we're going to use for all of these. so here's the relationship.
replication factor says how many nodes the data is going to go to. so if i have a replication factor of three, regardless of my consistency level i'm going to replicate data to three nodes. now if you notice on the right-hand side it says cl = 1, that's consistency level equals one. all i'm saying there is: i'm writing at a replication factor of three, i'm writing to these three nodes, and all of them are going to send acknowledgements back to the coordinator and back to the client that they wrote the data; however, at consistency level one i'm only waiting for one of those acknowledgements. that's it: one node acknowledges the write, i'm good to go, i move on. now if you take a look at quorum: quorum is a majority, really your replication factor divided by two, plus one. with a replication factor of three, a majority would be two. so what does this mean? again, replication factor is three, i'm still writing to three nodes, that hasn't changed; however, i'm going to wait for two of those acknowledgements to come back before i say everything is okay. how about a consistency level of all? you can guess what happens: i replicate to the three nodes and i wait for all three of them to acknowledge the write before i continue on and say everything is okay. as you might imagine, as i increase my consistency level i also slightly increase the latency it takes to wait for all those acknowledgements, so there's a balance point here. there's another thing i want to point out about the consistency level of all: imagine what happens at all if i lose a node; i can't maintain that consistency. so we're going to talk in a moment about something called immediate consistency. immediate consistency essentially says this: when i write some data and then read it right after, i expect that i'm going to read what i just wrote. most of us, i think, expect that, and it is absolutely achievable with cassandra. what it comes down to is this: if you want immediate consistency, to ensure you can read what you just wrote at any given time even in a distributed system, then as long as the consistency level of your reads plus the consistency level of your writes is greater than your replication factor, you're good to go. let's look at some examples. one way to achieve immediate consistency is to write at a consistency level of all, again where all three of my nodes at a replication factor of three acknowledge they've gotten the write, and then i can just read at one. i'm guaranteed at that point that i'll get the data i just wrote, because i wrote to, and got acknowledgements from, all of them. but as i was alluding to a moment ago, i have no availability tolerance here: if i lost even one of those nodes i would not be able to write at a consistency level of all, and that's not really a good trade-off; cassandra is supposed to be robust, always on. so how can i achieve this while keeping my availability tolerance? this is where reading and writing at quorum/quorum come into play: i write at quorum, meaning i'm still writing to all three nodes but just waiting for two to acknowledge, and then i read at quorum.
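consistency level is set per request; in cqlsh (studio has an equivalent drop-down) it looks roughly like this. the keyspace, table, and columns here are assumptions, not anything created yet in the workshop:

```
-- assumes a keyspace with replication factor 3
CONSISTENCY QUORUM;   -- quorum for rf 3 is floor(3/2) + 1 = 2 acknowledgements

INSERT INTO killrvideo.users_by_city (city, last_name, first_name)
VALUES ('Houston', 'Doe', 'Jane');   -- the coordinator waits for 2 of the 3 replicas

CONSISTENCY ALL;      -- would wait for all 3; fails if any replica is down
SELECT * FROM killrvideo.users_by_city WHERE city = 'Houston';
```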
this is kind of neat, because now i can tolerate a node going down and still facilitate my reads and writes with no seeming effect on the system. what's also cool, and this gets a little into the weeds, is what happens when you read at quorum. let's pretend for a second that the node at the very bottom, the one without the check mark, had stale data. when i read at quorum, an automatic checksum comparison is made, so if any data is seen to be stale, not only will the correct data be sent back to the client, but the stale data will automatically be repaired. this is part of the self-healing mechanism that comes with cassandra, and the key thing is you get that when you're reading at quorum. so the standard here, by the way, just like replication factor three is the standard, is to read and write at quorum/quorum. it's a good balance between latency and availability tolerance and all of that, and unless you have a really good reason, just do this and you're going to be in a really good place. cedric, anything you want to add to that? astra is feeling better; it will be ready in a few minutes, or this minute. it was due to authentication issues; the cassandra databases under the hood are of course working very well, it was mostly authentication against the platform that was lagging behind. okay. and i'm seeing some questions, so i might as well take a moment to answer those. i see one here from anu sharma in the youtube chat: how does cl quorum for a read decide which node has the correct data? well, when you're reading and writing at quorum/quorum, that's the key thing, because i'm writing at quorum as well, i have this automatic overlap, and that goes back to that equation (it's not really an equation): remember, i said if the number of reads plus the number of writes is greater than your replication factor, you have immediate consistency. in our examples the replication factor is three, so if i write at all, that's three, and read at one, that's four; four is greater than three. in the case of quorum/quorum it's the same deal: i'm writing at 2 and reading at 2, that's 4, greater than 3.
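written out compactly, the rule being quoted here, with rf = 3:

$$
\text{quorum} = \left\lfloor \tfrac{\mathrm{RF}}{2} \right\rfloor + 1 = 2,
\qquad
W + R > \mathrm{RF} \;\Rightarrow\; \text{immediate consistency}
$$
$$
\underbrace{3}_{\text{ALL}} + \underbrace{1}_{\text{ONE}} = 4 > 3,
\qquad
\underbrace{2}_{\text{QUORUM}} + \underbrace{2}_{\text{QUORUM}} = 4 > 3,
\qquad
\underbrace{1}_{\text{ONE}} + \underbrace{1}_{\text{ONE}} = 2 \not> 3
$$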
what that really means is i have overlap. it doesn't matter which two nodes are being read from; i only need one of them to be right, and since i wrote at quorum i'm guaranteed to get that scenario. so if i read from two nodes and one of them was stale and one was right, i'll get the right data sent back and the stale one automatically repaired. and like i said, without going into all the details, there are other mechanisms under the hood doing checksums and comparisons to ensure it, but it is a guarantee that you will get back consistent data, no matter which two nodes are read from in this scenario. if i were writing at one and reading at quorum, that might be a little different, but since i'm writing at quorum and reading at quorum, this is the way you achieve it. hopefully that answers the question. okay, let's see what else is coming through here. yeah, i've had a hard time keeping up, there are so many questions, so maybe i'll be quicker by just answering: yes, the live session is being recorded, and the video will be available right after this in the youtube feed. oh, hello david, what happened to you in florida, can i see you? ah, here we go, the skype popped out for a second and popped back; it sounds like you asked me something? no, we lost you for a minute. i'm back, i'm here. okay, so i was answering questions about data centers: you don't need multiple data centers per cluster, you can have one, that's totally fine. and why would you do multiple data centers? well, maybe due to the law you want a dedicated data center: you live in germany and you need a dedicated data center in germany, because legally the data has to stay close to you, in the same country. or you want the data close to the client app. i'll use uber as an example; uber is famous for using cassandra a lot. so if i'm in north america and using uber, i would like to connect to nodes close to me just to reduce latencies, and the same if i'm in another part of the world. you can read and write from everywhere, and cassandra will replicate from one ring to another asynchronously, using yet another consistency level dedicated to data centers; we won't go into the detail, it's in ds201, which you should take as homework after this session. okay, i was just catching up on some questions i saw, let me get back over here. all right, so this is the standard: again, read and write at quorum/quorum and you're going to have a good balance between latency, availability tolerance, all of that. there is another, weaker consistency, called eventual. sorry, just interrupting you again: astra is back. you can go to the status page, the fix has been deployed, and you can now try to create the database so you're able to do the exercise after this. okay good, once i'm done with this we'll pop back over and take a look, thank you. so there is another example here, weaker consistency, or what's called eventual consistency. really all this means is that i'm going to read and write at one. it's the fastest way to do it; however, you do not get some of the self-healing mechanisms and the same guarantees you do with quorum/quorum and immediate consistency.
now, for some cases this might work wonderfully well. let's say you've got a part of your data model that's recording something like the number of likes; does that absolutely need to be consistent? no. so could you read and write at one? totally. and funny enough, netflix is a big power user of cassandra; if you didn't know, when you're using netflix you're totally using cassandra, and it's awesome what they do with it, they know their stuff. years ago netflix did a talk on this, and they tried really hard to get inconsistent results reading and writing at one, and they couldn't do it. now, we'll never say that means it's always okay to just do it: if you have a requirement for immediate consistency, use quorum/quorum and you're good to go. however, if you don't have that requirement, and you have a part of your application that needs to be fast and can tolerate data being stale by some milliseconds, this might be an option. but again, just like i said for replication factor, stick to three; here, stick to quorum/quorum, and if you need to do something else you need to know a bit more to make sure you're setting the right expectation of what you want. quorum/quorum is your friend. okay, so with that, like i said, we're going to take a look at how the astra thing is going. let me refresh and go back. did you hit the timeout? i did, yeah. okay, here we go. funny enough, even though that one part died for me, it still created the keyspace, or rather the database. yeah, it was the auth system, it wasn't the database part; the database part is fine. i'm going to go ahead and terminate mine so people can follow along. we'll go back here. oh good, i see some nice answers coming through, great. let me refresh my thing here. and to know the values to enter into the form, they are in the github repo; it's killrvideo. yeah, you got it; we'll probably share that link again, and i'll make it bigger as well. so the value is killrvideo, let me put that in the chat... there you go, the chat is moving so fast. here, let me make this bigger. for those of you who are getting back around to creating your astra database, here are the values that you need; again, this is in the github repo, and cedric is going to drop those links for you. oh, the link is there as well, i just provided the value, since people like to copy and paste, but i'll do the link too. i beat you to it this time. teach people how to fish instead of giving them fishes [music]. so i can create my new database; i'm going to go ahead and do this with you. again, choose the free tier, choose the region closest to you, fill in the database name here, boom. okay, now this chat is on fire. also, the link is below the video; there's a million links, you can pick any of them. the keyspace name is killrvideo, and again this is just for what we're working on.
you don't absolutely have to use these values for your own databases, but we do ask that you use them for what we're doing here today. so the database should be created in two minutes, something like that; especially on the free tier we are spawning a single node, so two or three minutes should be fine, even if there is a message telling you it could take 20. not at all; the message is just there, honestly i think it's kind of a holdover, it usually just takes a couple of minutes. so once you fill out those details and hit launch database, you're going to see something like this; you click view database and you should see this screen. as a matter of fact, let's try this again: if you could give me a thumbs up if you are at that screen i was just showing a moment ago, i want to see if we're doing better now. and while we're waiting for those thumbs up, yes, i see a bunch coming in, good, and in the discord, wonderful. oh, do you want me to hide? i do have the skype logo behind me, so i will stay here, and you have to tell us how many thumbs up you have. right now we have 258; i'm seeing a ton coming into discord and some in the youtube chat. and by the way, we should drop the menti info as well: 45 45 72, yeah, 45 45 71. at the same time i'm just picking random questions, it's scrolling so fast. oh, so is there a replication factor in astra? i just saw that question from atul in the discord. in the free tier, and this is getting into fun details, the replication factor is effectively one: those are single-node instances, meant for "i'm just spinning something up, testing it, playing with it". once you get to the other tiers, they go to your normal replication factor of three with the right number of nodes and everything. so there is a replication factor in astra; it's just that astra is built to make it essentially a push-button, single-click experience to spin up your cassandra database without having to deal with all that. you'll notice, and i'll talk about this later, that you don't have to create the keyspace directly either: you put it into a field, say launch, and all of that is hidden in the implementation, done automatically for you. that's where that's coming from. if you're seeing the launch database button, great; i'm actually seeing a ton of folks, it's coming, i see a lot of people. so on the free tier the astra setup is a single-node db with replication factor one, because there is a single node, that's right, and you cannot go into the cassandra yaml and all the small details; it's cassandra as a service. but you will see in a minute that we'll use something called datastax studio, and you also have the cql console, so if you're familiar with cassandra you can go with that. okay, so i'm going to move on. i think we have at least 500 folks that are where i would hope they are, so i'm going to move on and give those databases a moment to come up. and here's the thing, everyone:
if your astra database hasn't come up just yet, that's okay; you can always follow along afterwards, all of this stuff is available online, and you can always reach out to us if you need any help, so don't worry about that. okay, so i'll go ahead. cedric, you have something? can you go back to the previous slide? i'll make it very quick: i'd like to talk about vocabulary, on the slide with the distributed, hybrid-cloud stuff from before. so the biggest term in cassandra is a cluster. a cluster is composed of one or multiple rings; what you see here on screen is one ring composed of multiple nodes. when you create a keyspace (a keyspace is like the logical database you want to put in cassandra, and of course you can have multiple keyspaces in a cassandra cluster), you say on which data centers this keyspace will live, and for each one you provide the replication factor. i just wanted to refresh the vocabulary. so, rishi in discord asked a question, and apparently my fingers did not function properly when i typed my response; i meant to say "for today, yes, please". i fat-fingered the answer, not the question. the question was: do you need to create the database with the same details provided in the github readme? yes, for today, yes; again, because all of the exercises we're going to use are based off of that, and if you don't use those values it's not that it won't work, but then you have to go change a bunch of data model stuff and it's not going to be very smooth for you. so use the values in the github readme for today; if you're doing your own thing later, you can do whatever you need. so cedric, did you get what you needed on the slide? yeah, totally; just one reminder, because i've got you: when you switch to github and share your browser, people are asking you to make your font bigger. oh, okay, for which one, the readme? yeah, probably the readme, but also astra; they can barely read it. oh really? okay, got it, i'll make it bigger, thanks for letting me know. all right, so now it's a big slide, and i love this slide. cedric is probably going to have a lot he wants to say here, so i'll give a very quick high level and let cedric dig in. you want to just go? yeah, let's go. okay, so show me the first line again. what we want to show here is that based on the capabilities we have explained about cassandra, you have dedicated use cases. first, scalability: we told you, you need more capacity, add a new node; you need more throughput, add a new node. so the first range of use cases focuses on high throughput, high volume, heavy reads, heavy writes. if you have already heard about cassandra, it's probably for one of those use cases: internet of things, event streaming, time series. why? because those use cases need to write data very, very fast, and cassandra is famous for writing data very fast. the second capability is about availability. can you make it visible, david? all right: the data is replicated and there is no master, so you can lose any of the nodes, it's not a big deal, the system is always on
and there is no data loss. we had a question before about how the hints work; it's also in ds201, but the coordinator node will store on disk the data that has not been sent to a replica, and when that replica comes back online we just stream to it what it missed. okay, so these use cases are any use case that wants always-on and no data loss: especially caching, distributed market data, pricing, inventory. okay, let's move on. oh, i've got to stop real quick, sorry cedric. someone just asked me in a dm how to implement ssl for this, and we're not going to get into it today; in weeks three and four we'll hit on that in the app dev material. absolutely, we will explain how we connect to astra using the driver with ssl and two-way certificate authentication. so, one of the things i mentioned is that astra was built for developers, and anyone who's ever done ssl, in java especially, knows the pain; it's so challenging. so we removed essentially that layer completely and added what's called the secure connect bundle. we're going to get into this in weeks three and four, but when you create your astra database you're going to see a link for the secure connect bundle; all you do is download it, and it has all the artifacts for tls, the whole secure connection piece, completely configured for you to handle a secure connection to that database. it's extremely easy to use, so we're going to get into that. i had to stop because i got so excited when i saw that question, because honestly it's a really nice feature: it significantly reduces the time you have to mess around with anything to get a nice secure connection to your database. anyway, sorry. no, that's the point of doing this event live: we can stretch a little bit, we can run late, it could be two and a half hours, but we love to answer your questions live, that's the point of the event. okay, so let's go to the next line. after availability, it's distributed: remember, you can distribute nodes all across the world, or use multiple cloud providers, so this is a solution for having one data layer available across all the clouds; it's distributed. so why would you create multiple data centers? i told you already: law compliance, gdpr, or reducing latencies, because you want your application to be as fast as possible. cassandra can have data available everywhere, and you can read and write from anywhere; cassandra will replicate asynchronously, remember, and will make your client wait depending on the consistency level you ask for. you can wait for one node close to you to have been updated, or the full cluster; you decide. and of course, the higher the consistency level, the longer the response time, but you can totally achieve consistency. and last but not least, and probably the most important: show me the last part. yes, i'm answering questions; next time i'll just take over the slide. okay, and you're hiding, so i'll hide you for a sec, sorry about that. it's about cloud native. so now you see that cassandra can be installed on any cloud, and with the drivers you can build any kind of application on top of it: real-time applications, apis, microservices. and of course, because everything is now compliant with kubernetes, you can deploy not only your app but also the full cassandra database in kubernetes
and this is what we will tell you week after week: what we are building is a cloud-native application. so not only your code, stateless apis in the language you like, but also the database. you shouldn't have to worry about where your data is or how to connect; cassandra is there, you have a bunch of nodes close to you, and you simply use the data layer wherever it is. hello, hybrid cloud and multi-cloud. okay, so let me put you back, david. oh, i'm back. yes, i've seen a couple of questions, by the way: some folks are asking what happens if they terminate the keyspace or terminate the database, will they lose their data? yes. if you're terminating it, you're not just dropping a table or a keyspace, you are completely terminating the database at that point. now, that's also part of the free tier: part of the use is that you can create and terminate things as you wish without having to worry about the data, because it's a free tier and you're just doing things for testing. but yes, to answer it: if you terminate it, you will lose your data. if you want to change the keyspace name, you need to recreate it; you'll need to terminate and recreate the database. again, being a fully managed system, there are a lot of underlying details that are assumed and done for you. and if i'm right, even in regular open source apache cassandra i don't believe you can just rename a keyspace anyway; you would have to recreate it, if i remember right. yeah, you can't just change a keyspace name right off the bat, not even a table name; you have to create a new table. and we'll get into why that is a little later, because you're going to see that when you're changing your keyspace you're doing a bit more than just messing with the name. all right, i saw a lot of folks did in fact get their astra instances up, so what i'd like to do now is go into the first exercise. remember, before, we had you either download the zip or clone the github repo; that means you should have a notebooks directory in there, and this is where it comes into play. here's what we want to do, and by the way, this is all within the readme itself, i'm just walking you through it so you can see it. oh, and i've gotten reports that my font is too small, so here, hopefully you can all read that better, i'll blow it up a little. okay, here's what we're going to do. hopefully once you're at this position, or if you see something like this, you can click on the database itself and you'll come to the database detail screen. on the left-hand side, if you scroll down, there are two ways to do it: you'll see an actions drop-down you can click on with a link for developer studio, or you'll see developer studio down here with a launch now button; either one will work. if you click on that, it spins up an instance of datastax studio automatically configured to talk to your astra database, which is actually really cool. this is one of those areas where we asked you to use the username and password we put in, just to make it easy for everybody: you're going to use the same username and password from the instructions, again the kv user and kv password values, and hit test,
and that will take just a moment... there we go, i'm connected successfully, wonderful. i'll hit save, and once i do that, that's it; i'm now connected and ready to go with studio. i see a question: do i need to shut down my astra instance if i'm logging off? no, your astra instance is going to keep running; you don't need to shut it down or anything. over time it will park itself for you if you're not using it regularly, and when you come back you can unpark it, but you don't need to turn it off; it's completely running on its own, fully managed for you in the cloud, so you should be good. so once you fill in those details, the username and password, that's it, you're connected to studio. here's what i want you to do next, and this is actually really neat. remember, you cloned the repo or downloaded the zip, so i'm going to pull up my finder (because i'm on a mac, and the finder is funny as far as i'm concerned) and go to my cassandra workshop series week one directory. so you will import the notebook, yes. here's what's really cool: you can do this a couple of ways; you can click on this plus, or you can just drag and drop it. so i'm just going to grab it and pull it right in there. again, whether you got it from the zip or the git clone, it doesn't matter; i just want you to go to that number two. let me pull that up again; can i use that same zoom function in here? you can. this is going to be the datastax studio tarball, which is actually the studio notebook itself; you just drag and drop it right on there, and it should bring you to a page that looks like this. and just like we did before, if you could give me a thumbs up when you are there. yes, i can tell you that people are working, because the chat has calmed down; still some folks with issues connecting to astra, but for the most part it's okay, and i think people can interact with astra pretty fluently, so people are working. i can also see some extra questions during this time, as usual. so: the difference between a keyspace and a database. okay, in a cassandra cluster you can have multiple keyspaces, and let's say a keyspace is like a database, or a schema, in oracle; it's not exactly the same, but it's really the logical grouping. when you connect to cassandra later with the driver, you can provide a keyspace, and then you will only see the tables living in that keyspace; you create objects with a keyspace scope. the best practice is to create one keyspace per application or use case, because it's kind of a bounded context, a business context: tables related to one another should fit in the same keyspace. okay. all right, i see: do we have a session event tomorrow? no, the next session is next week; we'll do one of these live sessions each week. actually, i'm seeing we will be live tomorrow, david: not us, but a couple of other folks, and that's another session for apac and emea. yes, tomorrow, thursday, alex and eric will be there, tomorrow morning my time, 12:30 ist, doing the same content. yes, same content and everything.
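to make cedric's keyspace-scope point concrete: in cqlsh you either set a default keyspace with USE or fully qualify table names (the names below are just the workshop's illustrative ones):

```
USE killrvideo;
SELECT * FROM users_by_city;             -- resolved inside killrvideo
SELECT * FROM killrvideo.users_by_city;  -- equivalent, with an explicit keyspace
```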
okay. i see a question: which program do you use to open the md file? it's a readme, so it's really meant to be opened directly in github. if i go back over here to github, this is the readme itself; when you go to it, it brings you right to that readme, that's all it is, and i'm sure there are readers out there that will let you open it locally too. so, can you show the import notebook process again? let's totally do it, and make your screen bigger; you want the drag and drop with a huge font? huge font, all right. so what i did is i went to the cassandra-workshop-series folder that i git cloned, but you could have downloaded it. in everything you downloaded, you pick the file in the notebooks folder: week one, getting started with cassandra, notebooks, and the one we want is the datastax studio tar, the one that says 2. all i have to do is drag and drop it. now, i've already got one, but boom, it goes ahead and imports it automatically, so i have another one. yeah, for those asking: you don't have to install datastax studio. on the astra home page, at the bottom (i think it's on the left side), you have something called developer studio, launch now, right here. if you're there you will be moved to studio; you'll have to provide the user and password again the first time, remember, kv user and kv password, and you should be good to go. okay, yeah, i saw a couple of people saying the github zip is not extracting completely, that's really interesting. vignesh, we'll talk about the homework and assignments after; we'll get there. all right, so you hit launch now, and once you've done that... yes, so why are we dragging and dropping? well, this is the exercise: datastax studio is a notebook-based tool, so the exercise is a notebook. we import the notebook as a tar file and then open it, exactly as you would with a jupyter notebook or apache zeppelin, the same kind of tool. and by the way, the first exercise is really to make you discover what datastax studio is: as you can see, you create cells, some have markdown text, and some have cassandra commands. i have to say, gerard gillane in the youtube chat just pointed out, and i just realized, that for windows users with the zip some of the file names might be too long, so we'll have to fix that; they might be having an issue. this is what happens when you have people on linux and unix-based systems going "finally, we can just do whatever we want". so sorry about that; if you are having a challenge with the zip, i apologize. if downloading the zip is not working for you, you should be able to go to the repo itself, go to the notebook, save it out, and use it that way. okay, some people are confirming that: yes, on windows, please download the zip to the c drive, drag and drop, and the file path should be short enough for you to import it. oh, i think we just started a windows/linux war here. oh no, i didn't mean to start a windows/macbook war. yeah, there are other zip tools.
thank you everyone for your comments there helping folks out, that's wonderful, those are really good comments. okay, so with that, hopefully at this point you're able to get to this first notebook. once you open it up (if you see it like this you click on it, or right after the import it'll go straight to the notebook), it's already hooked up to your astra database; you don't have to do anything more. so what i'd like to do is bring you through this particular notebook so you can follow along if you wish, and point out how these work. you'll notice there's a lot of markdown embedded in here; each of these cells has markdown or cql, and we do a lot with the markdown to give instructions, so everything you need for the exercises is right there in the notebook itself. i'm not going to read all the text, i'll let you do that on your own, but i do want to point out that you'll see some links like this: a lot of times we'll put solutions in an expandable section like this one, so if you choose, you can look at the solution, or if you want to work at it a little, don't. it's up to you; all the answers you need are always in the notebooks, they might just be in one of those expanded sections. something else i want to point out, and you'll see this big obnoxious red arrow just in this particular notebook: at the beginning of each of these notebooks, the way we set them up, there's what looks like an empty cell. (yes, i'll talk about the schema explorer in a moment, i just saw a dm about that.) you'll see a first cell that looks empty; really all we've done is hide the code. in this first example it's not doing all that much, so you might ask why we'd hide it, but you'll see in the following examples there's a lot more data model going on, and instead of making everybody scroll through it all, we just hid it. just know that when you start the notebook, the very first cell looks empty; go ahead and execute it, and you'll see something like this that says success. that's it, that's all you need to do, and that sets up your data model. oh, what do you see, cedric? no, i just see "windows rules", "windows, windows" in the chat. yeah, with 7-zip or winrar you can manage it. i do see a question asking, can you elaborate on cql? we're going to do cql in these notebooks, yes; it's the next exercise. oh, so we may be a bit late; we took a lot of questions and stretched a little due to the astra issue, so maybe 30-ish minutes over, but hey, we'll have fun. and i'll say this: if you can't stay with us after two o'clock, or whatever that time is in your time zone, if you can't stay past the initial two hours, that's fine; everything is recorded, we're online, all the materials are there, and you can always watch it later. but if you can, please stay with us, we're going to continue through.
yeah, we will give you the solution to the exercise in the last minute. just kidding. and don't forget, we're going to have some fun quizzes that you can win swag for, so that is coming up. all right, so let me go ahead and burn through what's going on in here. really, this datastax studio notebook is just getting you set up with how to use datastax studio, so again i'm not going to read everything that's in here, i'm just going to go through and give you the highlights. as you noticed a moment ago, i was actually using the markdown editor. you'll see that in each of the cells you have this little eye icon for showing the code, so you can decide whether you want that or not. this particular cell is explaining how to edit markdown, and you'll notice if i go to the bottom it says "hello, your name goes here", so it's essentially saying, all right, go down and change the markdown. i'm going to say hello to cedric, since it's weird saying hello to myself, and then go back up. now i'm going to run the cell, or play it if you will, and notice that it now says hello cedric. so whether it's markdown or cql, once i've entered my code i just run the cell with that play button and it executes whatever the statement is. there are all sorts of other operations: if i want to maximize the cell size i can click that to maximize; if i go to the ellipsis here you'll see other options, which i think are pretty self-explanatory, like deleting a cell or moving it. these are really nice for training materials like what we're doing right now, and they're also really nice to share among colleagues: i used to write engineering notebooks, and now i'll use a studio notebook, have all my queries and all my stuff worked out, and if i want to share with the team i just export the notebook and hand over that tar file, just like we did with you; i pass it to someone else and now they have everything in there. it's really nice functionality from that standpoint, and that's what this section is talking about. if i want to add a new cell, you'll see these plus signs in between all of these; all i have to do is click, boom, i add a new cell and can do whatever i want with it. and then if i'm in a cql cell, let me go ahead and bring in a cql cell. yeah, let's start with the cassandra query language, finally. we're finally getting there. notice that my language here is cql, so i can actually do something with cql, and you'll notice that when i do this it gives me this keyspace piece: if it's markdown i'm not interacting with the database, but if i say cql i get this keyspace drop-down. i think we can make it bigger again? i can try; i'm working on the bigger part, but i'm going to run out of screen. so you can see now that i've got this keyspace drop-down where i can choose between the various keyspaces; in my case i want killrvideo, because that's the one we're working with. over here, i'm not going to get into too much detail other than to say you can change your consistency level anytime with studio; by default i'm going to run at local quorum.
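if you want something harmless to try in a first cql cell, a couple of read-only statements like these work in studio or cqlsh and don't depend on the workshop schema:

```
-- read-only sanity checks
SELECT cluster_name, release_version FROM system.local;
DESCRIBE KEYSPACES;
```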
i could change it to one, but what i really want to point out is trace: let's say i want to perform a query and there's some part of it where i'm not exactly sure why it's taking the amount of time it is. i don't think we have any data in these yet. do we? we don't have the user table yet, we do have user credentials; no, actually, if you look at the schema we don't have a lot yet, so let's move forward. you can simply describe the keyspace: describe keyspace killrvideo. this command will show you what's in there, and i think it's good because we will show how to create a keyspace. so, describe keyspace killrvideo shows what's in there, and you can see there is only the statement to create the keyspace, create keyspace with replication, and boom, you see we have a single dc, called caas_dc here, and the replication factor is one. and again, i'm hiding part of it, so let me zoom. okay, let's move forward. all right, i think we can go on now, most people have caught up. cool, excellent. so you'll notice here that in the drop-down there's a trace; once we get into the next notebook it will become more evident why that's useful, but if you do want to trace a query, there's a really nice graphical ui: when you do a trace here it splits it out, you'll see nice duration bars and everything, and it really helps you troubleshoot things. and we were talking about describe keyspaces; somebody else pointed out the schema button here. describe keyspace is really how i'd do this if i were in the cql shell, but here in studio i have the schema button up top, so i can essentially get what i'd get with a describe. now, we don't have any tables in my killrvideo keyspace yet, we haven't created them, but once we do you'll see that populated, and we'll get into that more in the next notebook. yeah, the next notebook will be much more elaborate. okay, i think we can move on from describe keyspaces; it's just describing the tool, and i think everybody has the idea, we already had some thumbs up about it. let me zoom this back down a little, we need that space. so again, you can always go through these on your own: the cool thing is, since you've created your killrvideo keyspace and your database here in astra, this is just going to stay up, it isn't going anywhere after this live session, so you can do this at your own speed and take a look. that just gives you an idea of what's going on with studio. alrighty, so with that, are we good to move on? i think we are good. okay, so now we're going to blast through the read and write path, because we did promise to try to get through things here by 2:30 or so. oh, and jack, my bad, jack, i'm sorry: we haven't actually ended yet, we still have the quiz, and we do have the data modeling piece, so we will get there soon.
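for reference, the describe output that just flashed by looks roughly like this on an astra free-tier database (the data center name and exact options are illustrative and will differ per database):

```
DESCRIBE KEYSPACE killrvideo;

-- illustrative output: a single data center with replication factor 1 on the free tier
-- CREATE KEYSPACE killrvideo
--   WITH replication = {'class': 'NetworkTopologyStrategy', 'caas_dc': '1'}
--   AND durable_writes = true;
```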
all right, so let's get through this read and write path, and make sure you start with the write path itself, because it's the one directly tied to what's happening with those acknowledgments. when we were talking about consistency level before and i said it's whichever replicas acknowledge back to the coordinator first, this is the acknowledgment i'm talking about, and it actually illustrates some of the really cool features cassandra has, some of the things it's doing for you. when you write something to cassandra, it goes to two places, both ram and disk: what's called the memtable and the commit log. notice here, this is a row of data coming in; when i write it to a node, it's written both to memory, in the memtable, and to disk, in the commit log. they have different functions, though. the commit log is append-only, and its sole purpose is to be there in case something happens to the node, so that the data is persistent: if i lose the node or i lose the jvm, i'm going to lose everything in memory, so i need to ensure the data is persisted on disk. that's what it's for. the memtable offers us other functionality: it comes into play when we do reads later, but also if we're using something called a clustering column. we haven't really gotten into this yet; when we get to the data modeling piece you're going to hear me reference clustering columns. clustering columns are used for ordering or for uniqueness, but for this case we're just going to focus on the ordering part. in our example here, our city, houston, is going to be our clustering column: we want to order on the clustering column. let's check out what happens when we get some more data. by the way, i should mention: once you've written to both the memtable and the commit log, that is what sends the acknowledgement back. so when we talked about consistency level earlier, those green check marks you saw, where a node gives the acknowledgement back to the coordinator that it finished the write, this is what we're talking about: it means i've written to the memtable and i've written to the commit log. okay, so let's watch what happens when i get more data. i get another piece of data, notice the color, and watch this: in the commit log, the data was appended to the end, but in the memtable it went to the top of the list. why? because we have a clustering column ordering on the city, and when we use a clustering column the data is automatically stored in its ordered format. since this is a text field it orders alphanumerically, and we'll say that d comes before h. why is this important, why am i telling you this? because when you do ordering in cassandra, your data is being ordered at write time, in memory, where it is fastest, and later on it actually gets stored on disk this way. what's really cool is that i'm pre-computing, pre-baking my order, so when i go to read the data later it's already ordered. if i did this in a relational database, i'd be paying for the order by at the select, paying for it on the read, but in cassandra we optimize this by paying for it on the write, so when i read it later it's as fast as possible. and notice, again, the commit log just appends at the end, because we want that data persisted to disk as fast as possible in case something happens. so check it out: i bring in another piece of data, a different color, it appends at the end of the commit log, and you see it goes into alphabetical order.
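the ordering being described comes from how the clustering column is declared on the table; a minimal sketch with hypothetical names (the workshop has not created this table):

```
-- 'state' is the partition key, 'city' the clustering column used for ordering
CREATE TABLE IF NOT EXISTS killrvideo.cities_by_state (
  state      text,
  city       text,
  population int,
  PRIMARY KEY ((state), city)
) WITH CLUSTERING ORDER BY (city ASC);

-- rows within a partition come back already sorted by city; no ORDER BY is paid
-- for at read time because the order was baked in at write time
SELECT city, population FROM killrvideo.cities_by_state WHERE state = 'TX';
```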
Same thing as I move on to Austin: that last row got appended onto the commit log for persistence, but it was automatically ordered in the memtable. Then what happens when the memtable fills up? It flushes that data down to disk, into something called an SSTable. What I want you to notice is that the data that gets flushed out is ordered; it is actually stored on disk in that ordered fashion. And once I've flushed, I no longer need that memtable data or the commit log, because all my data is now persisted: if something happens to that node and it comes back up, I have my SSTable on disk. You should also note that SSTables are immutable; once they are written, they are not changed. Let's see... by the way, I'm checking whether I can read your DM; if it's short enough I can make it out, and I didn't quite catch that question. I have a lot of DMs, and I will be following up after today, so if I don't get to yours right away, let me know. "Yes, and same here, we have a lot of questions on the YouTube chat, so we will review them; some we cannot answer on the spot, and for those we'll post a full explanation on the community site." Killerwolf is asking: is there a limit to the commit log? I'm assuming you mean a limit on the size or the number of files; I would say disk space is the only limit I know of. Cedric, anything else there? "The commit log files are split based on size, and I believe there is some kind of threshold setting, commitlog-something. And you know what, you don't want the commit log to get very large, because when you lose a node and restart it, the first thing the node does is bootstrap, and it replays the commit log into memory. You do not want that to take forever." All right, so again, this mechanism on write is all done automatically; it isn't something you have to actively do. Just know that you write to both the memtable and the commit log, and when the memtable fills up it flushes to an SSTable, which is immutable, and it flushes in its ordered fashion if you are using a clustering column. Oh, one very important thing: I saw a couple of questions about this. The SSTable is immutable, so what if I want to change one value, say the Lone Star value has changed, how do I change it? "SSTables are append-only, and the last write wins. What does Cassandra do when reading the data? If it has multiple values for the same cell, it compares the timestamps and takes the latest one. But go ahead with the read path." Okay, cool. By the way, I just saw a question asking whether the second notebook, the one working with CQL, was originally part of the two-hour homework. No, but we did have a hiccup in the beginning that ate some time, so maybe we do end up making it part of the homework; that notebook is actually pretty key to go through. "Yeah, it was part of the exercise in the first place, but since we've stretched on time we'll have to make it part of the homework." Thank you for asking that question.
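Before we get to the read path, here is a small sketch of that last-write-wins behavior, reusing the hypothetical cities_by_state table from above. The WRITETIME() function exposes the timestamp Cassandra compares when the same cell exists in the memtable and in one or more SSTables:

-- write the same cell twice; on read, the newer timestamp wins
UPDATE killrvideo.cities_by_state SET population = 2310000
 WHERE state = 'TX' AND city = 'Houston';

UPDATE killrvideo.cities_by_state SET population = 2325000
 WHERE state = 'TX' AND city = 'Houston';

-- returns the latest value plus the microsecond timestamp used to decide
SELECT city, population, writetime(population)
  FROM killrvideo.cities_by_state
 WHERE state = 'TX' AND city = 'Houston';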
Okay, so for the read path. Remember we have the memtable, and eventually we flush data down to our SSTables. On the read path, when you go to read something, Cassandra goes to the memtable first to see if your particular partition is there, and then it goes to any SSTables. Now, as Cedric was just saying a moment ago, you could have data for one partition that is actually stored in many SSTables, for example if you had an initial insert and then later an update to the same partition. So one partition could be spread across the memtable and multiple SSTables. Let's see what this looks like. The memtable part is pretty simple; and again, we're showing this just so you have an idea of what's going on, all of this is done for you under the hood and there is nothing here you need to do manually. Say a request comes in for partition token 58: if the memtable has that partition, it can just read it right out and send it back. In the case of an SSTable, imagine I have this immutable file with all these partitions in it. It would not be very efficient if, every time I went to read a partition out of an SSTable, I had to scan through the whole file to find where that particular partition is. So the SSTable not only holds the partitions themselves, it also records byte offsets for each of those partitions, in a mechanism called the partition index, which is essentially an index of those byte offsets. If a request comes in for partition token 58, Cassandra asks "where is 58?", finds it in the list, sees it is at byte offset 717192, and goes right to that spot in the SSTable without scanning the whole thing. It's all about reducing the I/O time and the number of I/O seeks, to be as efficient and fast as possible. If you think about it, the slowest component we have is disk; SSDs are actually quite fast these days, but especially with spinning disk that is going to be your slowest component, so we want this part to be as efficient as possible. Over time, though, SSTables accumulate a lot of partitions and grow really big, so even the partition index may not be enough. So there is another mechanism called the partition summary, which is essentially an index of the index, and this one lives in memory. Notice what happens: I'm looking for partition token 58 again, but now the summary covers ranges, so I ask, are you in token range 56 to 100? 58 fits in there, so the summary gives me the offset into the partition index for those tokens, I move to the partition index, and from there I go right to my byte offset in the SSTable. It is another optimization in place to reduce the amount of I/O needed to get my data. And then, actually no, I have one more piece, called the key cache, which is also in memory. Imagine you have lots of reads going to the same partition over and over; that partition will eventually bubble its way up into the key cache, and notice it stores the byte offset directly, essentially giving the direct address to where that data lives on disk.
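All of these structures are keyed off the partition token that the partition key hashes to, the "token 58" in the walkthrough. If you're curious what tokens your own rows get, CQL exposes a token() function; a quick sketch against the hypothetical table from earlier:

-- token() shows the hash that determines which node owns a partition and
-- the key used by the partition index / summary / key cache lookups
SELECT token(state), state, city
  FROM killrvideo.cities_by_state
 WHERE state = 'TX';

-- every row in this result shares one token, because they all share
-- the same partition key ('TX')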
So for data that is being read a lot, the key cache eventually comes into play, and you can essentially read that position right out of memory and go straight to where you need to on disk. Again, this is all happening for you automatically. The last one here is the bloom filter. I like to call it the "not there" filter: I could have a single partition that exists in many SSTables, but I don't want to have to look at the indexes of every one of them. The bloom filter answers "is it not there?"; it can tell me definitively when a partition is not in a particular SSTable so Cassandra can just skip it. Why go read a file if you don't need to? So it is another optimization in play to make reads even faster. All of these things come together to make your reads as efficient as possible and to reduce the number of I/O seeks and the I/O time needed to serve them. Cedric, anything else you want to add here? "No, nothing to add, great." All right, now we're going to move into this uber-high-level data modeling piece, and it really is uber high level. I saw a bunch of questions come in earlier like "what are keyspaces, how do they come into play?", and we're going to explain that now. In Cassandra, a keyspace not only contains all of your tables, it is also where you set replication. If you compare to relational, it is equivalent to a database or a schema: the place where you hold all your tables. Within tables you have rows and columns, just like in a relational database; that is not different at all. Logically this looks very similar to what you would expect from MySQL or Postgres or Oracle or whatever. The difference is in the partitions. We've been talking about token ranges all day, and these partitions are the key difference right here: the partition is your base unit of access, and it determines where your data lives across your various nodes, even within a single table. Let's look at a concrete example, I like concrete examples. In this case I have a users keyspace, and it contains a table called users_by_city. There's a naming convention we use: when I define a table, I name it "by" whatever my partition key is, so users_by_city means users partitioned by city. In this case I have two partitions, Phoenix and Seattle. City is the partition key, Phoenix and Seattle are each partitions, and each of those partitions holds all the rows that share that particular partition key; you can see that here I happen to have three rows in each partition. Now, I mentioned clustering columns before; we can use them for ordering or uniqueness. In this case I'm clustering on last name and first name, and since they're text fields, you'll notice in Phoenix that the last names are automatically in alphabetical order (Helsin before Smith), and it's the same with Seattle. Like I said, when it is stored on disk it is actually stored in this ordered format. And if I have more than one clustering column, rows are ordered by the clustering columns in the order I define them; here I've defined last name first, then first name, as in the sketch below.
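Here is roughly what that users_by_city table looks like as CQL. The exact columns are my guess at the slide, so treat it as a sketch rather than the exact workshop schema:

-- city                    -> partition key: decides which node(s) own the rows
-- last_name, first_name   -> clustering columns: order rows inside a partition
-- everything else         -> plain data columns
CREATE TABLE IF NOT EXISTS users.users_by_city (
    city       text,
    last_name  text,
    first_name text,
    address    text,
    email      text,
    PRIMARY KEY ((city), last_name, first_name)
);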
So I'm going to order by last name, and then by first name. In this example there aren't any last names that repeat, but imagine for a moment I had two Helsins, one Kevin and one Andrew: Andrew would come first. Again, it will be stored this way on disk. Anything that is not your partition key or a clustering column is a data column; everything else is just your data. Now let's look at how this works when you have multiple partitions. Tables hold multiple partitions, so if I have a Phoenix partition, a Seattle partition, and a Charlotte partition, then logically, if I did a SELECT * FROM users_by_city, I would see all the data from Phoenix, Seattle, and Charlotte. However, each of those individual partitions may actually live on different nodes in my database. That is the key thing to understand; if you get this one concept, you understand what we mean by "distributed": my partitions are distributed around my database. This is also why you'll hear us say that a SELECT * over everything is kind of an anti-pattern in Cassandra. For this quick example it doesn't matter much, and with a small data set, a couple hundred thousand rows, it's not a big deal; but if you're using Cassandra you're often in the big-data realm, gigabytes, terabytes, petabytes of data, and if you grow to that size and do a SELECT *, you are now selecting all those partitions spread across all those nodes and scanning all of them. So it's really important to understand that the data in your tables is partitioned, and those physical partitions are stored across your nodes. Cedric, I feel like you have something you want to say. "Yes. It's uber high level, and we go into much more detail next week about partitioning and data modeling, but I've already been asked ten times: what is a good partition? So I'll make it very quick. We want to avoid partitions that are too big. What is too big? More than about 100,000 rows, or, by volume, more than about 100 megabytes within a single partition, tends to be too big, because partitions are replicated and that takes time; and within a single cell we tend not to insert values bigger than about 10 megabytes. On the other hand, a partition with a single record, a unique identifier as the partition key, is totally fine, as long as when you request data you don't need to hit a lot of partitions. A SELECT * from a table is a full scan of your cluster, and you don't want to do that; apart from that, the partition key should be your WHERE clause, or something like your GROUP BY. But I will dig more into the details next week." Yeah, we'll get into a lot more detail about the data modeling in the following week, and you can see the shape of query this is built for below.
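To make that concrete, here is the kind of query the users_by_city sketch above is designed for, next to the full-cluster scan you want to avoid. Again, the table and values are illustrative:

-- good: the partition key is in the WHERE clause, so the coordinator only
-- talks to the replicas that own the 'Phoenix' partition
SELECT last_name, first_name, email
  FROM users.users_by_city
 WHERE city = 'Phoenix';

-- anti-pattern on big tables: no partition key, so every partition on
-- every node has to be scanned
SELECT * FROM users.users_by_city;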
Then the last thing, since we're getting close to wrapping up the material: you didn't have to do this in Astra because it was done for you, but if you're using your own Cassandra database, you create keyspaces like this. I mentioned that keyspaces not only hold your tables, they also set your replication. Here I'm creating a keyspace called users, with replication, and there are two things I want to point out. First, the replication strategy: NetworkTopologyStrategy is the standard, and it is what you should use. You'll find that when people create keyspaces the first time they often use SimpleStrategy, because that errs on the side of a single node or a basic setup just to get going. Just know that if you're doing anything beyond a quick database on your laptop or a test system, you should be using NetworkTopologyStrategy; if you're not, and you then add multiple data centers, it is not going to work very well. So that's the standard: you may start with SimpleStrategy, and you change it if you're in a real production system. Second, notice the replication factor right there: that number sets the replication factor, that's it. If I change that number, my replication factor changes. As a matter of fact, you see datacenter1 there, that's my data center; I could put a comma there, add datacenter2 with another 3, and now I'm automatically replicating to two data centers. It is automatic, Cassandra just does this for you, and that is super cool. You can also go back the other way: if you remove a data center, you alter the keyspace, you alter your replication strategy and replication factor, and that's it, you can go either way. It is super powerful that all that replication falls out of this one setting. And if you're creating a table, coming from the relational world, it looks very, very similar on purpose: a CREATE TABLE, then keyspace dot table name, your column definitions, and then the primary key. Cedric, do you have something you want to add? "No, that's exactly what we cover next week, so let's just move past it." Yeah, we'll get into a lot more detail about all of this next week.
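Put together, the keyspace DDL from this section looks roughly like the following; the keyspace name, data center labels, and replication factors are illustrative, so check them against your own cluster (or your Astra database, where this is managed for you) before using them:

-- one keyspace, replicated 3 ways in each of two data centers
CREATE KEYSPACE IF NOT EXISTS users
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': '3',
    'datacenter2': '3'
  };

-- adding or removing a data center is just an ALTER of the same map
ALTER KEYSPACE users
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': '3'
  };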
All right, I'll go ahead and wrap it up, because I know we're running long. Since we ended up taking more time than expected, we're going to give the "Working with CQL" notebook as homework; honestly it should take you 10 to 15 minutes, it's not that big. It was part of the work anyway, but now you'll do it on your own and import it the same way we did with notebook number two: drag and drop. "Yes, that's right, you just drag and drop it right into your Studio and you're good to go." You'll notice there are some bonus sections in there as well; some of you were asking about the CQL language itself, and this is where you start going through the basic commands of how you do things in Cassandra using CQL. So here's the homework, and we'll wrap this up. Like we said, finish the notebooks if you're not done, and do that next notebook. These links, by the way, should be in the follow-up email you'll get from Jack; we also posted all of this in the show notes of the YouTube video, and they're in the slides, so you can get to them everywhere. Please also provide feedback to our team using the form here. If you haven't already, go create an account on Community and an account on Academy; again, it's all free, there's no paywall, none of that going on here. Community is where we answer long-form questions, and not only that, it's a really great way to interact with other folks, not just DataStax folks, and there are lots of experts there who can help answer things. Academy is where all the coursework lives, including the link for DS201. I recommend everyone do this, but if you're really looking to take the exam eventually, you have to: you will not pass the exam if you do not go through DS201. So go through DS201, watch the videos and take the quizzes; again, we'll provide the links. And then this last part, the exercise of the week: this is that validation form I was talking about before that just lets us know you were able to get through the material and do some of the homework. If you complete those throughout the whole series and get them all right, we're going to have some really fun swag for you... for the top 10, or, sorry, there's no "top" in this one. "Yeah, we'll randomize the winners, I think." And we do provide the links for the Academy material; again, you'll get these in the materials we send you, and they're also in the GitHub repo. I've talked about all these pieces, and if you didn't catch it, the last one there is the Cassandra Workshop Series GitHub link. You can always reach out to us on Discord, on LinkedIn, wherever, if you have any follow-up questions. Again, I've provided the DS201 link in multiple places; DS201 is the one you need to take for homework. And then finally, next week we get more into data modeling, we follow up after that with the application development material, and in the final weeks we get into administration, performance testing and benchmarking, and then deploying with Kubernetes. "Yes. And with that, 2:35, not bad, we wrap it up. Tomorrow we do the same session for the APAC region; it's late for us, I'm based in Paris and it's getting late, and next week is a new topic, but even tomorrow I'll be there answering questions, it just won't be me on screen. Until then, see you next time." Yeah, and thank you everybody for sticking with us for an extra half hour and going through those hiccups; we'll have that taken care of next week, and since most of us are authenticated with our Astra databases now, we shouldn't hit that at all. So again, thank you very much for coming, we'll see you next week, and thank you, Cedric, for co-hosting with me today. "Always a great pleasure. Take care everybody, bye."
Info
Channel: DataStax Developers
Views: 23,143
Rating: 4.943038 out of 5
Keywords: programming, cassandra, cloud, beginner
Id: y4Gt_LQ8sdA
Length: 132min 41sec (7961 seconds)
Published: Wed Jul 01 2020