#2 Data Modelling in Apache Cassandra™

Captions
hello everyone! Hey Patrick, I think we are live. I think we are live too. How are you doing, Alex? Yeah, very well. Well, not the best weather today, but it's amazing to see how many people we have in the stream, so you know what, I don't care about the weather, we're all inside anyway. And that's the best thing about being a nerd: most of what we need to do is inside a room with air conditioning. Oh yeah, makes sense. What's the temperature on your side of the planet? I'll make everyone angry: it's beautiful, sunny, and about 20 degrees Celsius right now. Okay, a special thank you for Celsius, because I still don't understand Fahrenheit. I never got it either. I don't know why Fahrenheit even exists. Why does water freeze at 32 degrees and boil at 212? Yeah, I just like that water freezes at zero and boils at a hundred. Easy. Yeah, really easy. Okay, all right, now that we've closed that topic, let's go. So take a look, everyone, at the screen. What you see here is kind of a stadium with people watching tennis, but I don't care about the tennis today, because the amount of people you are seeing on your screen is the approximate amount of registrations for this course, because everyone wants to be successful with cloud-native applications. So we have over 10,000.
Yeah, and I actually recognize some of those people if I look in the stands. I've seen some of you at meetups, at conferences, at Accelerate and Cassandra Summit, so it's great to see some familiar faces in that crowd. Yeah, definitely. When did we take that picture? Well, that's just visualizing crowd sizes. Before, we did a lot of offline events where you travel to a city and gather people together; a meetup-size event is like 50 people, sometimes up to 100, but usually around 50, and we were happy because there is so much live communication. As soon as we went live online, the numbers just started to explode, and now for this workshop series we have more than ten thousand. I still cannot believe it. Well, I mean, this is how it goes, right? Everyone's sheltering in place, which is good (stay home, stay safe), and it doesn't mean you have to stop learning, and there are lots of ways to do it. Of course we're live on YouTube and on Twitch, and we have our Discord; as a matter of fact, I love the Discord already. Looking over there, people are talking about Fahrenheit and Celsius. All right, that's an engaged audience. But I mean, this is easy enough to do, and you don't get a diminished experience; that's what we're trying to do. You're going to learn about Cassandra, you're going to learn about Kubernetes, you're going to learn about microservices, and you can do all this online. I watch a lot of YouTube and a lot of Coursera, so why not? Yeah. However big a surprise this whole pandemic was, it's still a great time to take time for yourself and learn something new. Okay, so what are we learning today? Oh well, you know what, yes, we will learn a lot of great things today, but what I want to start with is from whom we're learning today, and we have a really incredible team: the DataStax developer advocacy special unit doing this workshop series for you. It's a great team, and I'm so happy to work in this team of many people. On the screen for you: Patrick McFadin and me, Alex Volochnev, developer advocates at DataStax. But we also wouldn't be so strong without the crew covering us in Discord and on the YouTube channel, and all the people running the Discord. So that's the most important thing: whom you are working with and who is preparing content for you. And now, what are we going to learn today? Well, you asked a really good question. We have an incredible set of things to discuss. This week is the second week of this workshop series. I hope you attended the first week; if you didn't, no worries, everything is recorded and available for you, including exercises, all the steps, everything except the quiz. Well, sorry, sometimes you have to be on time. But the main thing, the content, is available. Last week we were talking about the very fundamentals of Cassandra: what it is, why it's great, how it works and so on. And today is actually a very special moment for me, because it's one of my favorite topics: data modeling for Apache Cassandra, or as I prefer to call it, creating an efficient data model for highly loaded applications. You may use the best database in the world (well, for most workloads Cassandra is the best one), but if your data model is bad, you can ruin everything Cassandra gives you. You have the highest efficiency; you may ruin it with a data model done wrong. You have the highest availability; you can ruin it with a data model done wrong. So we talk about how to be successful with Cassandra; that's the main topic for today, Patrick. Oh, that's a good topic, and you know that's the one I love too. Your application starts with a great data model, and if you're using something like Astra, it's pretty much all you're going to work on, because you don't have to work on the actual database anymore. You just click a button and get
yourself a new database up and running. And that's actually what we're using today too, Alex: we're using Astra like we did last week. So if you didn't catch last week, that's fine, you can go back and watch the video. We record everything; everything's on YouTube, of course, like everything else in the world, so you can catch up from last week if you're watching this late. Let's say you're watching the recorded version: go ahead and catch up. We have six more weeks of content coming, and it's all really good. We're walking you from the beginning up to the actual deployment of a cloud-native application, and this is an amazing skill. And by the way, you get a certification at the end as well. Yep, so we teach you for free, and you have a chance to get your certificate for free; this usually costs around 400 dollars, I believe. Okay, so how does it work? Every week you have to choose which session is better for you from the time zone point of view: one is running right now, and one we will do again on Tuesday, which for me will be Tuesday morning. So choose the one that matches your time zone; we have basically covered the whole globe, which is great. During the workshop you have to use some tools, and you may need some places to visit. The obvious place to be is youtube.com/datastaxdevs, where I hope you are exactly right now, or maybe you are watching us on our backup channel on Twitch. We also stream on Twitch, but the main channel is YouTube. Then, the main place to ask questions is our Discord server, bit.ly/cassandra-workshop, and that's the best place to ask questions. Can you ask questions on YouTube? Well, yes you can, but as soon as the workshop is over your questions will be gone, and Discord is a much better place, with much better coverage and the ability to always discuss things. To run the quiz we will use Menti, and that's going to happen very soon for the first warm-up. Then, to do all the exercises we use Astra. Astra is a great thing: DataStax Astra is, as Patrick said, Cassandra-as-a-service. The only thing you have to worry about is your data model, which we are taking care of today, so you don't have to worry about anything anymore; DataStax takes care of the management of the cluster, and trust me, there is a lot of work to do there. For the materials, everything is at our github.com/DataStax-Academy/cassandra-workshop-series repository; I hope you have been there already, and if not, you should be. Starting next week, with Cédric Lunven, you will do a lot of coding, so be prepared to do a lot of coding work with Gitpod. Still, you don't have to install anything; everything is in the cloud. After the workshop you have to take care of the homework, and that's an important step. Don't take your homework, or home assignments, too lightly: they matter if you want to not only get the certificate but really be efficient and successful with Cassandra. You must do your homework; take it seriously. For the homework you may need our forum, the place to communicate and discuss questions with the community: community.datastax.com. You will need to take some trainings at academy.datastax.com. To chat you can use Discord, as shown on the screen, and you will have to validate your presence at week two to get your voucher, with a Google Form we are sending to the chat at the end of this workshop. And after that: relax, you are already great. Today, let's get to today's topic. First, we have to discuss some things about Cassandra which may not be typical for other databases, so you may need to learn about keyspaces, partitions and so on. The second part, and the main part, is the art of data modeling: how do you develop a good data model? Then we discuss data types, basic and advanced, and then what comes next. And before we start (we are speaking over time, Patrick), let's let people answer our
questions, as we are answering questions all the time. Let's go to Menti. The Menti code for today is 89 49 47, I repeat, 89 49 47. And remember, this is a competition. We're going to start with just a quick question to warm everybody up and make sure you're connected, but this will be a competition at some point, and we will be watching who's doing what. Yeah, so the fast people, the people answering questions the right way, will get some stuff. Okay, and it will be about what we're learning today, so it'll be relevant; that's how we're going to try to keep you engaged here. Yeah, follow along, this could be a fun contest. Exactly. So, I already see more than 100 people on Menti; don't be lazy, join now, we have a lot of guests today, and join to be ready for the competition. But for now, just simple questions to warm up. We give you 20 more seconds, only 20 seconds, and we start. All right, well, this is when it gets real: we actually have interaction other than chat. Although, if you're doing the YouTube chat, that's fine, we do have people there, but really we prefer you go over to Discord. Discord makes it a lot easier: if you have a one-on-one question, you can split people off into a different room, which makes it a lot more interesting if you have something more in-depth to talk about. We have a lot of folks in there, so Discord is a really good place for you to be, and Discord stays online. Right now we have 600 people active in the room; I've seen it a lot higher. I would also encourage you to help each other out: if you're not new to the community, this is a great place to participate, and if you see a question in there that you know the answer to, feel free, answer that question. Yep. I see people are asking what's the deadline for the homework, and you know what, we really want it to be a self-paced course. We don't know your situation, how busy you are and what's going on around you, so take your time. We don't close assignments, so the forms stay available. If you can do it on time during the same week, that's great, but if you cannot do it for any reason, we don't want to block you, seriously. Take your time and submit the homework when you are ready, but don't wait too long, okay? Okay, so I see most of the people are logged in, so let's start with the questions. Okay, Menti, get back to me... That's not the first time, you see: they are scared of our numbers. Every time we have more than a hundred or two, Menti starts to get stuck. I feel like David had this problem before too. Yeah, not only David. You have to do a hard refresh. Yeah, I think so, I will do a hard refresh for this one. Yep, see, this is why we have friends that sit behind the scenes and Slack us like, hey, do a hard refresh. Exactly. You think it's just me and Alex? We've got a whole bunch of people working with us right now; we're just the only two people on camera. So, it's loading... all right. Okay, is it that good German bandwidth that we're looking at? Yeah, I have pretty good bandwidth here; I bet it's on the Menti side, because my connection with YouTube is perfect. You see, I'm streaming and YouTube is absolutely happy with the connection. Okay, I will try it again, just a second. I love this part of the show. Yeah, well, it's a live show. It's live, you know, you do it live, this is what happens. So I'm going to go over to the chat and see what's happening over there real quick. Yeah, so ask your questions meanwhile. It's always funny: whenever these things happen, everyone asks, is Menti using Cassandra? No! [Laughter] And you know, Menti is funny because I really like it, and we've seen it take thousands and thousands of users, but sometimes it just has these weird moments. You know, they're a startup, we'll go talk to them. But there's a lot of lag on it, so okay, we
may have to move on. I will try to start it on the next screen, so give me a second; maybe this one will work. Oh, it looks like this one works. Perfect. So: keyspaces, tables, partitions and a little bit more, and why that's important. I will say only a few very tiny words about the infrastructure, but you have to understand how the data model fits the infrastructure, as we are going to speak a little bit about keyspaces. At the bottom of every system there is a simple server doing its job. It can be a single bare-metal server, a virtual instance or a Docker container. Well, as long as you use Astra you don't have to care about that, but if you are running your own cluster you'd better understand things like that. Then, a data center is a group of nodes, and in theory they are located in the same physical location, or a cloud data center, or an availability zone, depending on your setup. And finally, a cluster is a group of data centers. Well, it may consist of only one data center, but you definitely need at least one to run Cassandra; so it's a group of data centers configured to work together. Okay, but we are speaking about the data model, so let's get closer to the data. The smallest unit you can access, in general, in table-based databases is a cell: a cell is an intersection of a row and a column, storing data. When we speak of a row, a row is a single structured data item in a table, having some properties stored. At some point you may ask me: Alex, why do you say these basic things? I know what a row is, I know what a cell is, I even know what a table is. Well, yes and no, because the Cassandra data structure is a bit different from relational databases. In a relational database, when you want to access some data, you specify the database, you specify the table, and then you can specify a row, and that's all you think about: select username from users where last_name is something. In Cassandra it's different, because Cassandra is designed to work with the most tremendous, huge amounts of data you may imagine, while writing and reading that data within milliseconds, distributed all over the world. If you think it's a simple job, try to do it yourself. In relational databases, when you have too much data to handle on a single machine, you have to go sharding, and some of you know it's hard. I did sharding, it's really painful; if you did it too, you know how many problems it brings, and if you didn't, God bless you, I hope you never do it in your life. In Cassandra you don't have to take care of sharding, because there is no sharding: all the data is partitioned from the very beginning. So what's a partition? A partition is a part of a table: a group of rows united by having the same value in a particular column or columns, and they are stored together. That's very important to understand. If we have a table of employees, the people from the same department will form a single partition (I explain it a few steps later), and they will be stored together. From one point of view that's a very important, positive feature; from another point of view you really have to be careful at some points, but don't worry, we will make you efficient with Cassandra in the next slides. And finally, a table: well, I bet you are familiar with tables, it's a group of columns and rows storing partitions. Then a keyspace is a group of tables having the same replication strategy, replication factor and some other properties we will not cover today, because that's more about operations and maintenance. Watching it from above: we have a keyspace, which is a container for tables; tables are containers for partitions. Getting closer to a real-life example: if we look at the table users_by_city, we see that city is the partition key column for this table, and we have some clustering columns and data columns, and I will explain them in a minute.
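The hierarchy just described (keyspace, table, partition, row) can be sketched in CQL; the exact column names here are assumed from the users_by_city example on the slides, so treat this as an illustration rather than the workshop's literal schema:

```sql
-- Every row sharing the same city value belongs to the same
-- partition and is stored together on the same replica nodes.
CREATE TABLE users_by_city (
    city       text,
    last_name  text,
    first_name text,
    address    text,
    email      text,
    PRIMARY KEY ((city), last_name, first_name, email)
);
```

Here `city`, in its own parentheses, is the partition key, while `last_name`, `first_name` and `email` are clustering columns.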
But you know what, I believe all of you, or at least most of you, are very technical people, ladies and gentlemen, so let's get closer to the code. Talk is cheap, show me the code. The code in this case is very simple. If you want to work with the database and store some data in it, you have to create a keyspace first; the keyspace defines the storage properties for the tables within it. So here, in CQL (the Cassandra Query Language, which looks very familiar to SQL but is not the same), I am defining a keyspace called users, with the replication strategy NetworkTopologyStrategy (it was explained last week, so I will not go deeper) and a replication factor. In this example, for the us-west-1 data center we will have three replicas, and for europe-central-1 (let's say it's Frankfurt, next to me) we will also have replication factor three. Notice that you may have different replication factors for different data centers. When we speak about creating a table, the CQL syntax should be very familiar to you, because it almost completely matches traditional relational SQL CREATE TABLE queries, except for the last line, which is really, really important; we discuss it very thoroughly on the next slide. Still, when creating a table we have to define the keyspace it will be allocated to, the table name (well, it would be hard to access a table without knowing the table's name), some fields and the types of those fields, and then the primary key. In our case, as a primary key we define city, last_name, first_name and email, and city will be the partition key. Okay, what does primary key mean? That's an important thing. The main duty of the primary key is to ensure uniqueness of the row, and it may also define sorting. If you imagine this example without email in the primary key, what would happen? Very simple: with city, last_name, first_name, every second John Smith in New York would be very unhappy, having his data overwritten. So the first duty of a primary key is to ensure uniqueness. And what does it consist of? It consists of a partition key, which is absolutely required (you cannot define a table without a partition key), and clustering columns. Clustering columns are needed to define uniqueness and sorting order; I explain it on the next slide. A good example, well, not perfect but acceptable, of a primary key: city, last_name, first_name, email. We expect every person to have a unique email, so we are fine with that. A primary key consisting only of a user_id? Okay, that's also fine: it will be a users table, and the primary key will be equal to the partition key. That works as well. What would be a bad example? As I've shown you: city, last_name, first_name. Some people may be very disappointed. Partition key: now it's getting more interesting. As we discussed, we have to be ready to work with any amount of data, even very big amounts. And what does big data mean? There are a lot of different definitions, most of them complex and not really explaining anything. My favorite definition of big data is: data which doesn't fit on a single server. It means we have to partition, and the partition key is the field whose value identifies the partition, with the token created based on it; I'll show it in a moment. And finally, clustering columns are needed to ensure uniqueness and sorting order. What does that mean? First, uniqueness: users_by_city would be a very bad table if my primary key consisted only of the city, because it would mean I may have only one user per city. When I add last_name and first_name it gets a little better, but it's still bad, because I may have more than one John Smith in a city, or whatever the most popular name in your country is, Ivan Ivanov or whoever. The example below, city, last_name, first_name, email, is much better. But there is a second duty of clustering columns, which comes in very handy
sometimes. Primary key: video_id, comment_id. What happens here? Take a look: we have YouTube or KillrVideo, and there are a lot of comments for every video. If we have a table like that, with video_id as the partition key and comment_id as a clustering column, and comment_id is just a normal unique ID, I will get those comments unsorted. Imagine you are opening a YouTube video page: you have all the comments (well, at least like it was before), most recent at the top and oldest at the bottom. Good. So when I fetch comments like that, I may get them unsorted. Maybe you remember from the first week: everything is pre-sorted, so you don't have to take care of the sorting if your data model is right. Take a look at the last, fourth primary key: video_id, created_at, comment_id. I still need to have comment_id in the clustering columns to ensure uniqueness: two different comments may be written at the same time, so the second would overwrite the first, and I don't want that. So I add created_at, and I need it to establish the sorting order and to have the latest comments first. And here comes the very important point: how exactly does a partition come to be? How do we create a partition, or better to say, how does Cassandra create a new partition, or find the partition to add something to or read something from? We are working with this... Yeah, real quick, I think you're going to have to get rid of my head, because I'm cutting off slides, man. Oh no, no, you are cutting them off only a little. You know what, I am a magician, I can make you a little bit smaller, if you don't mind. Yeah, I'll shrink you, sorry, and I will shrink myself too. Yeah, shrink yourself a little bit. As wonderful as it is to see your face, I'm sure the slides are more important. Well, you know, I'm not a top model, I don't take care about my face too much; I can do it in a mask, no one cares. But look, it's the replication factor... yeah... oh, the other way... the replication factor, there we go. Yes. Okay, now I see it, all right, as you were. Good, thank you so much. So we are working with this example, users_by_city, with city, last_name, first_name, address, email, and the primary key we discussed before. Now, it's important to understand that every node is responsible for a range of tokens. Let's imagine a very simple example: we have 10 nodes and one hundred tokens. The first node will be responsible for tokens from 0 to 9, the second node for tokens from 10 to 19, and so on, so all 100 tokens will be covered. Obviously the real numbers are much, much bigger, but the idea is the same: we have really a lot of tokens, and the ranges are split over however many nodes we have. With replication factor one, each range belongs to one node; with replication factor three, every node will be responsible for three ranges out of this list. Now you're inserting a row, and you obviously must specify a partition key; it cannot be null, you always must specify a partition key. What happens next? We hash this value using the Murmur3 hasher, and the result is the token to use. For example, if we add a user from Seattle, using this hasher we get the token 2466717130, and the token for this partition will be that number (I don't want to read it a second time, sorry). Therefore, all the users from the same city will have the same partition token and will belong to the same partition. So they go to the proper node or nodes, depending on your replication factor, and the replica nodes responsible for this range will store the data. Therefore, when you read the data, you must specify the partition key. Well, you can try to avoid that, but we do not recommend allowing a full cluster scan. And when it works... I've seen a lot of questions at the last workshop
like: how does the driver know which node to ask? Or maybe the driver doesn't know which node to ask, asks a random node, and then this random node asks the next random node, and if you have 10,000 nodes you have to ask every one of them? Of course not. Every Cassandra node is very smart, and every Cassandra driver is also very smart. When it starts to work, it loads the metadata about the schema: where every partition is stored, which node is responsible for which tokens. Then, when you do a select, your driver calculates the token in exactly the same way: it uses this same Murmur3 hasher to calculate the value, knows which node to ask, and goes directly to that node. Good. Now, the next slide received my "slide of the year" award. I gave it that because I do believe this slide and the next few slides are the real red line between what makes you successful with Cassandra or not. If you know the rules of good partitioning, you will be successful; if you don't, your data model cannot be good, and under every next high load you will get some real trouble. So let's cover the rules of good partitioning: how do you define partitions so they will be efficient? We have to cover three main rules. The first rule is: store together what you retrieve together. Don't forget, all the rows within the same partition are stored together, very close to each other, within the same file or the same block in memory, on the same node. So when you ask for a single partition, and everything you need is in that single partition, you just get this partition immediately, and it works extremely fast, lightning fast. But if you have to fetch data from multiple nodes, from multiple different partitions, that's obviously going to be a slower process; if you have some hundreds of servers, you basically have to ask a lot of them, and that's not the best way. So we try to store together what we retrieve together. In the example I told you before: if you open a video page at YouTube or KillrVideo, you get the list of comments immediately. In the first example, the partition key is video_id and the primary key is video_id, created_at and comment_id, and that's a very good example of "store together what you retrieve together", because the partition will be based on the identifier of the video. And therefore... Yeah, sure, I think this is a good time to insert a question. I saw this in chat, and the question was: how does Cassandra make sure we have a unique partition, a unique key value? You mentioned this, but I want to make sure it's really in everyone's head. When you create a partition key, it will be hashed using the Murmur3 hasher; we used to use MD5, and now we use Murmur3. Murmur3 is (thank you for going back to that slide) a consistent hashing algorithm, meaning that if you put the same string in, you'll always get the same 64-bit number out of it. How that number is created is actually the feature of Murmur3: it is so randomized that it's pretty close to a cryptographic algorithm, though it's not suitable for crypto. Murmur3 will take anything and make it unique, and that 64-bit space is so big that there are effectively no collisions inside of it. Now, sure, over a long enough time there could be a collision, but not in any of our lifetimes, and not before you retire, so don't worry about that. I want to make sure that's understood: that's what's built into Cassandra, making sure that whenever the data is placed on the ring inside the cluster, it has a unique spot and is grouped by node. Thank you. All right, good. So, finally, as I open a video page, I get all the comments of this video together, because they will
be located in the same partition, as they all belong to the same video; their partition token will be the same, and they are stored together. Very simple. Now: avoid big partitions. Take a look: Cassandra may look simple to a developer, but inside there is a lot of work, and as your partitions get bigger and bigger you may run into a trap: when a partition is too big, compaction cannot handle the SSTable files, and some nodes may get stuck completely. So take a look, there is a very good example explaining that. You are selling some goods in some countries, in Europe for example; at some point you go to China, and you somehow magically become extremely successful there: everyone in China wants to buy something of yours. Congratulations! But if you have used a partition key like that, with country, now you have a big problem, because the other partitions will be fine, but this single partition will be too big, and it will be hard for a Cassandra node to take care of things like that. Then your system administrators or database administrators will be unhappy: they will come, make some noise and call you bad names, and no one wants that. So watch your partitions, don't let them get too big. Technically there are no hard physical limitations, because it depends on your hardware, but in general we recommend approximately up to 100K rows in a partition and up to 100 megabytes in a partition; it's more about the size than about the rows. In the first example we see video_id for the comments of a video. This is acceptable, because although some videos may be too big (I mean the amounts of comments), the most popular videos on YouTube at this point have around 5 million comments per video, like Gangnam Style or something like that. 5 million comments is obviously more than 100,000 rows in a partition, but it's acceptable because all of them are small: a usual comment is just video_id, comment_id, created_at, user, text, so there is not so much information, and the rows aren't big. Good. Now, I told you to avoid big partitions, but there can be something different. Take a look at the example: a huge Internet of Things infrastructure, hardware all over the world, different sensors reporting their state every 10 seconds, and there are really a lot of them. Every sensor reports its ID, the timestamp of the data (when this happened) and the value. Let's say sensor_id is a universally unique ID, timestamp is a timestamp, and value is just a float, or maybe an integer; in our case it's not important. In the beginning, as the developer of the data model for this system, I decide to go with a primary key of sensor_id and reported_at, and my partition key will be sensor_id, because, well, rule number one: store together what you retrieve together. In the beginning it works pretty well and everyone is happy, but after some time I get into trouble: something is going wrong. Try to think a little about what can go wrong with this partitioning. Patrick, how much time do we have to give them? Well, let's see, we're at 10 minutes before the hour, so we have about another hour and 10 minutes before the end of our workshop. What slide are we at right here? We have a lot of slides, so... okay, I hope you got enough time to find the problem and identify the issue, but if you didn't, then watch what's happening on the screen. Let's go. I told you to avoid big partitions, but that's not the whole truth: avoid big and constantly growing partitions. In the beginning it was pretty fine, but as every sensor reports its state every 10 seconds, constantly, the partitions which were perfectly fine in the beginning keep growing and growing and growing, and boom, we have the cluster paralyzed, because all the partitions are too big. So we need something better; using this sensor_id alone seems to be obviously not the best way
to handle it is there a good way to solve it answer is yes answer is bucketing bucketing hey well that's a big enough word did everyone catch that yeah suck it in you should probably explain what bucketing actually is yeah sure bucketing packeting i hope you got it so what's the bucketing means our partition key may be not a single column but a composite column or multiple columns value so you may specify in the partition key not only like country but also a city and then this composite key will be calculated using this uh will be fetched and therefore value will be different because we are using not only single column value but two different columns values and in our case we may use bucketing so use composite partition keys to have multiple partitions per sensor and i again give you like 5-10 seconds we don't have more to think what you would think to use out of what we have to do buckets and to separate sensor reports still having some reasonable good partitions five seconds five boom so let's get to the answer answer is one of the typical approaches for the bucketing is to use some scientifical field which needed only for bucketing base it on the timestamp if you thought like maybe a thumb stamp could be a good partition key well yes and no it would solve our problem of two big partitions but meanwhile when querying uh database we always have to specify a full partition key and therefore all the columns of the partition key you definitely don't want to try to select data from your database specifying all the possible dates especially considering inequalities for partition keys are not supported so what happens in this case this month here uh new uh new volume it happens out of timestamp and it can be integer or string that doesn't matter basically today take a look that's july of year 2012. 
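As a minimal sketch of that month_year calculation (the function name and the "MMYYYY" layout are my own choices — any stable, queryable encoding of the month works):

```python
from datetime import datetime

def month_year_bucket(ts: datetime) -> str:
    """Derive the synthetic bucket value ("MMYYYY") from a report timestamp.

    The bucket exists only to split one sensor's data across partitions;
    it carries no business meaning of its own.
    """
    return ts.strftime("%m%Y")

# July 2020 lands in bucket "072020"; once August starts, new rows go to
# "082020" and the July partition is never written to again.
bucket = month_year_bucket(datetime(2020, 7, 8))
```

Because the bucket is derived from a timestamp the device already sends, there is nothing extra to invent or keep in sync.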
So the month_year value in this case will be 072020, and it's an absolutely synthetic field — you don't want to use it for anything except bucketing — but it creates great buckets for our data. It's easy to calculate, easy to use when storing and retrieving, and if you need the data for the last three months, you can still get it within a single select statement by specifying explicitly which partitions you need; we can calculate the partitions with no issue. And Alex — I always recommend people use data that they already have and not make up new data. If you have a timestamp in the IoT stream — the device hands you its timestamp along with the data — use the data inside that timestamp. That way you don't have to invent anything, there are no extra calculations to be done, and it's synthetic in the sense that whenever you run your query, it matches exactly what you're looking for. Yep, and it's very easy to use. So what happens with a composite partition key like that? It consists of the sensor_id, which is always the same for a given sensor, plus the month_year. As soon as the month is over and the next month is August, the value becomes 082020 — the old bucket is full and will never be written again. Great. The last thing we want to avoid: hot partitions — or maybe better to say, uneven distribution of data over your partitions. If you see that one of your partitions is, I don't know, 100 megabytes and your three other partitions are small — imagine you have only four partitions, worst-case situation — it usually means that one partition is being written or read all the time while the other partitions are idling. You may say that's not such a big problem, but it is. Why? Let me go back a few slides: all the partitions are spread over the known set of nodes in your data center.
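A rough CQL sketch of the bucketed design just described (the table and column names are my own, not from the workshop slides):

```cql
-- Composite partition key: every sensor gets a fresh partition each month.
CREATE TABLE IF NOT EXISTS sensor_reports (
    sensor_id   uuid,
    month_year  text,       -- synthetic bucket, e.g. '072020'
    reported_at timestamp,
    value       float,
    PRIMARY KEY ((sensor_id, month_year), reported_at)
) WITH CLUSTERING ORDER BY (reported_at DESC);

-- Last three months in one statement: the partitions are listed
-- explicitly, because inequalities on partition key columns are
-- not supported.
SELECT * FROM sensor_reports
WHERE sensor_id = 99051fe9-6a9c-46c2-b949-38ef78858dd0
  AND month_year IN ('052020', '062020', '072020');
```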
When one partition is very busy and the others aren't, some of your servers — as many as the replication factor — are very busy storing and serving data while the rest idle. You can imagine having 100 servers with the partitioning done wrong: you may add another 100 powerful servers, you can add 1,000 powerful servers, and it will not help, because your hot partition is still served by the same replicas. That's why I call this the slide-of-the-year award: if you do it wrong, it will work wrong, and that's why data modeling with Cassandra is so important. So in our example here we have three cases. The first one is marked as good, and it's easy to see why: the primary key consists only of a user_id, every user has a unique id, and therefore we will have as many partitions as we have users — which is perfectly fine; you may have any number of partitions, that's not a problem. In the second, video_id is the partition key, and created_at and comment_id are clustering columns to ensure uniqueness. Is it good or not? Well, it will work. As I said before, the biggest known number of comments on a single YouTube video is around five million right now, and comments are small data, so it fits. But if anything changes — if humanity and its infrastructure develop in some weird way so that every one of your Internet of Things units, your Alexa or your mobile phone, goes online and writes its own comments — then boom, those partitions are immediately too big. So if you get a surprisingly large number of comments, it's no longer such a good model, and we may need bucketing, as discussed before. And the last example: a primary key of country and user_id, partitioned by country. That's the bad example. For some countries you will have hot partitions, and for others — you know, I live next to Belgium, a pretty small country, and there are even smaller ones — the partitions will obviously be idling. Good. And the last point: always specify the partition key. If there is no partition key in a query, which node do you ask? As long as you have a cluster of three, five, six nodes, it may not seem so bad — well, it's still bad, a very bad practice — but with no partition key, which node do you ask? This one, this one, or maybe that one? If you don't know the partition key, you don't know which node to ask, and then you end up asking every node responsible for data in that keyspace. As you can guess, that's a very bad pattern — an anti-pattern — and you should avoid it at all costs, at all times. So here's how these queries should look, working with the same users_by_city table. If you just select address from users_by_city specifying only a first name, we don't know which node to ask. There is a technique that lets you run the query anyway, but I will not tell you what it is, because you'd better not do it — you may find it on your own. And we can select address from the table specifying the partition key, plus any of the clustering columns if you want to. Good — now, that was a big section. Let's take a breath here. I'm going to add a little bit to that: you just learned what is probably the most important part of Cassandra data modeling — how to create a primary key. Everything else in a data model is just along for the ride. The partition key, the clustering columns, understanding how those work — it's not going to be intuitive right away, because it's not exactly the same as a relational database, but understanding it is really critical, so I just want to make sure everyone knows that. And if you're going to do the homework — and remember, the homework is how you get certified — do the homework, because it goes way further in depth than today's session.
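To recap that partition-key rule on the users_by_city example — roughly, with an assumed schema:

```cql
-- Assumed key for users_by_city:
--   PRIMARY KEY ((city), last_name, first_name, email)

-- No partition key: the coordinator would have to ask every node.
-- Rejected by Cassandra (there is a workaround, but don't use it).
SELECT address FROM users_by_city WHERE first_name = 'Alex';

-- Partition key given: Cassandra knows exactly which replicas hold the data.
SELECT address FROM users_by_city WHERE city = 'Berlin';

-- Partition key plus leading clustering column(s) is also fine.
SELECT address FROM users_by_city
WHERE city = 'Berlin' AND last_name = 'Smith';
```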
Today's session and last week's give you a lot of information, and if you're not interested in getting certified, then you don't need to do the homework. Exactly. Okay, then I will bring us back on track. Good. So, data modeling for Cassandra is different — but you may not understand yet how different it is. As we saw in the Menti poll at the beginning, many people here have experience with relational databases, so we will use relational data modeling as the example; our next development will be built on that comparison, and it's easier to understand that way. How do you usually work with relational data modeling? First you go to the domain of the data, the practical domain you are going to work with — in our case, at this step, employees and departments. Imagine we are building something like an enterprise resource management application, something to handle employees. First you analyze the domain, the raw data, to identify the entities, their relations, and the properties you have to take care of. Then you design tables using normal forms and foreign keys, and then you start to use them: to get the normalized data back out of multiple tables, you use joins in your queries. Normalized means you don't have much data duplication. Normalization is the process of structuring a relational database, following the normal forms, in order to reduce data redundancy and improve integrity. And what does that mean? It means you repeat no data except what explicitly must be repeated. If you have employees and departments, you don't write into the employee table what his or her department is; the employee row holds the fairly unique things, like first name and last name, and there is a separate table for departments. Then, when you need to retrieve the data, you run your select — SELECT first_name, last_name, department FROM employees LEFT JOIN departments... blah blah blah, you know how it works. And it's good — this approach was the dominant one for dozens of years, because it has some serious benefits. Simple writes: when something changes, you write it in one place. And using foreign keys and relations, you can maintain data integrity in a much simpler manner, because very often it's the job of the database to take care of integrity — ON DELETE CASCADE and all the things like that. The negative sides of normalization: slow reads and complex queries. When you need to run a select for a dashboard and you're going to load really a lot of data, you may get into real trouble, because the query may be tremendously huge and obviously very slow. Once in my life I ran into a hard limit here. I've asked our attendees this every time, so now I want to ask Patrick, maybe you've met it too: what's the maximum allowed number of joins in a single select statement in MySQL? I don't remember the version — it was approximately seven years ago. How many joins can you do in one single select statement in MySQL? Isn't it 64? Ah, almost — oh my, I see a man of experience here. Yes, I actually wrote a relational database in my life, come on. And I know why it might be 64. So the answer is 61. All right — and I think I know why, too. Pretty close; you only need three fewer. But seriously, folks, can you imagine? People in the chat are already answering: 61, 61.
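For contrast, the denormalized shape of that employees example looks something like this (a sketch — the column names are assumptions, not the workshop's exact schema):

```cql
-- Denormalized: the department name is copied into every employee row,
-- so one single-row read replaces the join entirely.
CREATE TABLE IF NOT EXISTS employees (
    id              uuid PRIMARY KEY,
    first_name      text,
    last_name       text,
    department_name text   -- duplicated on purpose; rewritten everywhere on a rename
);

SELECT first_name, last_name, department_name
FROM employees
WHERE id = 245e8024-14bd-11e9-8a27-810833dbbca2;
```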
All right — so if you're writing a select with 61 joins in there: stop. Seriously. I mean, can you imagine that query? It's nearly impossible to analyze in a slow-query tool, and obviously it runs very slowly. Well, let's be fair — that would be a data warehouse query. If you're doing something like that, you're not putting it on your website; and if you are... well, there are different situations, but in general, yes: avoid it at all costs. Okay. So we discussed normalization, and we agreed that very often, when you want your reads to be very fast, you cannot afford normalization. And when we cannot afford normalization, what do we do? We go with denormalization. Denormalization is the exact opposite strategy: you duplicate data in your storage to increase performance, especially read performance. Some people argue about whether Cassandra is read-optimized or write-optimized, and when they read about denormalization they say "ah, so Cassandra is read-optimized" — well, it's not about that. This is the same approach you can use with every database, including your PostgreSQL or Oracle or whatever you're using; it works the same way. You duplicate your data so you can read it faster: a single select instead of 61 joins — but it comes at a cost; everything does. The benefits of denormalization are very simple. Quick reads: as in the example on the right side, you have the department name inside the employees table, so when you want to read something about your engineer, you just read that row and you're done — one single statement. Simple queries, really easy to analyze — there's no need for slow-query tooling, because everything is simple and you can read and understand it easily. The negative sides: you obviously need multiple writes when you're updating data, and now you have to take care of the integrity yourself, since the database cannot help you here anymore. Developers have more work to do — but you know what, as a developer I'm mostly fine with that, because it means a higher salary. Now, NoSQL data modeling goes exactly the opposite way. We don't go from the data to the model to the application; we go from the application to the model to the data. Or I'd better say not the application but the users — people, customers. Stop thinking about the cold hardware; think of the people instead, it's much better. We think about what our guests, attendees, users, customers, clients want to have, what their behavior is, and what results they want to get — and from that, what data we need to store. Based on that, we identify the workflows of the application, the dependencies between those workflows, and their needs. Based on the workflows, we can design the queries that fulfill them — exactly how we are going to hit the database. And knowing the queries, we design the tables, using denormalization. And to handle the resulting problem of multiple writes, we use batches when inserting or updating denormalized data into multiple tables. You may want to ask me what a batch is, and I will tell you: Cedric will tell you what a batch is next week. Every week we have something special for you, so stay tuned and join the next workshop as well. But enough words — talk is cheap, show me the code. Let's get practical and work with a real example, so you can see how this happens in real life. In our case we're going to work with KillrVideo: killrvideo.com is our reference application, where we show how to work with Cassandra and DataStax Enterprise. As you may guess from the name, it's a "YouTube killer" site we developed to overtake YouTube at some point — just kidding. The first thing we're going to do is add a new feature: comments on videos. And the step-by-step design process says the first thing to do is
to identify the entities and their relationships, which all together is called the conceptual data model — entities are still the same whether you go relational or non-relational. Then we identify the application workflows, which leads us to our queries — and knowing the queries is everything. For comments, the conceptual data model is pretty simple. We have basically three entities here: user, video, and comment. Every user has an id and an email; every video has an id, a title, and a description; every comment has an author, a target video, a timestamp of when it happened, and the comment text itself. Every user can write multiple comments on a video, but every comment has exactly one owner. And that's our conceptual data model for this case — very simple; I bet you've seen diagrams much harder than this many times already. Then we start to think about the application workflows. To simplify the example, I will cover only three use cases. Use case 1: a user opens a video page. As shown before, we open a video page, we watch the video, and we want to see the comments. (Comments on YouTube — they usually don't deserve reading, sorry. On KillrVideo it's much easier, because there aren't so many comments so far.) So how do we load the comments when a user opens a video page? The answer is simple: we need to find all comments related to a target video, given a known video id — the user has opened this video already — and we have to show the most recent first. That should be a pretty simple use case, and I don't want to get too deep into it: open the video, see the comments. Use cases two and three are very similar; they serve different purposes, but they match. When a user opens his or her own profile, he or she may want to see their recent comments and the ratings of those comments — maybe someone was excited, or not so excited, about one of them. And the second case: a user has been reported for spamming, and a moderator wants to verify whether the user is a spammer or not. The easiest way for the moderator is just to open all the comments of this user, sorted by time, and look at the latest to see if he or she is writing spam and should maybe be blocked. So workflows 2 and 3 sound like: find comments related to a target user, using their identifier, most recent first. This one should also be pretty clear — comments related to a user, given a known user id; if you want to see a user's comments, you need to know that user's identifier. When we map all this knowledge together and start thinking about the queries and how we want to execute them: we need to find comments posted by a user, with a known id, and we need to find comments for a video, given a known id. That's how our queries may look. (Normally we don't recommend using the asterisk — a select with a wildcard is in general a bad idea — but to keep things simple here, it will do.) Now I want you to stop listening and think a little: why are we already planning two different tables here, comments_by_user and comments_by_video? There is a very solid reason for that. The answer is partitions — partition keys. You need to know the partition key when you are selecting something. If you need to load comments by user, you cannot run that query against comments_by_video, because only the video id is known there — and vice versa. So, getting to the logical data model: the comments_by_user table looks pretty simple. user_id is going to be the partition key; then creation date and comment id are the clustering columns — to ensure sorting, so we never have to sort anything manually, with the comment id there for uniqueness — and video id and the comment text will be plain data columns. And comments_by_video: video id as the partition key, creation date as a clustering column, the same idea as comments_by_user. The tables look very similar, but one cannot replace the other; we need both tables to load the data quickly — and with any amount of data, we will still be able to serve those comments within milliseconds. Then an interesting trick comes — one of the tricks I really like in Cassandra. You remember we just had five fields per table, and boom, now we have four. Why? We have a great type called timeuuid. A normal uuid is a universally unique id — a long string of characters and digits. It's good, but timeuuid is something even better: it's like a universally unique id, but with a timestamp integrated directly into it. It looks like a uuid and behaves like a uuid, but you can also sort by it, and it sorts by the time built into it — and you can extract that time if you need it. So the fields creation_date and comment_id are merged into one single timeuuid, and our table is already smaller — that's great. And finally, based on everything we discussed, we can write: CREATE TABLE IF NOT EXISTS comments_by_user, with userid as a uuid, commentid as a timeuuid, the identifier of the video, and the comment as text, with PRIMARY KEY of userid and commentid — you see only one clustering column here — and WITH CLUSTERING ORDER BY commentid. And it goes basically the same for comments_by_video, just in a slightly different order: the partition key is videoid, not userid. All right, moving on. We're at the last and shortest part of the workshop — the most important things are covered. Now we very briefly cover the data types. The list of basic data types is pretty big, but it's simple in general, because it's exactly the mix you're used to from any other database: all the integers, strings, and things like that. The only notable one is timeuuid, which I've explained to you already, so I don't want to get stuck on this step. Collections are a little more interesting: the collection types are set, list, and map.
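Reconstructed as CQL, the two comment tables described a moment ago look roughly like this (column names follow the KillrVideo reference application; the DESC ordering is my reading of "most recent first"):

```cql
CREATE TABLE IF NOT EXISTS comments_by_user (
    userid    uuid,
    commentid timeuuid,   -- one column replacing created_at + comment_id
    videoid   uuid,
    comment   text,
    PRIMARY KEY ((userid), commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);

CREATE TABLE IF NOT EXISTS comments_by_video (
    videoid   uuid,
    commentid timeuuid,
    userid    uuid,
    comment   text,
    PRIMARY KEY ((videoid), commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
```

Same data, written twice at insert time, so that each query hits exactly one partition.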
A map is pretty simple — key/value, key/value. A set is a collection of unordered things, and a list is the same as a set except that it preserves the order in which things were added. For a set, think of something simple like tags: you attach tags to a video, and you usually don't care much about their order — there's no such thing as ordering tags — so a set is a very simple way to store tags as text and keep them together in one field. The things you can do with the set collection type are pretty obvious: you can insert rows that include sets; you can update a set, replacing the whole set entirely; or you can add something to the existing set with a normal plus operation, as you see down here. And important: a set does not care about your order. A list looks similar, but it does care about the order you insert data in. You can still insert; you can replace the entire list, wiping it out completely; you can append to add more elements, which then land at the end of the list; and additionally you can replace a single element, changing one element of the list to another. A map — well, this should be familiar to everyone working in software development: key/value, key/value. In this case we can have a collection for phone numbers — every user may have multiple phone numbers: a work one, a home one, a mobile one, whatever you can imagine — so we go with a phones map. And very easily we can insert a new one, replace the entire map, add to the map, or change an existing key/value pair — everything you'd expect to do with key/value data. User-defined types — those are really great, because you can define your own type in Cassandra and work with it like a native type. You can simply create a type killrvideo.address — and notice the keyspace name: all user-defined types, or UDTs, are explicitly bound to the keyspace they were created in.
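The set, list, and map operations just described, sketched as CQL (the videos and users tables and their collection columns are assumed for illustration):

```cql
-- Assumed columns: videos(tags set<text>, regions list<text>),
--                  users(phones map<text, text>)

-- set: unordered and deduplicated -- add without replacing the whole value
UPDATE videos SET tags = tags + {'cassandra'}
WHERE videoid = 245e8024-14bd-11e9-8a27-810833dbbca2;

-- list: keeps insertion order -- append lands at the end
UPDATE videos SET regions = regions + ['eu-west']
WHERE videoid = 245e8024-14bd-11e9-8a27-810833dbbca2;

-- list: replace a single element by position
UPDATE videos SET regions[0] = 'us-east'
WHERE videoid = 245e8024-14bd-11e9-8a27-810833dbbca2;

-- map: add or overwrite one key/value pair
UPDATE users SET phones['work'] = '+49-000-0000'
WHERE userid = 9761d3d7-7fbd-4269-9988-6cfd4e188678;
```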
If you want to use the same type in multiple keyspaces, you therefore have to create it multiple times, specifying each keyspace you want to use it in. Within the UDT you may define fields of different types — texts, integers, whatever you need — and after creating the type, you can use it within the keyspace you created it in. So for KillrVideo users we can define a location field of type address, and that's great, because you don't need a pile of additional top-level columns for it; user-defined types are heavily used by Cassandra developers. What can you do with user-defined types? You can insert; you can replace the whole UDT; you can replace a single UDT field — notice the last example, UPDATE killrvideo.users SET location.city = ...: you do not have to specify the whole UDT, you can access one specific field of it, which is pretty great. Moreover, you can also read one specific field of a UDT, like SELECT location.city FROM killrvideo.users — given that location is, in this case, of type address, or whatever type you created. Counters. A counter is a very special type: a 64-bit signed integer with which you can very quickly work with imprecise values such as likes, views, and that kind of information. Note that there are a lot of banks using Cassandra, and you definitely don't want your bank to use counters to handle your salary or account statement or things like that — it's not going to look very nice; counters are not for that. Counters are for quick, imprecise counting: adding likes, adding views, and so on. They support only two operations, increment and decrement; the first operation assumes the value was zero, and you cannot set a counter to an exact value — it only accepts increments and decrements. A counter cannot be part of a primary key, and it cannot be mixed with other (non-counter) data columns in a table. Rows with counters cannot be inserted, and the updates are not idempotent — that's important to understand. Counters should not be used for precise values; that's important.
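The UDT pieces above, put together as a sketch (the field names are illustrative; updating a single field like location.city requires the column to be a non-frozen UDT, available since Cassandra 3.6):

```cql
-- UDTs are scoped to a keyspace; recreate per keyspace if needed.
CREATE TYPE IF NOT EXISTS killrvideo.address (
    street text,
    city   text,
    zip    text
);

CREATE TABLE IF NOT EXISTS killrvideo.users (
    userid   uuid PRIMARY KEY,
    location address          -- non-frozen, so single fields can be updated
);

-- Update or read one field instead of the whole UDT
UPDATE killrvideo.users SET location.city = 'Berlin'
WHERE userid = 9761d3d7-7fbd-4269-9988-6cfd4e188678;

SELECT location.city FROM killrvideo.users
WHERE userid = 9761d3d7-7fbd-4269-9988-6cfd4e188678;
```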
important so how to use that very simple very simple table killer video video playback stats there we have only two fields two rows video id as unique id and the partition key and primary key uh in the end and views as a counter and having these killer video videos you could set views you can increase them and decrease them specifying an integer value you are adding or removing good so it was a really quick jump into the data types and you will cover them much deeper and work with them in person on your own at the homework but when i say on your own that doesn't mean exactly alone because you are not alone you have community.datastax.com to work with and to ask your questions you have discord server to talk all of the things about those exercises with the developer advocates of the team and we did a great job today all right um so yeah uh that was a lot quicker at the end there so um everyone should have an aster account if you have not gone through the process of the last week we went through the process of creating your own astra account and astro gives you cassandra's a service this is uh as a free tier free for life there's no time limit on it and there's a limit of 10 gigabytes of data which uh is actually pretty generous for a free tier but it gives you enough cassandra that you can connect to it using your own code or with a notebook but it's a great place to try out cassandra without having to go through the process of installing it or getting someone to set it up for you you can just click a button and you have ready to go you're ready to ready to rock so if you haven't done that go do that now that's free for you um to do the the notebook homework and to do the workshops uh or the the um the work that's done in academy you'll need that so again free for life so once you have it it's yours forever um what i'd really like to see is you know at this point is i mean who's ready to go build an application who's ready to do this because i saw a lot of people 
There are a lot of you out there ready to do it. So if you're out there right now in the chats — I see a lot of people, both on Discord and on YouTube — are you going to go build something really cool? I want to hear from you right now. Yeah, I've got some "ready"s out there, and plus-ones. Awesome — I love to see that kind of energy. We want to hear from you. There are a couple of things I want to point out. I posted a DataStax example in Discord, and we are looking for people to build example code that we can promote. We do a weekly — or bi-weekly — newsletter, and we would love to promote what you're doing, so if you're building something really cool, we want to see it. And if you want to give a talk, join us at one of these workshops — we would love to know about that too. Contact myself, contact Jack, any of the developer advocates, anyone out there you think is part of DataStax, and let them know you want to be a part of this; we need good community voices. Another thing to point out: we also have our Accelerate series, which is happening soon — those are a lot of fun, and they give you a look at what's happening in the future. Another thing that's happening is ApacheCon, and ApacheCon is going to be virtual. If you're using Cassandra now, we'd love to have you: there's a CFP open right now, for the next four days. I'm going through all of my community items — and then, finally: get ready for Cassandra 4.0. It looks like we're going to be shipping a beta pretty shortly, and I'm really excited about 4.0, because it will be the most stable database on the planet. It's taken years to come around, but for good reason. There's going to be a blog post on the Apache website describing all of the testing that has been done on Cassandra 4.0 — some big users have been testing full workloads on it. This will be a .0 you can actually put in production, and I'm not kidding. Never had that happen before — have you, Alex? No, never, actually. You know, my golden rule of thumb was always to wait for the first minor release — in normal conditions I would wait for 4.1 — but I know the team working on Cassandra 4.0, and I really trust these folks. Yeah, and it's starting this whole new version of what we're going to be doing with Cassandra in the future. If you're working with Kubernetes, Cassandra is going to be better built for that. It's no longer using JDK 8 — it's using JDK 11; thank you, I saw that in the chat. We have a really interesting blog post on the DataStax website about our testing with some of the new garbage collection techniques around JDK 14 — Shenandoah, ZGC — very interesting, worth looking at. But come on, let's get down to it: you're probably going to use Astra, because you don't want to run your own Cassandra, right? So why not. And what's happening in Cassandra 5.0 — that's the conversation after 4.0 — we want you to be involved in that too, so get into the conversation. Great. Okay, then a few last things: the homework. You know, as I live in Germany, I would call it Hausaufgabe. What do you do? First, we have the notebooks for this course — Data Modeling and Advanced Data Types, two notebooks — and they are stored at github.com, in the DataStax Academy Cassandra Workshop Series repository, under week two. So go get them and do them using your Astra instance. ...A restart should help — yes, we should be back online. Okay, sorry about that, everyone; we're back. Exactly, we are back, and I want to proceed with the slides. So: homework, week 2. The notebooks — Data Modeling and Advanced Data Types — are on our GitHub. Then on academy.datastax.com there is the course DS220, with exercises you will do as the first step of the homework; all you need is the videos. And then you have to fill in the form validating your attendance — and I will try to switch back in a moment. Yeah, something is very wrong with my hardware, but I'll try. Home stretch — we're almost there, Alex, you can make it. Yeah, yeah — it just had to happen at the very last second; incredible. Well, at least not at the beginning. Google Chrome... yes, that's it. Well, while Alex is banging on his computer with a hammer: we're pretty much at the end of this, but you can see on your screen what you need to do, and you should also have an email with everything you need to do. If you have any questions, we never leave Twitch and we never leave Discord; most of us are also on the ASF Slack, so we're out there. And we would love to get your questions on community.datastax.com, because when you ask a question there, you're actually sharing it with the rest of the community — everyone can look at what your question was and see the answer. It helps with things like Google search when someone has a similar problem; so putting questions on community is you being part of our community. Thank you. Next week we're going to get more into application writing, which is pretty exciting. Anything else to talk about, Alex? I think we can almost let everyone go. Yes — we have only two more points. This week we have a bonus for you: if you want more, and DS220 and the notebooks are not enough, you can go deeper at katacoda.com/datastax — the links should be available on our main page at community.datastax.com. There are single-partition exercises, multi-partition exercises, and advanced data types, to help you learn it better and deeper. Yep, absolutely. Let me check whether we have anything else — I guess we don't, so a few last words about the resources, in case you need them. There is academy.datastax.com for you to watch the courses — that's part of your homework for weeks one and two, so you should be there already. To communicate, ask questions, and share ideas, there is community.datastax.com. Connect with us and follow us on YouTube, Twitter, Twitch — whatever you prefer, we are there so you can reach us. The materials for this week are already published on GitHub, in the DataStax Academy Cassandra Workshop Series repository, under week two, and all the links are going into the chats — the YouTube chat and also Discord. Feel free to ask; there are plenty of links to go around. So, weeks one and two of this cloud-native Cassandra applications workshop series are done. Next week will be Application Development with Cassandra, part one — and please note that the workshops on Wednesday and Thursday are exactly the same: both are live, maybe with different speakers, but the content is identical, so choose the one that fits your time zone best, because we want to cover the whole world with these workshops. After that we're getting closer to part two, with testing, deploying, and monitoring your applications. Thank you so much for being with us — we are really excited to see you learning with us, and it's really great to have you here. All right, thanks everyone — we'll see you in the next one next week. Have a great week. Thank you.
Info
Channel: DataStax Developers
Views: 12,618
Rating: 4.909091 out of 5
Id: 5NoixINC9l4
Length: 89min 24sec (5364 seconds)
Published: Wed Jul 08 2020