Cassandra Data Modeling Methodology

Hello everyone! If you can hear us, you are most probably very happy to start our workshop, Cassandra Data Modeling Methodology. We are Aleks Volochnev and David Gilardi, developer advocates at DataStax. If you can hear us and see us, please give a thumbs up in the YouTube chat so we know we can start. Hi David!

Hello Alex, I hear you, and it's going very well. I'm happy to be here. Data modeling is one of my favorite topics, maybe my most favorite, and I'm very happy to speak about it.

And we already have almost a hundred people attending, a hundred people who want to learn about data modeling. What does that say? It says wonderful things about all of you who are here with us today, because especially when you're doing data modeling with Cassandra, you want to get this stuff right.

Exactly. So it looks like all is good and we can start smoothly. It's 6:03 p.m. at my place; what time is it at yours? One o'clock in the afternoon. Okay, pretty nice as well. Meanwhile, please write in the YouTube chat where you are from, so we know. It's incredible how distributed the community is; such an interesting time we are living in. I see India; I saw Brazil earlier; France; Lithuania, that's awesome. You know, I'm from Chicago originally, and the town I'm from is actually a big Lithuanian town: all the street names, the church, everything is Lithuanian there. That's cool, I didn't know that. And I'm originally from Saint Petersburg and now live in western Germany, in a city you don't know, so the name doesn't matter. We've got the Netherlands; I see Phoenix, Arizona (hey AJ Burns, I have family over there); San Francisco; Minnesota. Okay, great.

So let's start; we have a lot of things to cover today and don't want to take too much of your time. Just two words about ourselves. We are developer advocates at DataStax, and our job has one main duty: to help developers upgrade themselves. Some people say "upgrade developers"; I find that wrong. If a person doesn't want to become a better professional, there is nothing we can change, just nothing. Our job is to help people upgrade themselves and become better professionals, and today that means helping you upgrade your understanding of data modeling for a distributed database. Our primary subject is Apache Cassandra, but the same principles are applicable to basically any kind of distributed system.

A bit of housekeeping first. We have a live stream on YouTube, which is most probably where you are watching us, and a secondary, backup stream on Twitch. We answer questions and comments on our Discord server, which you can find at bit.ly/cassandra-workshop, and on the YouTube stream. In general we suggest using Discord, because the YouTube chat is closed as soon as the workshop is over, while Discord stays with you, and there are multiple thousands of people there, so you will most probably find someone to answer your question. Note that we don't monitor or answer questions on Twitch; we want to focus on a
reasonable number of channels. For quizzes, where we ask questions of you, play a game, and hand out swag, we use Menti.com. You may know this system already; it's completely anonymous, you don't have to enter your email or anything, you just go, play the game, and win some swag. For homework and the practical steps, today we use Katacoda. Katacoda is a great tool: you don't have to install anything, you don't have to get Cassandra running on your machine, you just do everything on the web.

So, your homework. If you're new to us or to one of our workshops: yes, we assign homework. We're going to make you do things, but you get to earn cool stuff for it, right? So it's worth it, both for leveling up and for the cool stuff. Exactly. And don't worry, we'll get you that Menti code; it's coming.

Achievements unlocked: so many of you are doing great work participating in the workshops, really learning, and asking great questions, so we wanted a way to highlight you and help you be visible on LinkedIn and on the job market. So we introduced achievements, or badges. We have the regular big certification with exams and things like that; a badge is something smaller but still valuable, and we already have people from the K8ssandra workshops and the Intro to Cassandra workshops earning them. We are working on the platform to deliver them, and we expect to publish them to LinkedIn this week. So watch this workshop, live or recorded, do the homework, and then enjoy your well-deserved badge and publicity.

For the homework you will do two scenarios on Katacoda. It's simple, it doesn't require any kind of complex installation, you just do everything in your browser. I wouldn't even say it's free, David, because DataStax pays for it: it sponsors the training nodes so you are able to do the homework. So for you it's free, sponsored by DataStax, which by the way is one of the best companies I've ever worked for.

So: Cassandra data modeling methodology, or, I'd better say, how not to screw up with Cassandra. It's very important to understand how to make your database not just running but successful, and there are only two rules of success: know when to use it, and know how to use it. Very simple.

And let me check something... it looks like something happened with David, and I've lost him. Hey David, are you here? Can you hear me? Okay, I hope David will join me later; sorry for this little delay, I hope he will be back with me very soon.

So, two rules of success: know when to use it and know how to use it. Those are the only two things you need to know. If you know them, you will be fine; if you don't, it's really easy to fail with Cassandra. Know when to use it: don't be like the guy on this slide; this tool is not suited to fix a tire on such a truck, you obviously need something bigger. And my second point: know how to use it, otherwise you will be like the guy on the next slide. This one is obviously an accident, but I like this picture a lot, because that's exactly how I pretty often see people using their databases. Don't be like that; this workshop will help you not to be. And what is Cassandra good for? That's the topic we start with, but before that I have to ask you some questions.
Our Menti code for today is 5779 3515, or actually I can just push this button and you will have the code and a QR code on the screen, so you can simply enter the code on the menti.com website or use the QR code to get in. I want to understand your situation as you join: how deep is your knowledge, and are there topics we have to go deeper on than others? So please answer my questions. This is not the quiz yet; the quiz will require some speed from you, while these are questions for me, to tune the content.

Okay, I see people are joining, that's great, so let's start smoothly. First question: how much experience do you have with relational databases? I mean traditional relational databases, what you usually know as SQL databases: PostgreSQL, MySQL, Oracle, and so on, everything that relates to SQL. And I see some interesting results: most people have significant experience with SQL, some use it all the time, some only a little, some have no prior experience. That's great to know; so we have some novices here.

Next question: how much NoSQL experience do you have? And I don't mean Cassandra right now, but all the other NoSQL databases: MongoDB, Redis (some people say Redis is not a database, but I insist it is), and any others you may name. We see: no prior experience, a little bit, "I use NoSQL a lot", "I'm a true guru". What a different picture: nearly everyone knows SQL, but not so many people have significant NoSQL experience. That's pretty cool, and you know what, we are going to change that. The world used to work with general-purpose databases, which can be a jack of all trades, master of none. Now we tend to use more and more purpose-built databases, dedicated to a specific purpose and really great at it, and that matches the idea of microservices, where you can have different databases for different parts of your system.

Good. So, finally, how much Cassandra or DataStax Enterprise experience do you have? No prior experience, a basic understanding, up to one year, one to three years, over three years. I see mostly no prior experience, so take note: in general this workshop expects you to have some understanding of Cassandra, not too much, but some. I will run through the introductory slides at the beginning of this workshop, but frankly speaking, pretty quickly. So if you don't have any prior experience with Cassandra, please be very focused during the beginning; I will not repeat the basics, because we have to focus on the data modeling methodology. VJ asks what DSE stands for: DataStax Enterprise. It's, let's say, the enterprise version of Apache Cassandra delivered by DataStax, with some very nice additional features, but we aren't speaking about it today.

Good. And what about normalization and denormalization? This is pretty interesting: the world used to hold normalization up as a holy grail, and today I'm going to shift your mindset a little and show you when normalization can be harmful. Most of you have a basic or good understanding, and some are beginners; don't worry, it's explained in this workshop, so if you don't know these concepts, you will after today, which is already a good upgrade. So we are done. Ah yes, and finally, one more question (I thought the previous one was the last):
how much of your work involves databases? Maybe you're a front-end developer who doesn't work with databases at all; or, from the other point of view, maybe you're a database architect, a database administrator, or a database reliability engineer who works with databases all the time, with nothing else but databases. That's what I want to know. I see most people work with databases a lot, or at least somewhat, and some not too much.

Hey David, you are back! Yes, perfect timing: right as we started, my Mac decided we don't need networking, we don't need to work. A lunch break for it, I guess. I'll be honest, I was hungry too. Alex, you might need to share your screen to me again. Oh yes, give me a moment, sorry everyone; this is a live stream, there are always things like that. Doing these online versus doing them in person: in person you can't get rebooted, so there's that. Good. Hey, thanks, Aleks.

All right. At this point we have all the knowledge we need about you, and we are finally ready to proceed. We see there are people whose primary duty is databases, people working with databases a lot, some not too much, and seven people who usually don't work with any databases: welcome to the great world of databases! It's incredible; basically you use databases all the time, just not directly. So we are done with the questions. Don't close this screen, because the quiz will happen in exactly the same place when we proceed. And now I'm switching back to my presentation.

So, what is Cassandra good for, and what is it bad for? What is it good for? Absolutely everything! No, not absolutely everything; that's the point. You remember the two rules of success: when to use it and how to use it. So let's keep talking. (Well, if you know the original song, you wouldn't want me to... you know: "what is it good for, absolutely nothing". That was fun, but you're right, Alex.)

Good. So, very quickly, just a few words on what Cassandra is; I'm running through these slides. Cassandra is a NoSQL distributed database. You can have a single Cassandra node, like you normally run MySQL or PostgreSQL, but usually you run it with multiple nodes, so data is distributed. It's not only distributed but also decentralized: there is no such thing as primary nodes, secondary nodes, write replicas, read replicas, or, as we used to call them, master nodes and slave nodes. Nope, nothing like that; every node has the same duties. It's completely decentralized, there is no single point of failure, and it's highly available. Together, nodes build a data center, also known as a ring (a data center and a ring are the same thing).

Cassandra scales linearly. This plot is not there to show that Cassandra is faster than other databases, although it usually is; my point is that the more nodes you add, the more performance you have. There was very nice research done by the Netflix team and published in their technical blog. By the way, Netflix has a great technical blog; you absolutely must subscribe to it if you are interested in the technologies Netflix is using, and they use Cassandra a lot. They scaled a test deployment from, I don't remember exactly, around 20 machines to 350 machines, and it scaled linearly, with zero to very low overhead from adding new nodes. Basically, there are no limits to scaling. The main Apache Cassandra deployment at
Netflix consists of more than 70,000 nodes. And this holds true for thousands of nodes, right? Well, that's not a single cluster, but still. And you may think: that must be the biggest Cassandra deployment in the world. Wrong. We used to think Apple had the biggest Cassandra deployment in the world, getting close to 200,000 servers dedicated to Apache Cassandra, and now we know even that is wrong, because Huawei has an even bigger deployment.

So, data is distributed. On the right you see a table of country, city, and population: basically the number of citizens per city per country. In a normal, traditional database you have this table stored on a single machine, or maybe on a write node and some replica nodes. Cassandra works differently: in Cassandra this data is spread, distributed, over the cluster. Why is Cassandra big-data ready? Your traditional relational database may work great as long as it fits your use case, and what is your use case? As long as your data fits on a single server. When your data doesn't fit on a single server, you have to use sharding, and sharding is a very painful, manual process that does very, very bad things to your infrastructure and your database ecosystem. Cassandra does partitioning natively, from the very beginning. It's completely automated: it's not for you to take care of data allocation, shuffling data between servers, or adding new servers. You just throw in a new server; you get more nodes, you get more throughput and capacity, and it's the Cassandra cluster's job to take care of how the data is allocated. Data is distributed automatically; it's called partitioning, and it's great. That's what makes Cassandra big-data ready.

But data is not only distributed, it is also replicated. The RF you see on the left side is the holy grail of Cassandra: the replication factor. Replication factor three means every row, or better to say every partition, is stored on three different nodes, which is what makes Cassandra highly available. When you write some data, it's sent to the replicas, and every node responsible for that set of data acknowledges it. It works very smoothly, but if a node is not available, then the node that got the request stores the data without a confirmation: "I see my node is not responding, for whatever reason; I will store a hinted handoff." And as soon as the failed node reports recovery ("Colleagues, I'm back online, sorry, what did I miss?"), the node that got the request dispatches that hinted handoff back, and the node recovers completely. So there is no missing data; Cassandra has a lot of self-healing mechanisms.

David adds: Exactly, and I was just going to add to that. This mechanism Alex is talking about, hints and the self-healing mechanisms of Cassandra, doesn't just apply within individual data centers; it actually applies across data centers. You could even have a case with multiple data centers where a whole data center goes down, which we've seen with some major cloud providers in the last several years, where whole regions got lost. As requests come in and one of the other data centers takes over that traffic, it will also store hints for the downed data center, and when it comes back up, it will get back in sync as well; it will heal it up.
And one other question I want to head off, because we almost always get it: should you use any RF less than three? Unless you're just messing around with a quick development or test environment, a single node or something like that: no, stick with three or more. Really, three is the sweet spot; three is the standard, as Alex said, the holy grail, and that holds whether your cluster is three nodes or thousands of nodes. With RF three you get a very nice balance between availability, consistency, and performance. So just use three unless you have a good reason not to.

Yep. And there is a good question I see: what happens when multiple nodes fail? It depends on your settings. When you run a query, you can specify a consistency level. It has a default setting, so you don't have to, but you can specify the desired consistency level. If you have replication factor three, you may say "I require consistency level ONE", so even a single answer is enough for me; this way your data is written or read as long as one single replica is available, though it may bring some troubles we discuss later. Or you can specify consistency level QUORUM, which requires the majority of your replicas to be available, so with replication factor three you need at least two. If two nodes are down, the cluster will answer that the desired consistency level cannot be reached: "I have only one node, and you require two to be available." Then there is a problem for you to handle (see the cqlsh sketch below).

Alex, I do see a set of good questions coming in, and two of them are related. From VJ: do you advise having nodes across regions or zones? And Bennett asked: can you require that replicas are all in different regions? You can totally put your nodes wherever you want; here's the caveat. For nodes in a single data center, geographically speaking, we're talking wire speed, speed of light: you don't want to put them so far apart that a lot of latency is incurred between the nodes. However, you can absolutely distribute the nodes of a single data center across multiple cloud availability zones, and yes, that's a good practice. And I think, given what we're talking about, the folks asking are really asking: what happens if a cloud provider goes down in a region, and how do I protect against that? Exactly, and that's a great idea. Just make sure you're not setting up a single data center with some nodes all the way on the west coast and some all the way on the east coast of the United States; that's going to be too much latency. Either keep a single data center within a small geographic region, or run multiple data centers in those further-apart geographic regions, which facilitates the same thing. So the answer is definitely yes, you can spread them across regions just fine. And I think you already answered the question about whether this is the same as having a quorum, because you talked about quorum.
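To make the consistency-level behavior just described concrete, here is a quick cqlsh sketch. CONSISTENCY is a real cqlsh command; the keyspace, table, and data are hypothetical, for illustration only.

-- cqlsh session sketch (demo.users is a hypothetical table)
CONSISTENCY QUORUM;   -- wait for a majority of replicas: 2 of 3 at RF=3

SELECT * FROM demo.users WHERE user_id = 42;
-- With RF=3 and two replicas of this partition down, the coordinator
-- cannot reach QUORUM and returns an Unavailable error for you to handle.

CONSISTENCY ONE;      -- a single live replica is now enough
SELECT * FROM demo.users WHERE user_id = 42;  -- succeeds if one replica is up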
Let's see... Sarahan asks: does the failure risk or frequency increase as the RF increases? That's a neat question, a really good question. In general, the more nodes and servers you have, the higher the chance that some of them will be down. If you have 2,000 servers rather than two, there is a pretty high chance that at least one of them will be down at some point. But for us this is not so much about failure frequency as about possible answer delays. Take a look: we have replication factor three and we execute a SELECT statement with consistency level QUORUM, which means two, so we will wait for two answers. There is always some possibility of network latency, but since we only need the majority of three nodes, we take the two fastest answers and don't wait for the slowest one; it works pretty quickly. But if you had, say, replication factor 100, you would have to wait for 51 responses. That's why we don't recommend a replication-hungry setup like RF 100: there is a much higher chance of running into latency, because you're waiting on a bigger group.

And to take that example further: if I have a replication factor of 100 and I'm trying to achieve consistency level QUORUM, that means I have to get acknowledgements from 51 nodes. So you can see there are definite diminishing returns; simply increasing the replication factor doesn't get you what you need. That's actually why we say replication factor three, even if you have thousands of nodes: it is the standard, it is the sweet spot. You don't need to increase your replication factor heavily to get the availability benefits; as a matter of fact, if you increase it too much, as Alex pointed out, not only do you add a lot of latency, but now you have to wait for all of those nodes for your consistency when you go to read and write.

Now, the last question I'll answer before we proceed, because these are really questions for the Introduction to Cassandra workshop and this is the data modeling workshop, so we should proceed with some respect for those who came for data modeling. The statement was: "Cassandra works as AP, as per the CAP theorem." Sorry Ibispajit, that's not a correct statement. People say Cassandra is AP in CAP terms, but that's not technically correct: Cassandra is configurable. Switch the consistency level to ALL and boom, you are CP, not AP. And that's at the query level. Cassandra is configurably consistent, not eventually consistent, actually. We're happy to answer your questions; let's discuss them after the workshop in our Discord. Yes, keep your questions coming in the chats, and I'll try to answer them.

That also answers the question about cross-cluster replication: it's not cross-cluster but cross-data-center. One single cluster may have multiple data centers: you may have data centers in the United States, Europe, and China, all working as parts of the same cluster simultaneously, and you specify the replication factor per data center. So you can say: for my application I want three replicas in the United States, five replicas in Europe, and zero replicas in China, because this part of my application just doesn't work with Asia, so I don't need data there, and I save time, money, and disk space on replication.
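In CQL, that per-data-center replication is declared when creating the keyspace. A minimal sketch: the data center names are assumptions and must match the names your cluster actually reports.

CREATE KEYSPACE killrvideo
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'US-East': 3,   -- three replicas in the US data center
    'EU-West': 5    -- five replicas in Europe
  };
-- No entry for the Asia data center means zero replicas are stored there.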
Also, Cassandra is completely platform agnostic, so you can run it on Microsoft Azure, AWS, Google Cloud, or in an on-premises data center, whatever you want, simultaneously. You can have the same cluster spread over multiple clouds, and that's really cool. It's a very good, very powerful technology that you can run everywhere without depending on any provider. Today you need a data center on Microsoft Azure? Here you go, it's easy to organize. Tomorrow you aren't satisfied with the pricing policy anymore? Boom, you can easily migrate to another cloud. So that's really cool: no vendor lock-in.

Now, there is one incredibly important limitation to understand, and really few people working with data understand it. Imagine a line with two endings, and some databases lean to the left, some to the right: call the endings OLTP and OLAP. OLTP stands for online transaction processing, and OLAP stands for online analytical processing. What does that mean? OLTP workloads tend to have the highest requirements from the time point of view: you want to open, I don't know, a profile picture of your friend on Facebook, and you want to see that picture right now; you don't want to wait an hour, you need it now. That's typical for OLTP, which is why it's called transaction processing: a high volume of transactions. OLTP also tends to run more or less simple queries; "simple" is maybe not the best word, but queries don't tend to be a kilometer long, so they tend to be easier and simpler. And queries don't change too often. Every application evolves over time, business requirements change, and your queries change as well, but that usually happens over weeks, months, or years, not many times per day.

On the other side of this line we have OLAP, online analytical processing, which is different. Answers can wait: my analytics team works with this database, they're getting a salary for what they do, they can wait. Queries tend to be complex, sometimes very complex, pulling data from very different pieces of your system and bringing it all together. And queries tend to change all the time. Today you realize: I want to run a marketing campaign, I want to send an email to all of my subscribers, all of my clients from the last year, who bought at least one thing for a car, car tires for example, and made at least two purchases from, I don't know, the beauty-and-style category, in exactly these sizes. You construct the query and get the data. Tomorrow you're looking for completely different data. Queries change all the time; that's how analytics goes.

Traditional RDBMSs sit in the middle: you can do some transaction processing and some analytics with them. Cassandra is different: Cassandra sits strictly on the transactional processing side. What does that mean for us? Cassandra is very good when you need your answers now, very quickly, right now, in milliseconds or microseconds; when you use simple queries, because joins are slow and joins are not scalable, so I need simple queries and denormalization (we'll speak about that later); and when queries don't change too often.
If, on the other hand, your requirements are such that you aren't so much into fast answers but you are going to run complex queries that change all the time, then Cassandra may definitely be the wrong choice for you, or better to say, Cassandra alone is not enough. Usually you have a mix of needs: a little OLTP and a little OLAP, and in that case we can speak about teamwork. For example, Apache Cassandra has a perfect integration with Apache Spark, and you can run them together successfully: Cassandra on the OLTP side, Spark on the OLAP side, able to answer whatever query and retrieve whatever data you want, though of course not as fast as Cassandra; it's really hard to keep pace with Cassandra.

Someone on YouTube says it perfectly: you can't have it all in one database. Yes, I completely agree with that. There are trade-offs, and Cassandra takes the trade-off of being super fast at scale. Exactly: you can have it all, but not within a single database. That's basically what DataStax Enterprise does: you have Cassandra plus some additional tools, and then you have both OLTP and OLAP, not within a single tool, but within a platform.

So, the main features of Apache Cassandra: big-data ready, the highest possible availability, geographical distribution, extreme read/write performance, and vendor independence. Those are the sweet spots of Cassandra. If you see this list and think "my project needs all of this", then you definitely need Cassandra. If the list doesn't speak to you, if your project isn't really ready for that and you don't have such high requirements, then think about the next two lines I'm adding: Cassandra requires qualified engineers, and Cassandra is OLTP specific. If you are the only person working on a small, single-region project and you don't expect billions of clients from all over the world, then Cassandra really can be a bad choice for you, because you need your own team of database engineers to keep it running. Or consider DataStax Astra, where you already have those engineers supporting your Cassandra automatically. We are DataStax; we run Cassandra as a service, and that solves the "requires qualified engineers" problem. But if you want to run it on your own, you need to understand that requirement. And again, Cassandra is OLTP specific: fast queries, simple queries, queries that don't tend to change too often.

And you know what, there is a very nice consequence. More and more companies are growing bigger, and the amount of data grows tremendously: by the year 2025 the amount of digital data in the world is expected to be more than 200 zettabytes. That means the world needs a lot of data engineers, people able to work with the most powerful databases in the world, which includes Apache Cassandra. By learning Cassandra technologies you become able to get better and better salaries in the best companies in the world: Netflix, Instagram, Apple, and many, many others all use Cassandra, because at that scale you need something extremely powerful.

So, use cases: when to use Cassandra? Everything that relates to scalability, availability, distributed platforms, and cloud-native systems. You say cloud? Okay: Cassandra is able to distribute your data and keep your replicas in different availability zones automatically;
you don't even have to configure the replication by hand, like "I want one replica in availability zone one and a second in availability zone two"; Cassandra knows it and does it for you. So it's really cloud native. You want global presence? Here you go: you can have your data centers in any place in the world. Mission critical? There is basically no database more available than Cassandra, as long as you cook it right, of course; it's still shared responsibility: Cassandra gives you the tools, but it's up to you to use them. And finally, scalability: Cassandra scales linearly and is able to handle any amount of writes and reads as long as you have enough nodes, and if you don't have enough nodes, throw in more nodes and it will handle it.

Good. If this part was too quick a run for you, that's perfectly fine; I see we have a lot of people with no prior knowledge of Cassandra. We do Cassandra-related workshops every week; we did Intro to Cassandra only recently, and we will do it again very soon. In general, I suggest you just watch the recording of the last workshop, do your homework, get your badge, and be a better engineer.

So that's great. Now, hey David, are there any questions from the first part you want to mention before we proceed? You know, I think I got most of them in the chat. There's one I'm actually answering in Discord: "Most of the time, teams complain about repair being slow when the keyspace grows in size; is there anything that can be done to improve that?" I realize this is more on the Cassandra operations side than data modeling, but one of the big things we see a lot are cases where folks aren't actually running repairs regularly, or at all, and then they decide they need to do it one day and they're way behind. So one thing is to ensure that you're running them regularly. There are also different types of repairs; there are things like incremental repairs, so you don't have to do a full repair every single time. And think about it: repair is going to struggle if the underlying partitions themselves are getting bigger and bigger and are unbounded, not controlled. So it helps to keep your partitions within practical limits, which I think we're actually going to talk about a little later. I don't know, Alex, if there's anything you want to add.

Not to this one, but there are two points I want to mention from the questions on YouTube. First: "Are joins possible in Cassandra?", asked by Harani. I already answered in the chat, but I want to explain. Joins are not only not possible in Cassandra, they will never be possible, intentionally, for a very simple reason. The main principle of Cassandra is extreme performance on data of any size, and your joins may work more or less fine as long as your data is on a single machine, but Cassandra is distributed and decentralized. What does that mean for us? Joins don't scale, and as long as joins don't scale, they are just awful overhead on a distributed system. Can you have joins on a distributed network? Yes, it's possible, but then you have to forget about any kind of performance. As soon as you run a join in a distributed environment, forget about milliseconds,
and Cassandra is all about right now. So: no joins.

Second point, from Camara: "My team said you have to unlearn normalization to get a better understanding of Cassandra." That sounds great, but I think we can upgrade this phrase a little. When you learn SQL and databases, at university or in practice, everyone says normalization, normalization, normalization; they give you this sword of normalization, and whatever you do, you do with normalization. And you know what? Normalization is great, normalization is really great, and for many of my use cases and tasks I use normalization happily. But besides this sword, I can also give you, say, a crossbow. There is a place to use the sword and a time to use the crossbow, and that crossbow is denormalization. The point is to upgrade you so you are able to understand both tools and use them in the right places. There is nothing holy about normalization; it's just bad when you use it at the wrong time in the wrong place. Sometimes you need a gun, and denormalization can be your gun for many projects. So: relational goes with normalization, NoSQL goes with denormalization.

Let's take a look at what normalization is, first of all. Database normalization is the process of structuring a database in accordance with a series of so-called normal forms, in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar Codd as part of his relational model. It means you tend to write every piece of your data only once: basically, a name, like the name of a department, can be mentioned only once in your database, and if your engineer works in the engineering department, you will have two tables, departments and employees, with a foreign-key relation from engineer to department. That's how normalization works: deduplication, reducing data redundancy.

Normalization is great. It gives you very powerful things, like simple writes: if you're going to rename your department, you have to execute only one query, which is cool. It also gives you data integrity: all the foreign-key checks, ON DELETE CASCADE, and the rest. So normalization is great, but everything comes with a price, and the price of normalization is slow reads, because you have to use joins, and joins are slow; on big data, joins are extremely slow and not always even possible. And complex queries. Here's a question I ask all the time; I believe there are people from yesterday's workshop who know the answer already, and I kindly ask them to stay quiet, but please tell me the answer in the YouTube chat: how many joins can you have in a single MySQL SELECT statement? So, SELECT blah-blah-blah FROM blah-blah-blah JOIN, JOIN, JOIN: how many joins in one single statement? Very close, not perfect; maybe someone thinks the limit is bigger, or smaller... hey Cedric, you know the answer, I asked you to give people a chance, but yes, "two digits" is correct. The answer is indeed 61. And here is my real point: how do you expect your SELECT statement to be quick if you're using a dozen joins? Okay, 61 joins is a rare case, though I've seen it in my life. And how do you expect your query to be quick over hundreds of servers? There is basically no way.
So that's the price of normalization. Let's move on to sample relational data modeling: you create entity tables, you have constraints, you have indexed fields, and you actively use foreign-key relationships. (Yes, VJ, 61 joins in a single statement; don't ask me why.) That's how it usually works: you have all the nice things, like CREATE INDEX, a UNIQUE constraint on email, and so on, and you have relations, because these are relational databases.

CQL, the Cassandra Query Language, is different. Joins don't scale, and we need to work with big data, so we cannot have any joins, and we can have only very limited aggregations: when you handle petabytes of data, you cannot have quick aggregations anymore; they also don't scale well.

Here's a question, Alex, from Vika: did that system work, the one with 61 joins? Yes, very slowly. And we know what we did to handle that, David: denormalization. True! Anyone who has ever done data warehousing in the relational world: what did you do to speed up queries? Denormalization, right? The whole point of denormalization is to speed up your reads. We'll talk about that a little later, actually.

Good. So, relational data modeling, how it usually works: first you analyze the raw data; you identify entities, their properties, and relations; you design tables, using normalization to deduplicate the data and remove redundancy; you use foreign keys to link the data and establish data integrity, ON DELETE CASCADE and so forth; then you use joins in your queries to bring the normalized data back together from multiple tables; and then, when read performance suffers, you apply denormalization as an inevitable evil. But what I want to say is: denormalization is not evil. Denormalization is fine as long as you control it. Yes, it's the dark side of the force, of course, but we aren't in a Star Wars movie: you can have both, and you should have both.

So, denormalization is a strategy used on a database to increase performance: in computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of the data. You see the table on the right: before, if we needed to retrieve the data of our engineers, we needed a SELECT with a JOIN, and joins are slow and don't scale too well. Now we have one table, employees, and if I want to retrieve the data of my Edgar Codd, I get his department as part of a single SELECT statement (see the CQL sketch below). Simple queries, remember: OLTP uses simple queries. The benefits: quick reads (read once) and simple queries that are easier to maintain. Okay, maybe not 61, but have you seen queries with 10 or 15 joins? It's really hard to maintain them, and simple queries are much easier to maintain.

But again, that comes with a cost; the dark side of the force has its requirements. First, multiple writes: your data is now denormalized, so you need to keep it under control, and you have to update all the copies of your data spread over multiple tables. That's on the application side, which basically means on the developer side; and you know what, I'm fine with that, as I'm getting paid for it, and Cassandra-enabled people usually get pretty decent digits, let's say, in their salary. And finally, manual integrity: we have to sacrifice data integrity, or better said, move control of data integrity from the database layer to the application layer.
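To make the contrast concrete, here is a minimal CQL sketch of the denormalized approach; the keyspace, table, and column names are hypothetical, not taken from the workshop slides.

-- The department name is stored redundantly on every employee row,
-- so no join is needed at read time.
CREATE TABLE company.employees_by_department (
  department   text,
  employee_id  uuid,
  first_name   text,
  last_name    text,
  PRIMARY KEY ((department), employee_id)
);

-- One simple single-table query replaces SELECT ... JOIN:
SELECT first_name, last_name
  FROM company.employees_by_department
 WHERE department = 'Engineering';

-- The price: renaming a department now means updating every employee row,
-- and the application layer is responsible for keeping the copies consistent.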
Okay. So applying denormalization leads us to combine table columns into a single view; it eliminates the need for joins, and queries become concise and easy to understand. There is still a price to pay, but denormalization in Apache Cassandra is absolutely critical, and the biggest takeaway is to think about the queries first; we will speak about that in the next steps. Again: there are no joins in Apache Cassandra.

Queries in relational versus NoSQL databases are explained very simply on this slide: in a relational database, one query can access and join data from multiple tables; in Apache Cassandra you need to query data in a different way, and sometimes it goes up to one table per query. That sounds awful, and indeed it brings complications, but when you handle billions of rows and petabytes of data, it's the only way to serve them within milliseconds.

So, what about modeling queries? We need to understand your application workflows. Knowing your queries in advance is critical, because your tables are dedicated to your queries. Everything in the Cassandra world revolves around queries, and that's different from relational databases: I cannot use a join, and while I can create an index to support a new query, indexes over distributed data are, again, painful and have downsides (we don't speak about indexes today). And yes, it can be up to one table per query.

When we speak about workflows, we use KillrVideo. KillrVideo is our demo application for showing how we work with Cassandra; it's kind of a YouTube, you could say: videos, users, comments. If you know YouTube, consider yourself as knowing KillrVideo. Speaking of the design of our tables, you need to understand the query, and what's important for us is the query, always the query. Queries are not random: queries are based on the workflow, on what you are doing in the application. Think about the applications you used today before the workshop: most probably you logged in somewhere, maybe you opened your profile and did some basic operations; every operation involves one or more queries. Surely you sent a message to a friend: "Hey, I'm going to attend a workshop about Cassandra, it's going to be cool, join me!"; that involves a query, or two, or more. And you know, denormalization requires multiple writes, but I'm not worried about that: I execute them simultaneously, so it doesn't take much time. Writes are very cheap and quick in Cassandra; Cassandra is ready to devour any amount of writes you execute.

So, speaking of use cases: from how you use your application, YouTube or KillrVideo, you get from basic user operations to the queries you are going to run. A user logs into the site: find user by email, pretty obvious. A user opens a video: we have to show the comments for the video (find comments by video, with a known video id), we have to show ratings for the video, and we have to show basic information about users (find user by id). Then a user wants to see his recent comments and reactions, likes, dislikes, and so on: find comments by user, latest first.
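As an example of how one of those workflows maps to a table, here is a sketch for "user logs into the site" becoming "find user by email". The real KillrVideo schema may differ in its details; the types and column names here are assumptions.

CREATE TABLE killrvideo.users_by_email (
  email       text,
  user_id     uuid,
  first_name  text,
  last_name   text,
  PRIMARY KEY ((email))
);

-- The login workflow is then one single-partition read:
SELECT user_id, first_name, last_name
  FROM killrvideo.users_by_email
 WHERE email = 'edgar@example.com';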
So NoSQL data modeling goes in the opposite direction. You don't start by analyzing your data (well, you do, but we will see very soon how that works); you have to think a lot about the application. Or, as I say: people say "think of the application", I say think about people, humans, first. It's not about software or hardware, it's about people. Analyze people's behavior: what are they going to do with your application, what will the particular workflows be? Log in, find a friend, send a message to the friend: that sounds good to me. Then, having identified those workflows, their dependencies and needs, you define queries to fulfill those workflows; knowing the queries, you design tables using denormalization; and then you use batches when inserting or updating denormalized data across multiple tables. We will get deeper into that, so don't worry.

Alex, before we move on, could you go back to the previous slide? Just something I want to point out; we're going to get to the fun one, the pop quiz, later on. For those of you who are familiar with this, especially coming from the relational world: as developers we're usually given the schema, we're given the ERD, and then we use that to generate our queries. This right here is a bit of a paradigm shift, because it's in our control. Instead of us, the developers coding these apps, having to reference somebody else's schema and figure out which tables to join and how to build a query, we are the ones who actually generate what the data model is going to be. It's a little bit of a paradigm shift, but it's really powerful and, at the same time, freeing. It seems like it would be the opposite, it doesn't seem like it would be freeing, but you'll find that this methodology and the way we go about data modeling in Cassandra, once you get the process down, is a lot easier to get going with, and there are a lot fewer constraints. But anyway, I'll let you go on, Alex.

Yep. I see I hid some information on this slide, sorry. James says he can't read the words on the bottom left; they say "use batch when inserting or updating denormalized data of multiple tables."

Cool. So, Cassandra data modeling: it all starts with a cell, then we go to a row, and then we finally get to a partition. Everyone understands cells, rows, and columns; the simple thing we have to remember is the partition. Data is partitioned, and the partition is the base unit of access. If you don't know your partitions, you aren't going to be successful with your queries, because every query should understand its partitions and how data is delivered. The data of a table is a group of columns and rows stored as partitions, and the overall structure is: a keyspace contains tables, and tables contain partitions with column data. An example with data on users organized by city is as simple as that: we have a keyspace, which contains multiple tables; we have a table, which contains rows and columns; partition key columns define the partitioning; clustering columns define the sort order within a partition; and data columns just store data. Partitions are therefore formed based on the partition key columns.

Creating a table in CQL is very easy; if you know SQL, the statement below must look very familiar.
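Here is a reconstruction of the CREATE TABLE statement described on the slide, written out from the spoken description (the exact formatting is ours):

CREATE TABLE killrvideo.users_by_city (
  city        text,
  last_name   text,
  first_name  text,
  address     text,
  email       text,
  PRIMARY KEY ((city), last_name, first_name, email)
);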
We create the table users_by_city within the keyspace killrvideo, with some column definitions: city, last name, first name, address, email, all of type text (don't worry, we have more data types, not only text), and we define a primary key based on the partition key and clustering columns. The primary key is the identifier of a row; it consists of at least one partition key and zero or more clustering columns. The partition key is what defines the partitions: based on the partition key value, data is partitioned, therefore split into multiple partitions. Clustering columns are needed, among other things, to ensure uniqueness within the primary key. For example, it would be bad if our primary key were just city, last name, and first name: can you imagine how many John Smiths we have in New York? I remember it's at least 15 of them, or something like that. And since Cassandra is a NoSQL database, it uses upserts (insert or update, depending on the situation), so basically every next John Smith would overwrite the data of the previous John Smith, which would make the first one unhappy. So we use email here to ensure uniqueness and make the primary key unique.

And sorry, I have to interrupt myself, but there is a great question: "How do I unlearn defining foreign keys?" Very easy: you go and do the practical steps on Katacoda with me, so you will see how we do it very soon. Again, I'm running through these steps quickly because we expect you to have some understanding of Cassandra; if you aren't getting enough out of this, watch the Introduction to Cassandra workshop, where we go through it slowly.

So, the rules of a good partition. Partitions are very important: if you don't get the partitioning right in your data model, you will inevitably run into trouble. You need to know how to use your tool to be successful with it.

First: store together what you retrieve together. In this example, when you open a video, you see comments for this video; on YouTube or on KillrVideo, it doesn't matter. It makes sense to partition your comments by video, so all the comments on the same video are stored together, and then you can retrieve them with a single, simple network operation, without looking for comments all over the cluster, which would be a full cluster scan and obviously very slow. So: store together what you retrieve together; in this case I retrieve comments for a video together, and therefore I use the video id as the partition key for those comments. A sketch of such a table follows below.

Second: avoid big partitions. You don't want your partitions to be too big. Remember why we do partitioning: at some point your data doesn't fit on a single server, and in general the idea stays the same; even with partitioning, you can still have too much data for a single server. There is only one hard limit: up to two billion cells per partition (cells meaning columns multiplied by rows, very simple; that's the only hard limit). The others are recommendations: not more than 100,000 rows in a partition, and not more than approximately 100 megabytes in a partition. What does that mean? Those are recommendations. For example, for my videos-and-comments table I'm not afraid of having partitions bigger than 100,000 rows. You know, on YouTube some videos get far more than 100,000 comments, but that's not a problem for me, because the data is small: basically a video id, a comment id, a comment author id,
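To make rule one concrete, here is a possible shape for that comments table. A sketch only; the column names and types are illustrative, not taken verbatim from the KillrVideo schema.

-- All comments on a video share one partition, so showing them is a
-- single-partition read, never a full cluster scan.
CREATE TABLE killrvideo.comments_by_video (
  video_id    uuid,
  comment_id  timeuuid,   -- time-based UUID: both unique and time-ordered
  user_id     uuid,
  comment     text,
  PRIMARY KEY ((video_id), comment_id)
) WITH CLUSTERING ORDER BY (comment_id DESC);   -- newest comments first

SELECT user_id, comment
  FROM killrvideo.comments_by_video
 WHERE video_id = 123e4567-e89b-12d3-a456-426614174000;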
a timestamp, and the comment text; some kilobytes at most. So it's no big deal for me to have more than 100,000 rows in such a partition. But there is something else to consider, and this next point is a very common mistake: constantly growing partitions. Say that in the beginning I partition the data for my IoT team: a lot of sensors reporting their state from all over the world, all the time, every 10 seconds. I design my table like this: partitioned by sensor, with a clustering column reported_at, so rows are sorted by time, and the data is very simple: sensor id, timestamp, and a value (temperature, humidity, whatever you want). Will this schema work? It will, for the first few months, and then at some point my partitions will be too big and it won't be so efficient after all. What I need here is bucketing, and bucketing is a very important idea. I basically introduce a synthetic column, month_year (an integer or a string, it doesn't matter), and my partition key is now composite: it consists of two parts, sensor_id and month_year. What is month_year? Very easy: right now it's March, so basically "03-2021", the month merged with the year 2021. As soon as March is over, this partition is closed; well, not really closed, you could still force some data into it, but the month changes, next month the value will be "04-2021", and therefore a new partition is created. So partitions never grow too big: as soon as the month is over, you automatically get a new partition. Very simple. (A CQL sketch of both versions of this table follows below.)

And finally: avoid hot partitions. What does that mean? If some partitions are big and some partitions are small, you will have uneven data distribution, and some of your servers will hold five terabytes of data while others hold five gigabytes. That's obviously bad, but it's not the biggest problem. The bigger problem is the hot partition: some partitions being accessed all the time while others are not, which leads to uneven load distribution. Imagine you have a data center of 10 servers and replication factor three: every partition is replicated three times. If one partition is accessed all the time while the others are idling, it means three of your servers will be overloaded while the other seven idle. And you know what? You can try to use the elastic linear scalability of Cassandra and add 90 more servers, upgrading from 10 to 100; that must give you a performance boost, right? Or maybe not. The problem is you still have replication factor three: your hot data lives on three servers, and even with 100 or 1,000 servers in the same data center, it doesn't matter, you will have three servers overloaded and the other 97 idling, just because you have a bad data model. So monitor your servers, monitor your partitions, avoid big partitions, and avoid hot partitions; that's what really, really matters.

Some people may say that my videos-and-comments table could also create hot partitions: some videos are commented on all the time, like our stream right now, and some videos, the unpopular ones, are barely commented on at all. So it happens that one video's partition in the comments table is hot and another's is not. That's not a problem, because I have a lot of videos, and they are evenly distributed over the cluster.
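As referenced above, here is a sketch of both versions of the sensor table; the keyspace name and column types are assumptions.

-- Version 1: partitions grow forever. Fine for a few months, then painful.
CREATE TABLE iot.sensor_readings (
  sensor_id    uuid,
  reported_at  timestamp,
  value        double,      -- temperature, humidity, whatever the sensor reports
  PRIMARY KEY ((sensor_id), reported_at)
) WITH CLUSTERING ORDER BY (reported_at DESC);

-- Version 2: a synthetic month_year bucket in the composite partition key
-- closes each partition automatically when the month rolls over.
CREATE TABLE iot.sensor_readings_by_month (
  sensor_id    uuid,
  month_year   text,        -- e.g. '03-2021'
  reported_at  timestamp,
  value        double,
  PRIMARY KEY ((sensor_id, month_year), reported_at)
) WITH CLUSTERING ORDER BY (reported_at DESC);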
And finally: avoid hot partitions. What does that mean? Back to big partitions for a second: if some partitions are big and some are small, you will have uneven data distribution, and some of your servers will hold five terabytes of data while others hold five gigabytes. That's obviously bad, but it's not the hugest problem. The hugest problem is a hot partition: some partitions being accessed all the time while others are not, which leads to uneven load distribution. Imagine you have a data center of 10 servers and a replication factor of three, so every partition is replicated three times, and one partition is being accessed all the time while the others are idling. It means three of your servers will be overloaded while the other seven idle. And you know what, you can try to use the elastic linear scalability of Cassandra and add 90 more servers, upgrading from 10 to 100. That must give you some performance boost, right? Or maybe not. The problem is you still have replication factor three: your data lives on three servers. Even if you have 100 or 1000 nodes in the same data center, it doesn't matter, you will have three servers overloaded and the other 97 idling, just because you had a bad data model. So monitor your servers, monitor your partitions, avoid big partitions and avoid hot partitions. That's what really, really matters.

Some people may say: your videos-and-comments example may also create hot partitions, because some videos are commented on all the time, like our stream right now, while unpopular videos aren't commented on much. So it happens that one video is being commented on and its comments partition is hot, while a second one is not. That's not a problem, because I have a lot of videos, and their partitions are evenly distributed over the cluster. This way my data is spread over the cluster and all servers are busy in more or less the same way, and when I throw in more servers, upgrading from 10 to 20, data will be reshuffled onto more servers, every server becomes responsible for a smaller number of partitions, and the performance pressure decreases. That's very important to understand about hot partitions: it's fine to have loaded partitions as long as they are evenly distributed over the cluster.

So, how does Cassandra organize data? We have here an example of a library keyspace, with replication using NetworkTopologyStrategy, the one you use in production. For data center west we will have replication factor three, and for data center east replication factor five. And we have a table here: create table library.venues_by_year, with year, name, country, and homepage fields, and a primary key of year, which is the partition key, and name. We are going to have two tables for this example: artifacts related to the events we run, and venues_by_year for those events. Let's take a look at some things based on the data we have. Notice that venues_by_year is organized with partitioning based on the year; therefore for every year I will have a single partition. So my data is spread into multiple partitions, and those partitions are replicated based on the replication settings I set for the keyspace: for my keyspace library I said dc-west: 3, dc-east: 5, and my table venues_by_year belongs to the keyspace library, so those settings apply. The data is stored three times in data center west with replication factor 3, one two three, one two three, and the same data five times in data center east, as those data centers work together in a single cluster.

With Cassandra Query Language we have two parts. It is separated into the data definition language, DDL, which is all about creating and manipulating the schema: create keyspace, create table, create index, create materialized view; and the data manipulation language, DML, which must be very familiar to you, with select, insert, update, delete. When we create tables we need to think about some very important points. The definition is: create table with whatever name, columns with types (and a column can possibly be static, I will explain that in a moment), with a primary key of at least one partition key and some optional clustering columns; and we can additionally establish a clustering order by, ascending or descending, because we want to store data sorted on an ascending or descending field. So the things we think about every time we create a table: table name, column names and types, possibly a static designation for columns; a partition key, which is a must; a clustering key, which is optional; and row ordering within the partition, if we need to set it up, with CLUSTERING ORDER BY. If you have experience with SQL, you understand most of that.
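Spelled out in CQL, the replication setup and the venues table just described look roughly like this. The data center names are placeholders that would have to match your real DC names:

    CREATE KEYSPACE library WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc-west': 3,   -- three replicas of every partition in the west DC
        'dc-east': 5    -- five replicas of the same data in the east DC
    };

    CREATE TABLE library.venues_by_year (
        year     int,
        name     text,
        country  text,
        homepage text,
        PRIMARY KEY ((year), name)   -- partition by year, cluster by name
    );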
But there is one question that always comes: what does static mean? The explanation is very simple. Do you remember my example with videos and comments? One person may write a lot of comments on the same video, that happens, and as we use denormalization, every comment record in the comments table will have video id, comment id, user id, but also user name, so we don't have to execute two select statements. All of those rows written by the same person repeat the same name, and in this case I may declare the column for the author of a comment static. It means the value of this column will be the same for all rows of the partition. In general, static helps you to have a single value per partition when you want one. Okay, the video example is maybe not the best here, we will work with it a little later, but the point is you can then easily change that value with one single statement.

Well, a partition can be single-row or multi-row. What's the idea? With single-row partitions I use a unique partition key, absolutely unique: every person in my users list will have a universally unique id (a UUID looks like that long string), and they will be unique, obviously, therefore I will have as many partitions as I have users. That's perfectly fine. But we may also have multi-row partitions, and a multi-row partition is very simple: here the partition key is venue and year. What does that mean? All the artifacts will be within the same partition as long as they share venue and year. So we have here DataStax Accelerate, year 2019, and a data modeling event, year 2019: those make the first partition and the second partition, and within one partition we can have multiple rows, because the artifacts in this table differ. And country, in this case, is indeed a static value: it's shared by all the rows of the partition. So if I update this value and replace it from USA to the United States of America, just a renaming, for whatever reason, I need to execute the statement once per partition, because all the rows of the partition share this value. Good.
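Here is a rough sketch of that multi-row partition with its static column; the column list approximates the example from the slides:

    CREATE TABLE library.artifacts_by_venue (
        venue    text,
        year     int,
        artifact text,
        title    text,
        country  text STATIC,   -- one value shared by all rows of a partition
        PRIMARY KEY ((venue, year), artifact)
    );

    -- renaming the country touches every row of the partition at once
    UPDATE library.artifacts_by_venue
       SET country = 'United States of America'
     WHERE venue = 'DataStax Accelerate' AND year = 2019;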
So, CQL select and the things we need to know about it. The idea of a select is simple: you want to retrieve some data. First you specify some selectors: select which columns, aggregation functions, including user-defined functions. Then from a table name, and we can have only one table per query, no joins. Then we may have primary key conditions, and that's important, and index conditions; and notice: restricted to primary key columns. Remember I told you Cassandra is not for every use case; it is not a do-anything platform serving your data in any shape you like. Do you remember, we have partition key, clustering columns, and data columns. What does that mean? You cannot use data columns in your WHERE clause. So in this case we have a primary key of venue, year, and artifact. We always have to specify the full partition key, and we can additionally select by artifact; but as long as title is not part of the primary key, I cannot use it for searching data. In this case title is a data column, and without secondary indexes I cannot search by it. You will see how to handle that very soon, but that's important. Search conditions are limited to primary key columns, partition and clustering; group by, only by primary key columns; order by, clustering key columns; limit. Okay, that must be clear.

And finally, ALLOW FILTERING, and there is a big red "don't do it" danger field. What is ALLOW FILTERING? Let's take a look. I want to retrieve data from this table, and I know I'm looking for artifacts, for the talks from DataStax Accelerate, which we ran in 2019. When I do my select statement, select everything I need from the table artifacts_by_venue where venue is 'DataStax Accelerate' and year is 2019, I get the whole partition, and I'm fine. Why? Because the driver of your application, the Cassandra driver you use to retrieve data from Cassandra, is an extremely smart thing. As soon as your application starts up and connects to Cassandra, it loads a whole lot of information about the data schema and the token allocation of your cluster. When you need to get some data, select with venue 'DataStax Accelerate' and year 2019, your driver executes the partition token calculation, it already knows the token allocation, and it knows exactly which node, which IP address, to ask. We don't go to some node and ask, do you maybe have any information about DataStax events in 2019? Or maybe you know, sir? With 100 servers that would take ages. Your driver knows which node to ask, and that's really cool: it's just one network operation. And every node is able to answer every question: if a node has the data, it answers; if it doesn't have the data, it works as a coordinator node and asks the replica nodes. No problem, you will get your answer.

An example I like to use in this particular case is the food delivery example. Imagine a city, which many of us live in, with millions of people. What's more efficient: to take your food delivery and go directly to the address where it needs to go, or to start at some random house and just start knocking on each door, is this your food, is this your food, is this your food? Essentially, like Alex was saying, if you are trying to query outside of your partition key using something like ALLOW FILTERING, you're doing the latter: you're knocking on doors until you find the data, scanning all of it. It is orders and orders of magnitude more efficient to go right to the address. So never, ever use ALLOW FILTERING. And I'm not going to shame anybody, it's okay if you didn't know this before, but if you're using it in your queries now, you seriously need to stop. Take a look at your app, take a look at what you're doing, and stop using it immediately. You can get lulled into thinking you're okay, because at small scale, with small amounts of data, ALLOW FILTERING will seem to do just what you think it should, but as the data scales up it's going to bring things to their knees and you're going to have problems. So that's my ALLOW FILTERING rant. Yep. Don't ever use it.

So, if you don't specify a partition, you make your food delivery guy knock on every door of the building, and if you have a huge building it will take ages. Cassandra has a defense mechanism for that: if you aren't specifying the partition key in your WHERE clause, you will have no answer, just an exception: can't do that, don't make me do it, it's stupid, I don't want to run it. If you insist, you can say ALLOW FILTERING, and it will lead to the query being executed, but it's very harmful. Don't run it in production.
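To put the rant into code, a hypothetical pair of queries against the same artifacts table: the first is the token-routed access just described, the second the door-knocking kind:

    -- full partition key: the driver hashes (venue, year) into a token and
    -- goes straight to a replica node; one network operation
    SELECT * FROM library.artifacts_by_venue
     WHERE venue = 'DataStax Accelerate' AND year = 2019;

    -- no partition key: rejected by Cassandra with an exception...
    SELECT * FROM library.artifacts_by_venue
     WHERE title = 'Data Modeling 101';

    -- ...unless you force it, which turns into a cluster-wide scan;
    -- never run this in production
    SELECT * FROM library.artifacts_by_venue
     WHERE title = 'Data Modeling 101' ALLOW FILTERING;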
So, by the way, Alex, before we continue on: a slew of questions have come in. I've tried my best to answer all of them; everyone's asking so many wonderful questions, and if we don't answer your question right away, don't worry, we're watching, and we'll come back to them. I just wanted to come back to a couple, if that's okay with you, because they hit on some of the things we've been talking about, and I'm trying to be respectful of time here. One from Dyson over on YouTube: why can't the comment id be used for sorting? It can; you can use comment id for sorting, but think about what a UUID is. In Cassandra, when we talk about ids (and by the way, there was a set of questions around ids in Cassandra), we generally don't use integers for ids, we use UUIDs. Why is this? Well, think about it: Cassandra is a distributed system. Say I have two writes coming in to insert the same user at the same time somehow, or even two different users, and I want to generate an id. Now, what happens if I sever my network connection and two nodes can't communicate? If I'm using an int, what's to stop me from getting two users who now both have id 5, whatever the next value was? Then the network comes back and they can talk. Uh-oh, what's going to happen? Last write wins: the last one just overwrites the one before it. So that's not a very protected way to use ids in a distributed system, and that's why Cassandra uses UUIDs. Those UUIDs are nice, very long values; the chance of actually getting a repeat on one is infinitesimally small. But the point, going back to Dyson's question, why can't comment id be used for sorting: you technically could, you can make it a clustering column, but how do you sort on a UUID? Sorting on a plain UUID isn't going to be very effective. If you were thinking about it as an int (I'm trying to read your mind a little bit, Dyson), you'd say, oh yeah, I could totally sort on that; but with a UUID it's a little different. Again, when you add it as a clustering column, it will naturally sort for those types of values that can be naturally sorted.

Let's see. James, I know you've been asking this question and I wasn't ignoring you, it just got buried: does an upsert happen only when all of the clustering columns are matched, or when the partition key is matched? The upsert is going to happen if you match your primary key exactly; that's the thing that defines a unique row, and it can include your partition key and your clustering columns if you have them. So if you have an exact match to a row, then yes, you will just upsert. Let's see, there were a couple of others; there were just so many, and again, wonderful questions.

Yep. I'm afraid we have to proceed, because they were wonderful questions, but we still have some topics to cover. That's fine, I got some of them out, and I will do the rest in chat as much as I can.
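Picking up Dyson's question in code: a TIMEUUID gives both uniqueness and a natural chronological sort. A small sketch, assuming the comments table that gets defined later in the session; the column names are approximations:

    -- now() generates a TIMEUUID on the server: unique *and* time-ordered;
    -- uuid() here is just a placeholder for real video and user ids
    INSERT INTO killrvideo.comments_by_video (video_id, comment_id, user_id, comment)
    VALUES (uuid(), now(), uuid(), 'great workshop!');

    -- the embedded timestamp can be pulled back out for display
    -- (:video_id is a bind variable supplied by the application)
    SELECT toTimestamp(comment_id) AS created_at, comment
      FROM killrvideo.comments_by_video
     WHERE video_id = :video_id;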
Thank you. And just one thing: VJ asks, so when should you use ALLOW FILTERING? In short: never. Moreover, don't allow your colleagues to use it; include it in your code style quality checks, include it in your continuous integration. That's the thing: if you don't have code review, if you don't have quality checks covering things like ALLOW FILTERING, then one single junior software developer on your team, who doesn't know Cassandra and just wants things working, executes a select, sees that it doesn't work, goes to Stack Overflow, Stack Overflow gives the bad advice of ALLOW FILTERING, he puts ALLOW FILTERING into the select statement, the statement works, he is happy, the ticket is closed. Cool, except things just got much worse, because if this code goes to production you are deep in trouble and your cluster's productivity may be heavily affected. So you have to hunt down and exclude any ALLOW FILTERING you find in your project, literally by checking your code.

So let's move on, we have more things to discuss: sample CQL statements. We partially discussed this table already. Select all from artifacts_by_venue where venue equals something and year equals something: this will work, because we specify the full partition key. Then venue equals something and year equals something and artifact equals something: it works, because we specify a partition key and a clustering column. And finally venue equals something and year equals something and artifact greater than X and artifact less than Y: still fine, because we have partition key and clustering column conditions.

But let's take a look at some things that won't work. Select from artifacts_by_venue where venue equals something: what's wrong with this statement? Remember, to avoid a full cluster scan we have to calculate the partition token, and the partition key here consists of two parts, venue and year, but we give only venue. It won't work; we cannot calculate the hash based on only one part of the key. So yes, James, very right: only one key part, that's what's wrong. Where venue equals something and artifact equals something: will this work? Nope, we don't specify year, which is part of the partition key, which brings us basically to the same situation as before. Where artifact is greater than X and artifact is less than Y: we don't specify the partition key at all, that won't work. Where venue equals something and year equals something and title equals something: what's wrong here? We have venue and year, so we have the full partition key and the token can be calculated, but I cannot search by title, because title is a data column. And finally, where country equals something: again, we just don't specify the full partition key.

So, very important implications for data modeling, which you have to remember at all times when modeling for Cassandra. Primary keys define data uniqueness. Partition keys define data distribution over the cluster, and partition keys affect partition sizes. Clustering keys define row ordering within the partition. In queries: primary keys define how data is retrieved; partition keys allow only equality predicates, and that's important. I cannot ask for year greater than 2019 or year less than 2018: those are inequalities, and partition keys support only equality predicates. You can have IN, where year IN (2018, 2019, 2020), because that is still equality, just over a set of values, so it will work. Clustering keys allow inequality predicates and ordering. And: only one table per query, no joins.
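As a quick reference, those rules against the same artifacts table, with the failing shapes marked; a sketch, using question marks as bind-variable placeholders:

    -- OK: full partition key, equality only
    SELECT * FROM artifacts_by_venue WHERE venue = ? AND year = ?;
    -- OK: IN is still an equality predicate on a partition key column
    SELECT * FROM artifacts_by_venue WHERE venue = ? AND year IN (2018, 2019, 2020);
    -- OK: inequality on a clustering column, after the full partition key
    SELECT * FROM artifacts_by_venue
     WHERE venue = ? AND year = ? AND artifact >= 'a' AND artifact < 'n';
    -- INVALID: partial partition key; the token cannot be computed
    SELECT * FROM artifacts_by_venue WHERE venue = ?;
    -- INVALID: range predicate on a partition key column
    SELECT * FROM artifacts_by_venue WHERE venue = ? AND year > 2018;
    -- INVALID: predicate on a data column without a secondary index
    SELECT * FROM artifacts_by_venue WHERE venue = ? AND year = ? AND title = ?;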
And finally we come to the main topic of today's workshop: the Cassandra data modeling methodology. Sadly I have no slide for this, but I want to mention that the data modeling methodology was developed by our colleague Dr. Artem Chebotko, and it's a big honor for us to have him on our team. So let's get to the process: how exactly do you design your tables? In general the idea of data modeling is pretty simple. We have to collect and analyze data requirements. We have to identify the participating entities and their relations; it's a non-relational database, but we still have to understand relations. We have to identify data access patterns; that sounds very scientific, but it gets simpler in a moment. Then comes a particular way of organizing and structuring the data, designing and specifying the database schema, and finally schema optimization and data indexing techniques.

If you are going to take any screenshots or photos of your screen, this is exactly the moment to do it, because this is the main slide of the workshop. The Cassandra data modeling methodology is basically very simple: understand the data, identify the access patterns, apply the query-first approach, and optimize and implement the physical data model. We understand the data with a conceptual data model, and that means entity-relationship diagrams; we have to understand our data, because if we don't, there is a very low chance we can produce something good. We also have to identify access patterns, or, more simply, use cases, workflows: how exactly people are going to use our application, and that goes with an application workflow diagram. Then we map them together, as the first transition, and as a result we get the logical data model, which we describe using a Chebotko diagram. Then we apply the second transition: we optimize the logical data model and get the physical data model, again a Chebotko diagram, but on the physical level and with CQL already, so statements to execute. That's the whole designing process step by step: conceptual data model, application workflow, mapping, getting logical, optimizations, getting physical.

For our example I will use YouTube. Very easy, everyone knows YouTube; you're most probably watching us live on YouTube right now, or maybe not live, but still. And I can also say thank you so much for the warm comments you are leaving. Corva, thank you. Guys, see you next Wednesday. That's always exciting, it keeps us focused, keeps us happy, and helps us produce more and more content. So we will work with videos, and specifically with video comments.

The first thing we need to understand, again, is the conceptual data model, so we use an entity-relationship diagram; that's what's on your screen right now. The data we are going to use is user, video, and comment. A video can be commented on by a user, and a user leaves comments on videos: a many-to-many relation, because one user can comment on many videos, one user can leave many comments on the same video, and one video can have comments from multiple users. Every user has some properties: email, id, some others we will skip for now. Every video has an id, title, description, and others. And a comment has relations: to a user (one comment has only one author) and to a video (one comment relates to only one video). That's simple. And it obviously has a timestamp and text. Simple as that. I believe this idea must be very easy in general, it's just a user writing a comment on a video, so I don't want to stop here for too long.

Second step. What's our second step? Application workflow. We need to understand the application workflows, and here comes something interesting. Some of you may now be thinking: okay, comments, videos, that's pretty clear, why use it as an example, it's too simple. If you are thinking it's too simple to be an example for educational purposes, let's see what you say after these slides. We have to think about the use cases; we need to understand the application workflow.
And we have two use cases here. The first use case (it's actually second on this slide, but it's the primary one we want to cover): someone opens a video page, like the page you have open right now as you listen to me, and therefore is going to get some comments, the comments for this video. So that's the use case: user opens a video page. And let's think of the workflow: we need to find the comments related to the target video, using its identifier, most recent first. Sounds simple so far.

But that's not the only use case. We have a second one, which is first on this slide, sorry: user opens a profile. Which user, which profile? First: there may be spam reports on some particular user, so as a user story: as a moderator, I want to see all the recent comments of a suspected user to decide whether to delete his comments or not, to ban him or not. That's the first case. There is one more: I can open my own profile to watch my own recent comments and see their likes, dislikes, answers, and so on. You see, those are basically the same. Let's take a look: find the comments related to the target user, using its identifier, most recent first. It applies both to me as a user watching my own comments and to an administrator or moderator watching my comments. So we have two workflows here, merged into one, because they are technically the same.

So: we discussed the conceptual data model (user, comment, video), and we discussed the application workflow (I want to open a video page and see its comments; I want to open my profile and see my comments). Now we map them together. In the mapping process I start to think of the queries; queries are everything here. Which particular query will I execute to get what I need? Find comments posted by a user with a known id, most recent first. And second: find comments for a video with a known id, most recent first. That brings me to the idea of comments_by_user and comments_by_video.

If you are too deep in the relational world (and of course you are, that's why you are at this workshop, and you are great, thank you so much), you may think: why comments_by_user and comments_by_video, why can't it be just comments? The answer is: I basically cannot execute both of those queries against the same table. Why? Let's take a look. For comments by user, I know the user id but not the video id, because I want to get all the comments of a user, as a moderator or in my own profile; I know the user id, I don't know the videos. And in the second case, as a YouTube user I open a video and want to see all its comments: I know the video id, because I selected it from a list, but I don't know the user ids of the people who wrote the comments. So for comments_by_video, the video id is the partition key; I need the comments already grouped by video. And for comments_by_user I need the comments distributed and partitioned based on the user id. And that brings up the question: one entity, comment, but two tables? Denormalization, no joins. The downside, as always with denormalization, is data duplication: if a user writes a comment, we have to write it into two different tables, and if a user edits his comment (if that's allowed in your system) you have to make two updates. With batches it's easy, and we will discuss batches a little later. So the idea is very simple: we need to have two tables.
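In query form, the two access patterns that force the two tables look roughly like this:

    -- Q1: comments by a known user, most recent first; drives comments_by_user
    SELECT * FROM comments_by_user  WHERE user_id  = ?;
    -- Q2: comments on a known video, most recent first; drives comments_by_video
    SELECT * FROM comments_by_video WHERE video_id = ?;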
The positive point: in both queries, in both use cases, we will need no joins, and even with thousands of comments each query can be executed very quickly, asking the one particular server responsible for this data. Good. And I see the question about the arrows was answered already: up and down is the sorting order of the column values.

So that's our logical data model: comments_by_user and comments_by_video. In the final step we get the physical data model, and going from logical to physical we apply optimizations to make our tables simpler. Take a look at the logical data model for the table comments_by_video: which parts of the primary key do I need? Video id will be my partition key, that's pretty clear, to make the partitions based on the video. Then I need the creation date. Why? To have sorting based on time. But that's not enough: imagine I didn't have comment id here, only video id and creation date. That may lead to a problem, because if two users write a comment at exactly the same time, the second one will overwrite the first. The primary key must ensure uniqueness, so we have to add comment id to the primary key, because ids will be unique. The same goes for comments_by_user; we just have user id instead of video id.

And now we apply a very nice optimization called TIMEUUID. The trick is that Cassandra supports TIMEUUID, which is at the same time a universally unique identifier (that long string, unique in the universe) and a timestamp. TIMEUUIDs are sortable by time, which is very convenient, and you can extract the date from one. So you don't have to keep two values, comment id and created_at: your comment id already has created_at inside, and it's sortable. That's really cool. As a result, this is how our table looks; that's our physical data model in a Chebotko diagram. And finally, here is our data definition language: create table if not exists comments_by_user, with user_id uuid, comment_id timeuuid, video_id uuid, comment text, primary key of user_id and comment_id, with clustering order by comment_id descending, so I get most recent first. And basically the same goes for comments_by_video. That's how it works, and the full DDL is written out below.

And that's the whole process. Let's walk through all the steps once again. First you understand your data and create an entity-relationship diagram, and you identify the access patterns and create an application workflow diagram. Then you map them together, applying the query-first approach, to get your logical data model with a Chebotko diagram. Then you optimize it, implementing the optimizations to get the physical data model, and as a result you have a physical-level Chebotko diagram and CQL. That's how it works, step by step, and that's what you do designing your data model.
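Written out in full, the DDL read aloud a moment ago; the killrvideo keyspace name is an assumption:

    CREATE TABLE IF NOT EXISTS killrvideo.comments_by_user (
        user_id    uuid,
        comment_id timeuuid,   -- unique *and* carries the creation time
        video_id   uuid,
        comment    text,
        PRIMARY KEY ((user_id), comment_id)
    ) WITH CLUSTERING ORDER BY (comment_id DESC);   -- most recent first

    CREATE TABLE IF NOT EXISTS killrvideo.comments_by_video (
        video_id   uuid,
        comment_id timeuuid,
        user_id    uuid,
        comment    text,
        PRIMARY KEY ((video_id), comment_id)
    ) WITH CLUSTERING ORDER BY (comment_id DESC);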
And now, you know, we are getting close to the end of our time, so we will do two things: something you will do on your own, and the quiz. So, homework. Don't forget: to get your workshop-complete achievement badge you need to do the homework, and the first thing will be order management data. I'll send you a link in a moment; well, the link is on the screen, but I want to copy-paste it into the chat. It's at datastax.com/learn, data modeling by example, order management. It's a great example about tracking orders and their statuses and doing updates. Actually, I believe I have it right here, so yes, I will send the link to you right now. That's part of your homework, but please don't do it right now, because we are running the quiz in a moment. Oops, sorry, that's the link to the whole topic and I want order management specifically... there it is, order management, very well. And David, could you please (yes, Nikita, thank you) copy-paste it to Discord?

So that's the first part of the homework, order management data, which you will do on your own. It's a very simple idea. The diagram looks big, but it's actually very simple, just don't get too deep into the details at first. Yes, you will implement it, and yes, with Katacoda and our support it's really easy, even if it doesn't look that way. We are going to have users; shopping carts containing items; an order based on the shopping cart, which has an order status history and delivery options; and some addresses we are going to use for delivery. It looks daunting, but don't be afraid: it's all prepared, and you just have to follow our steps and think about why it happens exactly this way. We already did the analysis of the data access patterns and the entity-relationship diagrams; think about how the logical data model would look (maybe your idea will differ a bit from our point of view), how the physical data model would look, and how you finally apply it.

The idea of the link we sent to the YouTube chat is... actually, let me move it a little and show it to you; I hope you can see my screen. The page explains the very things we are doing and why it works this way: we describe the application workflow, we explain the logical data model, and when it's all done there is one very important thing: put your hands on it. Don't stay theoretical. You push the button "start scenario", and your scenario starts: it uses the Katacoda environment integrated into our datastax.com website, with a real Cassandra running there. Then it's very simple: you execute the statements from the left side, read through what we are doing, make your experiments, then create some of your own tables and see how it works. I can just push the button and get my results here: keyspace created, use this keyspace, create table. But what you have to do is walk through it, read it, see how it works, and try to change something; that would be the best.

And then, when all the tables are defined and you have played a little with all those statements, here comes the real homework: now you have to design the queries. To retrieve the data we have to select something: find all orders placed by Joe, find all information about an order. But we also have to design updates: cancel an order, and since we have multiple tables, how do we cancel an order? You can always use the solution button, which explains which particular operation to do, but I ask you to try to do it with as few peeks at the solutions as possible, because that's what upgrades you as a developer for real.

Good. Let me move back to my slides for the last few things you may need to know before we proceed to the quiz. The update statement will look very familiar to you; it's simple, but there is one cool thing that may help you a lot in your operations, called lightweight transactions.
We explain them in more depth in our other workshops, but for now you need to know that they exist. Lightweight transactions let you attach IF conditions to your updates, inserts, and other write statements. For example, IF NOT EXISTS: usually Cassandra does not read before a write; if you store something, you get an upsert. But if you want to avoid the upsert, if you don't want to insert data when a row with this primary key already exists, you can use the IF NOT EXISTS lightweight transaction. Cassandra will execute a read before the write, and you will be notified if the statement wasn't applied.

The second statement we want to mention (they aren't really part of this workshop, but you need to know about them to design these applications successfully) is the batch statement. Batches help you execute multiple statements to different tables updating the same data: multiple statements within a single batch. Cassandra batches do not have rollback. Once again: there is no commit/rollback; there is BEGIN BATCH ... APPLY BATCH, but no rollback mechanism. Why? Because transactions like that are slow, and distributed transactions are extremely slow. So batches don't support rollback, but there are very few to zero reasons for Cassandra to fail to apply the statements: if your CQL is not valid, the batch will not be accepted, and if your CQL is valid, there are no data integrity checks or operations like that; it just takes the data and puts it into the tables. Basically, I've never seen a batch fail, and if it fails, it fails altogether.
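A quick sketch of both statements. The users table here is hypothetical, and the batch reuses the comments tables from earlier; note that the same client-generated timeuuid is bound to both inserts so the two tables stay consistent:

    -- lightweight transaction: Cassandra reads before writing and refuses to
    -- overwrite an existing row; the response tells you whether it was applied
    INSERT INTO users (user_id, email)
    VALUES (uuid(), 'john.smith@example.com')
    IF NOT EXISTS;

    -- batch: one logical write keeps both denormalized tables in sync;
    -- there is no rollback, a valid batch is applied as a whole
    BEGIN BATCH
        INSERT INTO comments_by_user (user_id, comment_id, video_id, comment)
        VALUES (:uid, :cid, :vid, :txt);
        INSERT INTO comments_by_video (video_id, comment_id, user_id, comment)
        VALUES (:vid, :cid, :uid, :txt);
    APPLY BATCH;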
Good, and basically we are ready to go to the quiz. I know we have many more questions than we can answer right now, but I see Cedric and David are doing a great job, and most of the questions are answered; if we were not able to answer yours, please find us on Discord or at community.datastax.com. And now I'm about to switch to the quiz. I see we have almost 100 viewers on YouTube, and I want everyone to be able to participate and answer questions. If you joined us too late to answer the first questions, that's fine, because the quiz is just starting. Please go to menti.com and use the code 57793515, or scan this QR code, and you will be able to participate. Now, speed matters, speed is the key, so you have to answer quickly enough, otherwise you cannot be on top of the competition; but it's important to give correct answers as well, because we accept only right answers. I see almost 60 people have joined already, so I'll give five more people time to join and we start; we are almost out of time for today. 61... I see 62... okay, three more... two more, two left... 65 and 66. Okay, let's start.

Quiz time. Answer fast to get more points. A keyspace: what is a keyspace? Is organized into rows and columns; contains tables and sets replication; is the base unit of access; is a place to store extra house keys. (Sorry, James, no music today; I will set it up for tomorrow, I swear. Thanks, David, that was good.) So, the correct answer: a keyspace contains tables and sets replication. It's not the base unit of access (that's the partition), and it's not organized into rows and columns; it contains tables and sets replication. Let me see the leaderboard. Who was the fastest? A lot of correct answers here, and Bill gave the fastest answer, good result, and we see a lot of people with very good results, just some minor point differences. That was the first question, ladies and gentlemen, so let's see what happens next.

Question two. Answer fast to get more points. What is a partition key? I hope this one is harder. An optional column to allow GROUP BY; a column to define partitions, required; a required column to set sorting order; the key to all the doors. And time is up. Most of you answered correctly: a column to define partitions, required, that's the correct answer. Let's take a look at the leaderboard. Did any of the leaders make a mistake? Okay, no mistakes, but now we have some difference in performance: Not Alex is the fastest. Hey, that's definitely not Alex; you see my arms here, I'm not typing first, that's not me by any chance. It's Mem, with Casper in second place and Powell in third place, a very good result. But a lot can still change, we have six more questions.

So let's take a look, question three. Answer fast to get more points. A table can have many rows per partition: true; false; it requires a special cassandra.yaml configuration; it's illegal. That must be illegal, David, what do you think? I don't know, I think you could probably get arrested for that. Yes, you can have many rows per partition; just don't go too far, because too-big partitions or too-hot partitions are bad for your cluster. And what's happening on the leaderboard? Oh, Powell was a little slower this time, so I guess there will be some changes. And Not Alex is still the fastest, hey, what's happening; and Casper too holds on to second place and stays in the top three, congratulations, that's a really great result; and Powell is still in the top 10, so a lot may change. And I see Dyson; Dyson's been answering, or asking, all sorts of questions, and I always like recognizing the names there.

Question 4 of 8, we are getting to the middle. Answer fast to get more points. In Cassandra tables, which are required: data columns, clustering columns, or partition keys? Hey David, do you like requirements? Oh, I love requirements. I'll give you some time to think, because for beginners this may be a complex question, but actually only one of these is really required. The correct answer: a partition key is required. You cannot create a Cassandra table without specifying the partition key. Clustering columns are optional; they are important, and they may be required depending on your design, but only partition keys are truly required, and definitely not data columns. Going on here... okay, hey, who is that? Not Alex, what's going on!
Kasper and Vaz are holding the second and third places. Great job, everyone. So, question five, let's move on. Answer fast to get more points. Inequality predicates are allowed on: all table columns; partition key columns; clustering key columns; no inequality predicates are allowed. Time is up. Clustering key columns. You cannot have inequalities on partition key columns; you have no predicates on data columns at all, so "all table columns" doesn't fit; and "no inequality predicates are allowed" is wrong, because clustering keys support inequalities. Any mistakes on the leaderboard? Casper and Vaz are still here, there are some changes in the second part of the board, and Dyson gets to fourth place, which is a really good result; and there are some minor differences in points, like 50 points, which is nearly nothing. Daison, Mufasa, JavaDev definitely have chances to get up there, so keep fighting, we have more questions. Yeah, and I have it on good knowledge that the person there on the top might be cheating.

So, question number six. Answer fast to get more points. In the data modeling methodology, we start modeling: with the physical data model; with the logical data model; with the conceptual data model and application workflow; with copy-paste from Stack Overflow. Well, okay, that's what I usually do: copy-paste from Stack Overflow, just add some ALLOW FILTERING and it will work, you know. No, Alex, no, you don't say that live! Yeah, that's brutal. So: in the data modeling methodology we start modeling with the conceptual data model and the application workflow, meaning entity-relationship diagrams and the application workflow diagram; that's where we begin. Most of you answered correctly, which means we did our job well. And it looks like some people are gone; that completely changes the board. Dyson, that was a true jump to the top, first place; Soma gets to second place with the fastest answer, which was enough to bring him or her there; and Mufasa_is_dead (oh, such a name, such a terrible name, I'm almost crying) is in third place; Skeets is ready to fight for the top three, and Jay Barnes is also very, very close. Good.

Let's take a look, we have two questions left, and question seven is happening right now. Answer fast to get more points. The primary key defines row uniqueness: true or false? Or: the primary key is deprecated (it should have been for ages already). People are answering... and time is up. The correct answer is yes, the primary key indeed defines row uniqueness, and that's the main duty of a primary key. The primary key may also define the sorting order, via the clustering columns; it can, but that part is optional. The really important duty of the primary key is to define row uniqueness. And it's not deprecated; that isn't going to happen anytime soon. So, leaderboard: who was the fastest, and what are the jumps? JavaDev was fastest, getting to place number five; Mufasa_is_dead is in first place with Dyson second, a 10-point difference out of 6290, that's a great result; and Mr. Burns is third, congratulations, very good. Savaras and JavaDev, keep fighting, you are close; and Soma, you have chances, you do have chances, so don't give up.

That's the last question, question number eight. Answer fast to get more points; it's your last chance to change things on the leaderboard. How does Cassandra perform joins? Cassandra joins require a join table; exactly like SQL joins; Cassandra does not support joins; Cassandra only joins clubs. I would say Cassandra joins developers from all over the world.
Take a look: people from all over the world are here. Ah, Josh, yeah, thank you. George Barnes, I remember you from the last workshop, I believe. Yes, great: Cassandra does not support joins. And I see some people answered "exactly like SQL joins", which scares me, because for half of the workshop we were saying that Cassandra does not perform joins; I believe it was fatigue. Okay. So, thank you so much, I hope you enjoyed this workshop, I think we are done for today, thank you for answering our questions, thank you David for joining me, I think it's a good time to stop the stream and I'm going to enjoy my dinner, the work is done... what, I forgot something? Okay, okay, okay, if you insist: the leaderboard!

And yes, so tight, point to point, with such a minor difference. Dyson was the fastest; Dyson did it, awesome. Oh, look at that, Jay Barnes got up there; yes, Jay; George Barnes made it to the top three, congratulations, well done. Dyson is in first place with 7245 points: a great result, very good and quick answers, and no answer failed, eight correct answers; some of them were not so fast, but definitely all of them correct. Mufasa_is_dead is in second place with a difference of 10 points; that was around one microsecond of difference between Dyson and Mufasa. And Josh Barnes is in third place. Those top three are the leaders of today's quiz, and you do deserve some prizes from DataStax. As Jack Fryer wrote in the chat: Dyson, Mufasa, and George Barnes, please contact Jack Fryer at jack.fryer at datastax.com to claim the prize, and remember to send a screenshot of your Menti screen, please. Yep, and I remember Josh indeed participated yesterday and ended in second place, so welcome to the top three; it must feel good.

Now, if you didn't land in the top three: if you're in the top ten, that's a great result; there were many dozens of people competing with you, and if you see your name on the list at all, you did great. And if you are not in the top 10, you know what prizes we send (there are some nice t-shirts and cups and whatnot, ask Jack), but that's not the main point: you can still come to the next workshop and earn your prize. What really, really matters is the knowledge you acquired today and the skill you will work on and polish during your homework. As a result, it gives you an incredible chance to be a better engineer, a better developer or administrator or whatever you are, to be better in the field you're working in, get a better workplace, a higher salary, enjoy more of what you are doing, and share your knowledge, as we are doing right now. It's really great to see you here; we enjoy this workshop as much as you do, hopefully. And the homework, by the way: the homework is designed to really gel in all the things we talked about today, so, separate from the badges you can get, we really encourage you to go do it; it doesn't take that long, and it's going to help solidify all your knowledge. Yep. And regarding VJ's question about yesterday's workshop: that was a different workshop with a different topic, so it isn't really required for this one. So, how did you like this workshop? You spent over two hours with us today, so I hope you liked it a lot, and it looks like you did. Thank you so much for the feedback; we are happy to answer your questions.
And by the way, if you are happy with this workshop: do you see this gentleman to the right of me? His name is David Gilardi, developer advocate at DataStax. Directly under him you have the like and subscribe buttons; that's a very good moment to hit them, both like and subscribe, so YouTube will notify you about our next stream. And then, an open question: what's your feedback? What did you like the most, what was not so exciting, what would you change, which topics should we cover in the future? It's an open question and you can write multiple messages, here at menti.com, 5779 3515.

"Show top 20 next time": oh, I'm sorry, we cannot do it; it's Menti, a third-party application, and we simply have no button for that. "Keep up the excellent work": thank you. "Some topics were almost the same as yesterday": yesterday was more about the Cassandra basics and today more about data modeling, but I can agree, and there are some things for me to change in this workshop. "Add music": yeah. And on that note: for some of the workshops, like advanced data modeling, we don't go into any of the Cassandra basics, we just get right into it, because it's a different workshop; for something like this one we present some of the basic, standard fundamentals for people who are brand new. So it just depends on the workshop, but maybe some stuff, like Alex said, for us to adjust in the future. "This workshop was superb, more soon": we are doing workshops every week, sometimes multiple times per week, so thank you so much, and yes, more workshops are coming soon; there is a link in the YouTube chat on how to subscribe to the next events. "David was too quick to answer most questions, my database is slower than him": that's incredible; that's why I was so quiet, actually. You guys were asking such wonderful questions, some of them really solid, and I was trying to keep up; my fingers are bleeding over here.

Great. Now, oops, let me get it back: the workshop is not done yet, we need to do a few more things, speaking about the homework. Let me switch a bit... where was it... okay, just a moment, here we go. And by the way, while you're bringing that up, Alex, I just want to point out: everyone, please take note of the links that Jack is dropping in; those are really important for you, because those are free giveaways, separate from the swag. Somebody asked about Cassandra certification earlier: you can totally get a voucher for two free tries at your Cassandra certification; those are 145 bucks a piece, and we're giving them to you for free. Or there's a $300 Astra credit in there: even though Astra rolls over with 25 bucks a month (which is good for small production workloads, completely free, something you get just for using it), you can get a $300 credit for doing nothing more than creating a database in Astra, which takes a couple of minutes. That is good enough to run an actual production workload for many, many months. So it's your choice; make sure you grab that form from Jack so you can get free money from us.
Exactly. So, oops, wrong page. Oh, James is asking: what do we get for completing the homework, Alex? Yeah, so first of all, don't forget about the voucher, don't forget about the certification. Exactly as David said, you can get a free course on Cassandra at academy.datastax.com, and with the voucher you can take the certification for free. Well, it's not free, but it's sponsored: DataStax covers it and you don't have to pay anything. Learn and pass the certification completely for free; it's incredible, and it can be your first certification even if you are a student, so you can finish your education already being a certified Cassandra developer or administrator. Basically, at academy.datastax.com we have all the courses related to Cassandra: the things we do in the workshops, but much bigger, because Cassandra is not such a simple topic and here we mostly cover the basics.

Now, the next point: yes, grab the voucher. For the homework: the workshop is not over, you have to do the homework. The first thing is Cassandra data modeling. Give me a moment... wasn't it here... nope... oh yes, that's it, I found it. That's the first link; absolutely do this one. It's prepared by our colleague Steve Holiday; thanks to him a lot, he put a big effort into it. So the first thing is this: Cassandra data modeling, the basics. The second thing is the one I showed you briefly. Where is it... data modeling by example... I had the link, I've just lost it... sorry for the delay... here we go, data modeling by example, that should be this one. Great. In data modeling by example you can choose whichever you like the most; there are seven examples, and we are working to add more. If you're more interested in the shopping cart, like e-commerce, we have shopping cart data modeling; if you're more into messaging, we have messaging data modeling; a particularly interesting example, simple but very important to understand, is sensor data modeling, and the more advanced one is time series data modeling. And finally, the one I was talking about: order management data modeling. I'm inserting the link here; oops, I was about to copy-paste the link... yes, copying the address... good.

Don't forget to do the practical part: when you finish the theory there will be a button, start scenario, and here comes the practical magic, where you do the statements. Please share the links; I'm sharing them in the YouTube chat, there are two links already, that must be enough. As a result, you simply need to submit the screenshots of the completed scenarios. There is a nice, fancy page saying "scenario completed": you start the scenario, do the steps, and you get the scenario completed. So basically you need to submit at least two completed scenarios. How? Very easy: add me on LinkedIn (again, the link is in the YouTube chat) and send them as a message. We are going to establish a platform, a form you can use; we are working on it, it's not delivered yet, so for now please just catch me on LinkedIn and it will work. Okay. If you have any questions, find us on Discord: bit.ly/cassandra-workshop. And basically that's what we have for today: homework, Cassandra data modeling, plus
order management data modeling, or sensor data modeling, or maybe you would like to do even more, that's great, and developer resources. I'm sorry, Alex, real quick: James is asking (I thought we created a git issue, because we did that in the past) are we saying LinkedIn is the new way, or git issues? You see, a git issue is great, but the point is we don't have a repo for this workshop, because you do all the practical steps elsewhere. Or do you think we need to create a repo? It would basically be empty, just the slides, I would say. Yeah, so to answer that, James: you're correct, for other workshops where we do provide the git repo, yes, you can open an issue, but for this one, why don't we stick to what Alex just stated and send it to him over LinkedIn. Yep, so for this one you use LinkedIn, and we tag you on LinkedIn and promote you: workshop achievement unlocked, your Cassandra upgrade complete for the data modeling workshop.

Good. So: materials and hands-on learning at datastax.com; ask and share at community.datastax.com; follow us on YouTube, Twitter, Twitch, and so on; more materials are coming to all our places, and Discord is very important to us. The main place for learning is datastax.com/dev or academy.datastax.com; they are different. Yeah, and LinkedIn is the easiest way to contact him. Then thank you, we are done for today. There we go, and I'm just popping some of the links down here for folks in Discord. All right, that's it, thank you.

Great. So thank you, so many questions; it looks like you kept David busy all the time. All I would hear was David clicking, clicking, clicking; your fingers must be bleeding. Yeah, I'm not going to show my hands this time, because you don't want to see it, it's too gruesome, but I'll have to get that fixed. Great. Okay, and again, don't forget to ask questions on Discord, and see you again soon. Don't forget we run every workshop two times, on Wednesday and on Thursday, for different time zones, and we will be happy to see you at the next one. That was David Gilardi from the United States and Alex Volochnev from Europe. Thank you so much for coming, see you next time. Thank you, everyone. See ya. Bye.
Info
Channel: DataStax Developers
Views: 1,406
Rating: 5 out of 5
Id: fcohNYJ1FAI
Length: 145min 10sec (8710 seconds)
Published: Wed Mar 24 2021