Introduction to Apache Cassandra!

Captions
Hey, welcome everyone! Hi Stefano, looks like we are live. Yes, looks like we're live; hi Alex, hello everyone, welcome! It's going to be an amazing evening together, at least for me and Stefano. Well, it will be evening because he is located in Italy and I'm located in Germany, so it's already 6 p.m. at our place, and really dark already. But most of our attendees are usually spread all across the world, so it's amazing to have you here. Thank you for joining: it doesn't matter whether it's night or early morning for you, we're still happy to have you with us. Today's topic is Intro to Apache Cassandra from the developer's perspective, and it's going to be one of the best workshops our DataStax developers have ever done. So welcome!

Yeah, definitely welcome, it's nice to see so many of you. Let's start, because there is much to do, and you will all get your hands very much dirty with Apache Cassandra, this very cool NoSQL database. Don't be afraid, because we have you covered: if you don't have any experience with NoSQL databases and you don't want to install anything on your laptop, that's still nothing to worry about, because we prepared everything for you right in the cloud, and that's quite cool. So let me start, just one second... yes, it will be good like that, and now I can see your questions. Okay, and Venkatraman is perfectly right: this workshop is fully a YouTube live session. Yes, but you will see that in a moment. So let's say one thing first: we are live, as you see, Venkatraman, but you can always go back on YouTube; this workshop will be there, recorded, for you to watch later as many times as you want. Actually, all of our workshops are live but then available on
demand, so you can also attend our previous workshops; we have covered incredible topics with Spring, Quarkus, Node.js and many other languages and topics that can be interesting for you. So today, pilots, please fasten your seat belts. I'm Alex Volochnev, developer advocate lead at DataStax, and my favorite topics are the cloud, distributed applications, high-performance heavily loaded systems and many, many others. And with me today is Stefano, developer advocate at DataStax. Hi Stefano, could you please introduce yourself? Hi, well, I'm Stefano, I am a developer advocate at DataStax. I like Apache Cassandra, I love distributed systems, and I like both sides of IT, let's say: architecture, operations and development. And I like to teach.

Right, but that's not all of it: there is a hidden team behind the scenes which is still very important to us. All of the DataStax developer advocates are here, making this workshop ready, answering your questions and doing many other things to make you successful with the technologies you are using. Speaking of answering questions: we live stream primarily on YouTube, and that's most probably where you are watching us right now, but we always have a plan B, which is Twitch. So if you have any problems with YouTube, you can switch to Twitch. Please consider one limitation: YouTube is the primary platform, so we watch comments, answer questions and so on there, but Twitch is used for the stream only; we do not answer questions there. Speaking of which, you can of course ask questions on YouTube, and that's totally supported, but as soon as the stream is over, our communication on YouTube is cut off, and we don't like that; we want to be able to communicate with you afterwards. So we have a dedicated Discord server, datastax.io/discord, and there is a huge community there already, I believe like seventeen thousand people or something like that, well,
almost eighteen thousand now. Wow, we might hit eighteen thousand if you jump there and join. So, at our Discord server we will obviously be able to communicate with you after the workshop. Meanwhile, by the way, if you won't be able to complete it today, if you have other things to do, or maybe you are from India and it's already dark, even darker than in Europe, you can totally do your lab afterwards and ask your questions on Discord; we will be there to answer. As an additional platform we use menti.com, for two purposes. First, we will ask you questions to understand you better and to prepare content for you better, to know what to focus on. Second, at the end of the workshop we play a game with prizes: I ask you questions, you answer on Menti, and if you answer fast enough and give the right answers, you have a chance to win prizes from DataStax, shipped worldwide, and that is truly cool. I will give some more details in a couple of minutes.

Second part: as this is a workshop, not just a boring webinar, you are going to push some buttons. You can do it live during the workshop with the help of Stefano, or you can do it afterwards if you prefer to watch first and practice after the live stream; both options are fine for us, just do your work, because theory without practice is dead. So how do you do that? It's at github.com/datastaxdevs/workshop-intro-to-cassandra, and you can get there using our "!github" shortcut in the YouTube chat; our dear Nightbot will send you the link to the workshop. And another thing: for the exercises you will use Astra DB, which is a managed Cassandra in the cloud, so we don't make you install all the things and deploy a cluster; that would take noticeable time, and this workshop is the developer's perspective, so we don't want you to install Cassandra on your laptop today and make it all happen yourself. You will use Cassandra deployed in the cloud, just as a customer, completely for free, and that's actually
quite cool. We will take a closer look at Astra a little bit later, but what you need to know for now: Astra DB is Cassandra as a service in the cloud. Oh, there are so many people now! Yeah, so just one thing, Alex, sorry: we already gave you the link to the GitHub repository, and you are very much welcome to open it and have a look, but don't fret, don't rush into it, because we will go through it together. So maybe you'd better pay attention to what Alex is saying now, because there will be a bit of explanation before we start getting our hands dirty. Do whatever you prefer, but don't worry, because we will go through the practice together. Yeah, and don't miss the theory part, because the theory part is very important: Cassandra is not like a traditional relational database, it's much more powerful, but this power comes with some costs you have to understand. If you skip the theory part, where we tell you how it works and why it works this way, you may later run into big problems in production, and we don't want that. So pay attention to what we are showing and saying. Hey, I see Venkatraman already attended some of our workshops; I like it. We don't use Gitpod for this workshop, not today. Hi Nuno, great to have you here.

So let's move on. There are so many people doing an incredibly great job attending our workshops that we decided to say thanks to all of them, to highlight their hard work and desire to learn something new, so we created participation certificates, or badges, whatever you prefer. For this workshop we have the Introduction to Apache Cassandra badge, which is already owned by 346 people, and you can and should be the next. The overall number of people who managed to complete our workshops and get their badges, for Intro to Cassandra and the many other workshops we do, has just crossed 2,500 attendees, 2,500 badges issued, and we are really proud of that; that's
like a little celebration today. Yeah, maybe later we have to open a little bit of prosecco or something. Good. So, how do you do the homework, how do you get the badge? To get the badge for this workshop you have to do three steps. First, attend the live stream or the recorded stream, and you are doing that already right now, so it's almost done. The second step is to complete the lab, the practical steps from this workshop, as explained in our GitHub repository. Then finally you will have to do the CQL and Cassandra data modeling scenarios, again as explained on the GitHub page; they are short, the overall time shouldn't be longer than 15 minutes for the two of them together. Then you submit your homework following the link which is in the GitHub repo, and you will get your badge within a few days. Submissions are processed manually, so don't expect it the very next second; it will take some time, but you will get it by the end of the week most probably, maybe the beginning of next week. We will do our best to review all of the submissions as soon as possible. Yeah, sorry Alex, there is a question: "Where do we see these badges once we've earned them?" Well, they are verified badges; basically they are links that you can spread around and put in your LinkedIn profile, in your tweets and in your social contacts, to brag and prove that you successfully attended one of our workshops. And yeah, Alex, you're showing exactly that. They will be available on Badgr, and obviously you can paste them everywhere you want: on LinkedIn, Facebook, whatever social network you use, whatever you need. Today we are running the Cassandra workshop, okay, here we go, so it will look like that; I hope Joshua won't be offended if I'm showing his badge. It's a verified badge confirming the steps you completed; now it requires a
little bit more steps, so yours will be extended a bit; this one is an old one, from November. Okay. So, badges are great, but there is something more important than badges, known as knowledge, so let's proceed to the knowledge part of the workshop; you will practice later. The first thing we do is go to Menti. What is Menti? Menti is the system we use for the game and for the quiz. How do you get on Menti? You go to menti.com; personally I recommend using your mobile phone for that, because you will need your laptop for the exercises. Open menti.com on your mobile phone and enter the code you see on the screen next to me, 1453 3990, or scan the QR code next to Stefano. Now, this is not the game yet, not the quiz, not the competition; that will be done later, at the end of the workshop. For now there are just some setup questions, and we recommend answering them to make yourself familiar with Menti. An important thing: you can answer questions in the YouTube chat, but if you do it only in the YouTube chat, then in the end you cannot win the prize, so you'd better make yourself familiar with Menti; it's anonymous and actually quite a cool platform. Meanwhile I see the question by Ariane: yes, the answer is yes, you get a voucher for the certification exam. Okay, so let me see how many people have joined already: 60. And you know what, that doesn't work like that: I see more than 200 people watching us live right now, and I won't start as long as we don't have at least 80 people on Menti; less than that is not serious. Yeah, because, sorry, there is going to be a game, a quiz, later, and we use Menti, so please join, and you will have a chance to win some prizes. So, 76... I see 76... I see 78, okay, okay, others can join later. If you want to join the game, the code is available right at the top of the screen, 1453 3990, and you can get the link using the "!menti" command. So, 82 have joined already, and we start, we start, we start with a few
questions to find out more about you. Where are you from? Is it Europe, is it maybe Asia-Pacific, South America, North America, Scandinavia? I've seen people from every place you mentioned in the chat, right: I've seen folks writing that they come from Portugal, Africa, the USA. Do we have anyone from Brazil? Let's see... not yet... oh yeah, that must be Brazil, I guess. We are now working on a Cassandra Day Brazil, so it will be epic; it will be somewhere at the end of February or maybe March, I guess, and we will announce it separately. Good, thank you for answering the questions. Hey, privet, Alexei! Hello! Ecuador, United States, South Africa, okay, that's quite... Malta, wow! No one from Australia, but I bet they are all sleeping at the moment. Good, so let's move on. Kenya, wow, cool. Yeah, our live stream is really global. Have you ever been to any of our workshops before? Hola, Sergio! Good, okay, we have a lot of new faces, that's cool. Okay, so for those of you who are answering in the YouTube chat: unfortunately, no, that does not automatically go into what you're seeing in the pie chart, because you are supposed to answer through the menti.com platform, through the link we are giving you. Again, this is true here and now, and it will also be true later when we do the quiz: if you want a chance to win the prize, for the later questions you will not answer in the YouTube chat, you will go to menti.com and enter the code that you see at the top of the screen and in the YouTube chat box. That's the only way for your answer to be counted; it's a separate platform for, you know, online questions. I see a couple of people mention they don't see anything on their phone on Menti: please try to reload the page, or maybe meanwhile install the Menti application on your mobile phone; I've heard a couple of times that it works better than the web version. Good, so let us move on. Most of you are new attendees, and that is quite cool, so, as we are not introduced yet: we are DataStax
developers, and we run software developer workshops twice per week, same topic, just different time zones. By the way, if it's too late for you, tomorrow we do the same workshop a little bit earlier, so it may be more comfortable for you to join tomorrow for the second run; it will also be live. Let's move on then: how much experience do you have with Apache Cassandra? Never used it, okay; less than a year; one, two, three years... that's quite cool. It's totally fine to have no experience with Apache Cassandra, because, well, this is an introduction workshop, so it's the perfect place for you to be right now. What's your experience with other NoSQL databases? Just to mention some: maybe Redis, maybe, I don't know, Neo4j, DynamoDB and others; there are plenty of opportunities in the modern NoSQL world. Okay, we have some experts, that's cool. Good: some have experience, some never used them, that certainly makes sense. Now, the next point, and thank you for answering that: are you a certified NoSQL specialist, maybe in the field of Cassandra or any other NoSQL database? No, no, no, no... okay, hey, we have some exceptions, but the overall distribution is quite disappointing. See, here's the good news: that's almost a hundred people answering this question, and most of them are not certified in the NoSQL world. And you know what, I have very special news for you: DataStax sponsors your education and certification as a NoSQL expert, for real. You can take the online courses for the NoSQL field and then pass a certification completely for free. I will talk about that at the end of the workshop, but that sounds quite cool, right? Yes, it is; I'm happy you love it, and if you need more details, we will discuss it later. So thank you, keep this window open, we will ask you more questions during the workshop, and let us go on. For those having problems with Menti, please try to reload, maybe try to open
it in your browser. Right now it doesn't really matter, but at quiz time it will matter, because you have to answer fast to get more points. "It does not refresh": okay guys, I cannot help you with Menti here, but I hope you can open it on another device; maybe try to install the Menti app, or maybe try a different tab on your PC, since there might be some glitch on the phone website. I just tried with my phone and it seems to work for me, so I don't know; maybe they deployed a new version that has glitches on some phones, so try with your PC browser. But later; now let's start. Yeah, let's start, and now please, full focus, full attention, all hands: the important part starts. The first thing you need to know about Apache Cassandra: it's a NoSQL, distributed, decentralized database, and every word here matters a lot. You can run Cassandra on a single server (we call a Cassandra server a node), but that's fine maybe for development or educational purposes; in most cases you want to run multiple Cassandra nodes, because Cassandra is a distributed database. Multiple nodes located together within a single data center build up a "data center", or as we call it, a ring, and multiple data centers working together build up a cluster. Each node is a very powerful thing which communicates with every other one. Now notice: as it's a decentralized database, it has no kind of masters or slaves, primary servers or secondary servers, write replicas or read replicas, nothing like that. Every replica is able to handle every request, and that's very, very important. So who uses Cassandra? Cassandra is, first of all, used on big deployments with huge data, on globally available deployments, by companies like Netflix, which has clients literally everywhere, maybe except North Korea, let's say. And it's also used by companies who require 100
percent uptime; not 99.9999 and so on, but 100 percent of the time. There are some health institutions in Germany, for example, using Cassandra not because they have big data like hundreds of petabytes, but because they need to know that the data will be available at any moment, at any time. For Netflix, Apache Cassandra is the primary database: 98% of streaming data is stored in Apache Cassandra. As for the number of operations, they have multiple clusters, but the volume on the most active one is 30 million operations per second, which is quite a lot, and the overall amount of data is measured in dozens of petabytes, which is really a lot, but naturally not so much in comparison with Apple: Apple actively uses Cassandra, and they have hundreds of thousands of Apache Cassandra servers running together, handling hundreds of petabytes of data and doing many millions of operations per second across over a thousand clusters. So that's all quite cool, and it's obvious that there are a lot of customers like Instagram, Netflix, IBM, Activision and many others running Apache Cassandra for their data. But why do such big companies choose Cassandra and not the other competitors, the hundreds of databases on the market? Let's take a look. There are a few highly important features you must know in order to understand how Cassandra works. Cassandra is big-data ready; it has exciting read and also write performance; linear scalability, which we will discuss; highest availability (I intentionally call it not high availability but highest availability); incredible capabilities for self-healing and automation; it perfectly supports multi-data-center deployments and geographical distribution; and it is platform agnostic and vendor independent. What does all that mean? First, big-data ready: Cassandra works over a distributed architecture, distributing data over multiple servers; you will see how it works in more detail in the next minute. Basically, if you need more
volume for your data, you just have to add more nodes, and data will be moved and spread over the new nodes automatically; you don't have to worry about that migration, the cluster takes care of it. The more data you need, the more servers you add, and you are still fine: that's how Apple handles hundreds of petabytes of data in their database. Read/write performance: take a look, it's not a surprise, but many databases can show you a very nice read performance and scale perfectly for reading. If you need to execute more read operations, let's say SELECT statements reading data from your database, many databases can scale well. But when we speak about write scaling, most databases nowadays use the old approach with a master server and secondary servers, the write-replication approach, and while this approach simplifies a lot, it also doesn't scale so well. With Cassandra you scale great both for reads and writes: every single Cassandra node is very performant, and a cluster consisting of multiple nodes brings throughput to the very next level. Decentralization means that every node is able to deal with any request, no matter whether it's a read or a write. Cassandra is a masterless database: there is no concept of a write replica or a primary server. Linear scalability: what does that mean? First of all, there are basically no limits on size or on performance. If you need bigger volume, as I told you already, you add more nodes; but if you want better throughput, if you want to successfully execute more operations per second, you also just add more nodes. There is an incredible piece of research, an incredible blog post, done by Netflix: they did a special study of how Cassandra scales, going from 50 servers in a cluster up to 300 servers, and take a look, there is no surprise, Cassandra is highly performant. But the real point is this straight line: in most databases this scaling won't be linear,
it will degrade over time. The more servers you add, the more overhead there is for adding new nodes, and at some point you get problems: you add more servers and you don't get a performance boost. Cassandra scales linearly, as long as your data model is right (we will discuss that later today), and as a result you can get more and more as long as you add more servers; you don't pay any extra cost, there is no overhead for adding new nodes. Highest availability: replication, decentralization and a topology-aware placement strategy take care of possible downtimes. What does that mean? First, replication: data is replicated to multiple servers, so even if one of them is somehow unavailable at the moment, you always have other live replicas. Obviously, multiple data centers grant you fault-tolerance features: if, for example, the Frankfurt region is not available, you still have other data centers keeping your data available. Network-topology-aware data placement: the Cassandra cluster is smart, and it takes care of how it places data. It won't put all the replicas of the data on the same server, or on the same server rack, or in the same availability zone (you know, there have been multiple issues with Amazon Web Services availability zones). It is smart, so live replicas are distributed over multiple availability zones, and a single availability zone failure doesn't lead to problems. Also the client drivers, like the Java, Node.js and Python drivers, are smart: they have very good reconnection and strong retry mechanisms. At this point I can claim the highest possible level of availability for a database. Self-healing and automation: operating a huge cluster can be really exhausting, but Cassandra clusters are smart; they really understand a lot about how things work internally, try to recover, apply self-healing, recover failed operations, recover failed data, repair failed data placement and many other things, and as an engineer you don't have to take care
of that; it's done by the cluster. In most of the cases where, with other databases, your monitoring would trigger alerts and pull you in at 3 a.m. to fix something, a Cassandra cluster can dodge, well, not most of the bullets but a lot of the bullets, and recover afterwards totally automatically. Okay. And of course geographical distribution, one of my favorite features: Cassandra's trademark is multi-data-center deployments all across the world, an exceptional capability for disaster tolerance and other features. If a single data center is not available, you can read your data from, or store your data in, the others, and all data centers are active. There is no concept in the Apache Cassandra world of "to this data center you can write your data and from this one you can only read", as often happens with relational databases: all data centers are available and active. And finally, platform agnostic: Apache Cassandra is not bound to any platform or any service provider. For example, DynamoDB is great, but DynamoDB is available only on Amazon Web Services; you cannot use DynamoDB on Google or in your own data center, right? Apache Cassandra is available everywhere: you want to run it in your own data center, it's fine; you want to run it on Amazon Web Services, it's fine; you want to run the same cluster simultaneously on multiple different clouds, totally perfect, it works. And one last thing: Apache Cassandra belongs to the open-source, non-profit Apache Software Foundation. You for sure use some Apache Software Foundation projects: Hadoop, Spark, Kafka, Maven, ZooKeeper and many, many others; there are hundreds of projects managed by the Apache Software Foundation, and Cassandra doesn't belong to any commercial vendor. Of course DataStax does a lot for Apache Cassandra, but we don't control it; it's all governed by the Apache Software Foundation. Okay, you know what, it all looks quite cool; do you want to know how it works? Because, well, that's quite a lot:
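To make the multi-data-center replication just described a bit more concrete: in Cassandra, replication is configured per keyspace. The sketch below is illustrative, not from the workshop; the keyspace and data-center names are made up, and in a real cluster the data-center names must match what the cluster itself reports (for example via `nodetool status`).

```sql
-- Hypothetical keyspace; 'eu-frankfurt' and 'us-east' are assumed
-- data-center names for illustration only.
CREATE KEYSPACE IF NOT EXISTS workshop_ks
  WITH replication = {
    'class': 'NetworkTopologyStrategy',  -- topology-aware placement
    'eu-frankfurt': 3,                   -- three replicas in the EU data center
    'us-east': 3                         -- three replicas in the US data center
  };
```

With a definition like this, every row is stored on three nodes in each data center, and all of those replicas are active for both reads and writes; there is no read-only data center.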
how does Cassandra achieve such performance and capabilities? Yeah, I think our attendees are very curious about how Cassandra works, judging from the quality and number of questions that have popped up in the chat; I don't know if there is time to address all of them, so I'll try to answer them in the chat, but maybe some of them will find a better answer while we go through the next slides. So, how does it work internally? Cool; maybe you want to voice some questions right now, or later, after some internals? Well, maybe one can be answered immediately: there was a little discussion about whether replication, I mean multi-data-center replication, is synchronous or asynchronous. It is asynchronous, like many things concerning Cassandra; there is not even the concept of a transaction as it would be in a relational database, so most things happen behind the scenes asynchronously, with no direct impact on latency: you do your write and it's done. That being said, replication among data centers in a healthy cluster usually occurs within a few milliseconds, on reasonable workloads at least. I would add that Cassandra has very many knobs and levers and buttons, and it's highly configurable, so in some cases you can make the distribution of data over multiple regions synchronous, but personally I wouldn't recommend that, because obviously that operation cannot be fast. Yeah. So let's move on: how does it all work? There are a few mechanisms you need to know. First one: data is distributed. Take a look: in the traditional approach, with relational databases, with MongoDB and others, by default you have all your information, all your tables or all your collections, on the primary server. If you want to scale, you can add some secondary servers, read-only servers, and a full copy of the data from the primary server will be there, right? And
that's a great approach, but it has problems. First, it doesn't scale well: the more data you have, the more powerful servers you need, and in the end it's extremely expensive. Second problem: at some point your data doesn't fit into a single server, even if it's the most expensive server in the world, and then you have to do sharding. If you ever did sharding before, then you know how painful it is, and if you didn't, trust me, you are lucky, because sharding on those databases is indeed painful. I did it a couple of times, I don't want to repeat the experience, I am crying bloody tears, don't make me, please. Cassandra was designed to handle any load. That means: at some point we cannot store all the data on a single machine, we have to spread it, and we'd better do it not like sharding, because sharding is painful, but natively. Take a look at the table on the right side: country, city, population; a very simple table, the country, the city, and the population of the city. How will this data be stored in a Cassandra data center of, say, seven nodes? Let's take a look: boom, like that, the data is distributed over multiple servers. For example, you have the United States data on this server, France here, Japan here, Canada here, Australia and India on this one, the UK on this one, and so on. That means that no single server normally has all of your data; the data is spread over multiple servers, and that's very important to understand. This approach of course creates some complexities, as you will see soon, but if you have to host more data, more cities, more countries, you just add more servers, and you will easily be able to handle any size of data and any requirements from the throughput and performance point of view. So let's take a look at how it works. Here, data is grouped by country; that's quite clear, because we use country as the partition key. The partition key is how
data is grouped into partitions, so the rows having the same partition key will belong to the same partition. For example, New York and Los Angeles both belong to the United States partition: they have the same partition key, which is the value of the country column. By the way, a partition key may be defined by multiple columns, but in this example we define it by a single column, just for clarity. Well then, what do we do? We hash the value of this column with the Murmur3 hashing algorithm. Murmur3 takes whatever value you give it, a string, an integer, whatever you prefer, and generates an integer value as a token, and you can now see 59, 12 and 45 as the tokens. That's a little bit of a simplification: usually those integers are very, very long, but here we just shorten them a little. Now let's take a look at our data center of four Cassandra servers. We have four Cassandra servers, and they have responsibilities. Take a look: the first server is responsible for the token range from 1 to 25, the second one for the range from 26 to 50, then 51 to 75, and so on. That means that Sydney, with token 59 (because the hashed value of Australia is 59), will go to server number three; Toronto and Montreal will go to server number one, because Canada, hashed with the Murmur3 algorithm, gives 12.
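The partitioning walk-through above can be sketched in a few lines of Python. This is a toy model, not Cassandra's actual code: a stand-in hash maps the partition key onto tokens 1 to 100, exactly as in the simplified slide, while real Cassandra uses Murmur3 over a 64-bit token range. The table and node names are assumptions for the example; in CQL the slide's table would look roughly like `CREATE TABLE cities_by_country (country text, city text, population int, PRIMARY KEY ((country), city));`.

```python
import hashlib

def toy_token(partition_key: str) -> int:
    """Stand-in for Cassandra's Murmur3 partitioner: maps any key
    deterministically onto the simplified token range 1..100."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 + 1

# Four nodes, each owning a contiguous token range, as on the slide.
NODES = {
    "node1": range(1, 26),    # tokens 1..25
    "node2": range(26, 51),   # tokens 26..50
    "node3": range(51, 76),   # tokens 51..75
    "node4": range(76, 101),  # tokens 76..100
}

def owner(partition_key: str) -> str:
    """Return the node responsible for the row's partition."""
    token = toy_token(partition_key)
    for node, token_range in NODES.items():
        if token in token_range:
            return node
    raise AssertionError("unreachable: every token in 1..100 is owned")

# Rows from the slide's table: the partition key is `country`,
# so all rows for one country land on the same node.
rows = [
    ("USA", "New York"), ("USA", "Los Angeles"),
    ("Canada", "Toronto"), ("Canada", "Montreal"),
    ("Australia", "Sydney"),
]
for country, city in rows:
    print(f"{city:12s} ({country}) -> token {toy_token(country):3d} on {owner(country)}")
```

Because the token depends only on the partition key, Toronto and Montreal necessarily land on the same node; that co-location within a partition is what makes partition-based reads cheap.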
So they go there, and then Berlin and Nuremberg go to another server, while one server stays, for now, without any data, for such a short table. Now, what happens next? We have a four-node data center and the token ranges are quite short per server; how does it work if you want to scale out and add more servers? Take a look: scaling out, we add one more machine right now. If you know sharding, you would think at this point: "Oh my god, now I have to re-shard all the data, it's such a lot of work, I'll basically have to rewrite part of my application to support it." Not with Cassandra, because Cassandra takes care of partitioning. What happens when you add a fresh, empty, new server to the cluster? Cassandra sees it and joins this server to the cluster; it's empty, it has no responsibilities yet. The first thing that happens: the Cassandra cluster recalculates the token ranges. Before, the range from 1 to 100 had to be handled by four servers, but now the token ranges will be shorter, because we have more servers, so every server will be responsible for approximately 20 percent of the data. Now take a look at what happens next: do you have to manually transfer data to those servers? Of course not; again, Cassandra is smart enough. After the recalculation of the token ranges, Cassandra streams data to the new server, so the data gets distributed across more nodes, and the pressure on each node and the amount of data on each node decreases. Now, take a look: most businesses don't have the same workload all the time. There are seasons when you don't have a lot of customers, and there are seasons or days when you have a lot of them, for example Black Friday, Christmas sales and so on. So obviously you don't need the maximum out of your servers all the time, and sometimes you want to scale in: at that moment you don't need five servers in your data center, the number of operations has decreased significantly. So we go with scale-in, and we
decommission two servers two nodes from the cluster what happens then do we have to do resharding manually transfer data recalculate rework our application no cassandra will do that for you so again getting the request to decommission two servers two nodes cassandra will first recalculate token ranges so each server now will be responsible for longer token ranges for a bigger amount of partitions those partitions will be streamed from the decommissioned servers to the servers which are now responsible for those partitions and after that it takes some time because they have to stream data sometimes it may be quite a lot of data you can take down those servers don't pay for them if you are in the cloud or maybe redistribute this capacity and use these virtual servers at your place for other purposes and no there is no downtime during the recalculation thank you narasimham that's a great question the answer is no there is no downtime on recalculation no downtime on scaling out no downtime on scaling in all the operations are executed in the background and your data stays available all that time i see some people with experience here right and let's move on then so stefano are there any good questions to voice over right now well actually a lot of them i'm having a hard time trying to keep pace with all of them but yeah go ahead sorry i wanted just to tell you call in more guys from datastax to answer questions yeah i'll try to ask someone to help there should be not a single question unanswered we have to answer yeah that's what i'm trying to do so there are so many interesting questions there was the question whether partitioning uses consistent hashing and the answer is yes that's a technical feature of the way rows are partitioned into these numbers that ensures that the minimum amount of rows are shuffled when you change the number of nodes and the answer is yes that is implemented in cassandra then there are questions about how do
relations work in cassandra and i tried to say wait for it because it's going to be very much explained later but i couldn't resist starting a discussion with wen liang zhang and yeah there are no joins big surprise but there are cases where this is replaced by a very well-designed data model and in other cases as i was writing right now if you really need some kind of analytics workload you can rely on tools such as apache spark on top of cassandra which is a very well working pair of technologies because cassandra in itself is more oltp so but there are actually other questions and i'm trying to recruit someone else to help in the chat because they are very high quality and interesting questions today good and the last question i wanted to answer before we go on and we are calling in more people to answer your questions so don't worry john cuentas asks how varying partitions affects the read write response time so that's a very good question it's more about operations than the developer point of view but doesn't matter i still want to answer take a look our recommendation is to never scale out by too many servers at once or never to scale in by too many servers at once if you have five servers we recommend you to add or decommission one at a time not a dozen at a time because obviously if you're trying to bootstrap or decommission a dozen nodes half of your cluster at a time your cluster will be then busy transferring data of course but if you make it smooth one by one by one then normally it doesn't really affect read write performance that's a quite important point good then let us move on because we have still a lot to discuss and i see more people from datastax jumping in the chat to answer all the questions thank you hi david thank you for joining that's quite cool so you know what data distribution and all those partitions are really cool they give you incredible super power to scale out
scale in be able to handle any black fridays and meanwhile quickly scale back scale in without overpaying for a powerful huge cloud infrastructure so it's all good this ability to scale out and scale in automatically linearly is really really cool but you know what that's not enough there must be some other cool features you need to learn another mechanism you need to understand another approach you have to understand that data is replicated take a look what is replication replication means that we store every row or better to say every partition not a single time but multiple times on different servers and if you have a multi-data-center deployment over multiple data centers the replication factor is a number you define to specify the amount of servers responsible for each partition so take a look if you have replication factor 1 each partition each row will be stored only once in your data center which is well let's say not the best approach if you want to have high availability right so i mean it works technically but it's not the recommended approach what happens if you specify replication factor 2 that means that every server will be responsible for two different token ranges and as a result the very same partition usa from the previous example will be stored two times on two different servers and finally if you use the recommended replication factor three that means that each partition is stored on three different nodes right so basically if one of those servers becomes not available for whatever reason network outage power outage a planned update also a perfectly valid reason you have to restart your servers from time to time to apply operating system patches to upgrade your database and so on and so forth so planned downtime no downtime server restart no downtime network outage no downtime or maybe you may need multiple data centers if it's a huge downtime at a region level down so
how does that work that obviously means you have to update your data on multiple servers so let's take a look when our data arrives at some node don't forget every node is able to handle every request in this case for example a write request and we want to write data for usa and you know we store it on those three servers right but somehow server b gets this request that's not a replica server for this data that happens not so often because normally the cassandra driver is very smart the cassandra driver will send the query to a replica which owns this token but not in this case something happened and our query reached kind of a wrong server not a big deal because each server is able to handle each request so this one will become a query coordinator the query coordinator knows where data is stored and which server is responsible actually every server knows the data allocation over the cluster and then it executes this query to all of our servers so they get this update and store it and we are fine all is good but you can ask what happens if one of the nodes is on fire what happens if one of the nodes is not available power outage some attack or maybe aliens kidnapped it in short seeing that one of the servers is not available the query coordinator doesn't drop this data to nowhere but it stores a persistent hinted handoff hinted handoff is the first layer of defense in cassandra and the first tool of self-healing in cassandra when this failed node or stolen node or whatever-happened node becomes available again it notifies its neighbors okay everyone sorry i was busy i was off to get pizza now please tell me did anything happen while i was not available and the query coordinator and all the other servers that got queries touching this node will send this data to it and it will recover and become available again with all the data it missed and that's just the first layer of defense out of multiple layers
of defense of cassandra against inconsistency and against the situation when one of the nodes didn't get the update good now some of you may think of some of you i believe those who worked with distributed systems or replication already know that replication is a great tool if you want to have high availability but what's the biggest problem with replication it brings one huge problem i will give you a moment to think and meanwhile i want to answer the question of emma slovik what happens using replication factor three if in a six node cluster three consecutive nodes die i once was asked what happens if we have a six node cluster six node data center and all six nodes are dead will we get our data short answer is no there is no magic if all of your servers are dead you will not get the data so you see if you request some particular data in a six node cluster with three servers dead it depends on the consistency level you specify we will discuss it later in short you still may get some data in this scenario if you are lucky enough and if your consistency level is not too high you will see soon in the upcoming steps of the workshop how it works good so let's go with the next step i hope you got your time to think of what is the biggest problem of replication and the answer the biggest problem of replication is inconsistency for it you pay with availability in short don't blame me replication brings a lot of problems but it's not because of me but because of eric brewer who created eric brewer's theorem also known as the cap theorem which explains what goes wrong in distributed systems like cassandra or like any other system actually because it's not about cassandra it's about distributed systems in general so what's the story in a distributed environment you can have only two guaranteed qualities out of three in case of emergency what does that all mean simply explained have you heard the joke that a
job can be done quick cheap and good but you can pick only two a job can be done quick and good but it's not going to be cheap and a job can be done quick and cheap but it's not going to be good right i believe you have met situations like that in your life for sure so it's the same with consistency availability and partition tolerance but let's first talk about what is consistency what is availability and what is partition tolerance i will start with availability in this case it's the simplest concept availability means that you ask your database or whatever system you run you ask your database for the data and you get the answer like what's the state of my account you have i don't know one thousand dollars what's my public profile picture that's a url to your public profile picture and so on and so forth availability is simple if you get the answer your system is available if you don't get the answer most probably your system is not available right and don't forget cap is about what happens in case of problems right so it means you have some fire in your data center it's not obvious but a lot of databases may be not available if even a single node is down especially for master slave systems or the primary server secondary server approach that's very often a problem with a single point of failure which is the primary server of course another quality we have to talk about is consistency what does consistency mean in this case it's cross-node consistency cross-server consistency what exactly does that mean take a look you have just seen that we update multiple different servers but if something went wrong and a server wasn't updated and the hinted handoff wasn't delivered then the server basically doesn't know about this update and if you ask two different servers one of them will give you the most up-to-date fresh value the correct value but one of them may give you outdated information which is not valid anymore and that's called cross server
inconsistency and it's quite bad so you want your database to be consistent right you don't want to get stale data i believe you don't okay we understand availability and we understand consistency what about partition tolerance partition tolerance is the least obvious and the quality people misunderstand most of the time partition tolerance is the ability of a distributed system to survive network partitioning and what is network partitioning network partitioning means that half of your data center cannot see and cannot communicate with the second half of your data center imagine you have a data center with five servers and let's say there is a network split between them there is group a of two servers and group b of three servers and inside each group they can communicate both groups are visible to clients but meanwhile those two groups cannot communicate with each other it's called network partitioning and that can be a potentially extremely dangerous situation it may be even worse than downtime you can tell me alex what can be worse than downtime when we have a downtime my manager is angry cries aloud my customers are angry they switch to some competitor nothing can be worse than downtime yeah something can partition non-tolerance can be because if your data center is split into two different but operational groups then part of your clients may work with group a part of the clients may work with group b and if the situation keeps on longer than a few hours or maybe even a few days afterwards you will be absolutely not able to recover from this problem because you will have different updates on different servers it's called split brain and split brain is worse than downtime you may spend weeks to recover from this situation it's worse than downtime okay so as we already agreed network partitioning is the worst problem and you still can have only two qualities
available at a time you have to decide do you want to be available and partition tolerant do you want to go here or maybe you prefer to be consistent and partition tolerant in short this place is called ap available and partition tolerant and this place is called cp consistent and partition tolerant and different databases behave differently some databases tend to be more consistent and more partition tolerant in this case you will nearly always have no stale data all the data is fresh and right but in case of emergency availability is sacrificed and your database is quiet you won't get your answer there are some other databases that prefer availability over consistency and they go to this side and they are available and partition tolerant but that comes with a price right everything comes with a price they are available and partition tolerant you always get your answer but the price to pay is eventual consistency in some edge cases you may get stale data now there is a huge misunderstanding people all the time say that cassandra is ap available and partition tolerant but i cannot agree with this statement cassandra is configurably consistent as i said before you have a thousand different knobs and levers and buttons to configure so your cassandra will work exactly as you want and using different tools you can set cassandra from cp to ap from ap to cp depending on your particular requirements so i prefer to consider cassandra configurably consistent it's a setting and that's a decision for you to make at any moment of time for any particular query you can change the consistency level you require it defines how many confirmations you will wait for before the response is dispatched now take a look there are two main knobs to control consistency replication factor and consistency level let's say you set replication factor to one boom you are now always consistent because you always have only one replica but the price to pay
availability losing a single node just for a simple restart means data won't be available as long as your node is not running right you pay the price having a higher replication factor which is what's recommended though obviously you may in theory get stale data so now we understand the knob of replication factor it's actually quite simple let's take a look at the knob of consistency level the consistency level you can set per query not at the table level not at the database level but on a per-query basis which particular consistency level you require to have for example if you work in java i believe there should be some people with java experience when you make a prepared statement with session prepare insert into blah blah blah you can set for this prepared statement consistency level one or working from the command line in the cassandra query language shell you can get the current consistency level and you can specify consistency all to set the consistency level to all okay but what exactly does that mean take a look consistency level means how many confirmations we will wait for from replica servers before we return the answer to a client with consistency level set to one and replication factor set to three we will have three replica servers right and with consistency level one when we write data the data will be delivered to all of the replica servers but we will wait only for one confirmation as soon as the fastest server responds or the fastest answer is received on the query coordinator the client will be immediately notified about that right then with consistency level quorum we will wait for the majority of the replica servers to answer for example with replication factor three the quorum so the majority will be two because two makes the bigger half the bigger half sounds fair that's what i mean two is a quorum of three or three is a quorum of five or four is a quorum of seven that must be quite clear now
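the quorum arithmetic is worth pinning down: a quorum is the smallest majority of the replicas, floor(rf / 2) + 1. a one-liner makes the "bigger half" concrete (plain python, independent of any driver):

```python
def quorum(replication_factor: int) -> int:
    """smallest majority of the replicas: floor(rf / 2) + 1."""
    return replication_factor // 2 + 1

# two is a quorum of three, three of five, four of seven
assert [quorum(rf) for rf in (3, 5, 7)] == [2, 3, 4]

# with rf=3 a quorum write still succeeds with one replica down,
# since only 2 of the 3 replicas need to confirm
assert quorum(3) <= 3 - 1
```

in cqlsh the level is set for the session with `CONSISTENCY QUORUM;`, and in the java driver it can be set per statement, as mentioned above.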
let's take a look at which consistency levels are available to us the consistency level we work with most is one two and three are not so often used but you can specify them of course then quorum and all and there are some more available when you have a multi dc setup consistency level local one or local quorum requires a confirmation from the data center closest to your application if your application runs in europe you want it to work with a european cloud database data center not a faraway australian one right it would be too long a way so latency would be too big local means just the closest data center with the data and each quorum is a big thing each quorum requires all of your data centers to answer with the confirmation that the biggest part of the replica servers got this update and obviously if you have data centers in the united states australia europe china i don't know new zealand maybe the north pole then obviously that's not going to be a quick operation so as you see working with consistency you can specify what consistency level you want to have right now specify consistency level one and you have a chance to get the answer faster because whichever replica node answers faster you will get the answer immediately you go with consistency level quorum you wait for two confirmations and it may be not as lightning fast as one although normally under normal circumstances that's more or less the same time and then you can go with consistency level all take a look you have the consistency level knob you turn it to maximum and now you have consistency level all that means that you will wait for all three replicas to answer with the same data and here comes a little problem the cap theorem is still here watching you always watching you and that's a trap so the point is when you turn the consistency level to maximum you go from ap to cp right now you require full consistency and that's totally fine it will work as long as all of your servers are available but what happens if a single
server is not available here like this server is down and you require consistency level all you will get an answer sorry cannot reach the desired consistency level you are done you won't get any response data is not stored we cannot guarantee it's all good sorry no no no database is down you cannot get your data well technically the database is running but it won't answer why because you required consistency level all now let's take a look at how do we get there actually we discussed it already i want to have my data consistent everyone wants that and i still want to be able to survive at least a single server failure and still be able to get all of my data right how do we get there there is a thing in cassandra consistency levels that more or less allows us to get to this sweet spot we want to reach let's take a look at how it works it's called immediate consistency how does it work take a look when you write data with consistency level quorum it means that we will know that at least two out of our three nodes got the update right quorum majority did this server get the data or didn't it we don't know because we dispatched the answer to the client that we got the data after the first two servers answered so we don't know yet maybe it got the update maybe it didn't maybe there was an issue in theory if it wasn't updated and we know about it a hinted handoff was stored as i told you and when this one recovers the hinted handoff will be dispatched back but it may take some time so we know for sure the data was updated on those two and maybe on this one now imagine we are reading the data with consistency level quorum with consistency level quorum we will wait at read time for at least two answers from the server side the client asks a question the query coordinator will ask all the nodes and in this case it is a replica node itself so it stores the data and now there are two options this server will answer first or maybe that server will answer first if this
server answers first the query coordinator compares the results sees that the results are the same right because both were updated we know about that and immediately returns the data to the client all good and if the other server answers and the data is the same it's the same situation just discussed but what happens if this server answers and the query coordinator sees that the data doesn't match the query coordinator is a replica node and this replica node has a different version of the data what will happen the query coordinator compares the data sees it's different and then compares the timestamps of this data every row every value every property has a timestamp of when it was set due to that the query coordinator sees that it has a fresh value and this server has a stale value then two things happen the query coordinator will return the data to the client because it knows the right value the fresh value and it will initiate a read repair so the data on this server will be recovered the fresh value will be dispatched to this server and it will be recovered afterwards so now all three replica servers have the most up-to-date value in this story so that's how immediate consistency works the idea of immediate consistency is very simple with consistency level write plus consistency level read higher than replication factor you have immediate consistency if you do it with consistency level one at write time and consistency level all at read time one plus three is four four is higher than three you get immediate consistency but it's not the recommended way because a read with consistency level all is obviously dangerous and brings you to the cp side too much it won't give you an answer if a single server is not available so the recommended way to reach immediate consistency is write quorum read quorum phew that was quite cool and a final point before we go to the exercises data is distributed globally what does that mean we can have multiple active data centers in multiple regions as we discussed already but also those data
centers can be on different platforms it can be amazon web services google cloud microsoft azure or your own data center and it all works like a charm when do we use cassandra there are a few main use cases you need to know first everything that involves high throughput and high volumes of data and scalability a lot of write operations a lot of read operations internet of things event streaming logs any other intensive time series everything that puts a lot of pressure on your database cassandra can handle much more than other databases then availability if your data set is mission critical you cannot afford losing it you cannot afford any downtime that's a perfect case then we speak about distributed data because you have clients all across the planet totally fine we have global presence with cassandra and finally the cloud native thing you can have all the modern cloud applications available in the cloud with kubernetes for example cassandra perfectly runs on kubernetes or with whatever you want but you know what talk is cheap so let's take a look at the code hi stefano hey alex hi so it looks like it's your turn very good i will switch to your screen in a moment our friends in the chat are rubbing their hands ready to you know touch the keyboard and do something right right so let's see let me ask nightbot to refresh our memory on the url of the github repository and one more thing today you will do the cassandra exercises with astra db which is a managed cassandra in the cloud i told you cassandra has a thousand levers knobs and buttons and you need someone who knows how to push them you have it because well we are datastax and we know which lever to pull and which button to push so you can go with deploying cassandra on your own that's totally fine but we prefer to go with astra db which is a managed cassandra-based database in the cloud and you will use it today for our exercises don't forget you have
a free tier on datastax astra so you can use it even for a small production workload and it still will be free for you only when you exceed around 20 million reads or writes per month or 80 gigabytes of data you may be asked to pay well for production workloads that totally costs yeah but if you get that kind of workload probably you are running a business already right so you have some money already yeah hopefully so okay so i'm ready to switch to your screen okay i'm ready with my screen here okay thank you there we go okay so oh sorry this was me okay so you have a link in the youtube chat to the repository let me quickly show you how it is structured because you might want to do the workshop exercises by yourself and this is absolutely totally possible because all instructions are here for you to follow at your own timing so there are explanations on how to do and submit the homework we will skip them because you will do the homework later so we will first create an astra db instance and then we will create some tables and put and read and change and delete some data in the tables that's the first contact with cassandra and we start without further ado we start with creating the instance so there is a link here you see section one create an astra db instance alex already did a wonderful job explaining to you what astra db is so i will not spend much time it is a database offered to you as a service in the cloud so you don't have to get busy with operations and maintenance it's just there for you to use with a few clicks and it has a very generous free tier so okay nightbot kindly gave us the link to go and i ask you to click on this astrodev 12-15 link which will bring you to such a screen where you can register to astra db that's totally free there is no credit card requested it is just a very easy way to get started with cassandra without all of the operational burdens so please you can register with github google or
name email and password i will give you a few minutes to you know create your account obviously i do already have one big surprise so i will cheat and just jump on this browser tab where i am already logged in the free tier is supposedly available forever as far as i know and you will have this 25 dollars free credit renewed every month so you will actually be able to run a small workload a small production application forever for free as long as you stay within these limits which are pretty generous because as we said it's 80 gigabytes of storage and 20 million read write operations per month okay and believe me it's pretty hard to use them up in a month i see a very hot topic question is cassandra affected by the latest log4j security flaw well there have been a few blog posts on that as far as i know no the few versions that were vulnerable have been patched but most were not even using the versions that are known to be vulnerable and guess what if you use astra db you do not even have to worry about that because that's all handled for you as a cloud offering so yes 25 dollars every month but this is not a cost you have to pay you don't have to pay anything it's the other way around astra db gives you 25 dollars of free credit to use every month i'm afraid we have to start we are already a little bit out of time yeah i know but that's important so okay i assume you have registered to astra db and you will have a button create database actually this is something that you should see here on the left in your database dashboard which is presumably empty so please go ahead click this create database button and i ask you to use the names we suggested to make everything easier for you so i have them here i'll paste them in the youtube chat but anyway it's all explained here in the readme please use workshops for the database name and chatsandra for the keyspace name so let me do that my database and chatsandra is
my keyspace name that's a funny name and that's because the sample exercise we are going to do is based on a small toy chat application then to create your database you will have to choose a provider because astra db is multi-provider you can choose whatever you want let me go for aws europe and ireland so you choose any provider any area and any region and then you hit create database now as i click this create database button behind the scenes a new database is created for you within the datastax cloud account and it will soon be made available for you in this dashboard so this is my astra db dashboard and you see i already have some other databases but the important thing is that there is a database here called workshops and okay i made my web page a bit big to make it easier for you to read but you see there is this pending status that's fine and good because the database is being brought to existence and that will take a few minutes maybe a couple of minutes now please go ahead create your database and as soon as you see your workshops database in the pending state on this dashboard please give me a thumbs up in the chat so we have an idea how you are doing so while we wait let me go back to the created database screen okay someone okay yasmine andreas rocksheet very good so maybe i'll paste again the database name and keyspace name okay i see many more thumbs up giuliano nice to see you again ciao okay very good so okay so while we wait a few words there are many more explanations in this github repository in the readme and by the way the slides for the presentation are there for you to use at your leisure so in the readme you will be able to find many more details let me go ahead because okay so my workshops database apparently has become active you see here now it has jumped to the top position it's active that means that i'm able to start working with
that i bet many of your databases also have become active in the meantime so i know we are a bit short on time so alex what if i start with the next steps in the practice i think so sure we will have to do it anyway now or later so it's totally fine okay so in the readme in the github repository you see the next step chapter 2 is create tables because as you saw a cassandra database is based on tables now there is a little explanation here as to how the data modeling process works in cassandra that is a whole topic in itself very interesting and very surprising to some of you probably judging from the questions we have seen in the chat we just jump in and see if that surprises you okay so it looks like there is something weird going on here some glitch no but it looks like everything is back on track okay i see my workshops db again so it should be good okay so first thing we have to get to interacting with the database so you see there is this workshops database here if you click on it you will get to the individual database dashboard which has a few tabs here one of them the one we are going to use today is the cql console let me tell you one thing cassandra and astra db alike have very many ways to interact with them there are rest apis document apis layers on top provided by something called stargate there is grpc a lot of different ways to interact with cassandra maybe sometimes through different layers on top of it today we are going to use i would say the most fundamental way to speak to cassandra that is a language called cql cassandra query language and that has the shape of commands i type into a console so if you click on this cql console tab here you should be able to reach this console you see it is in the browser but it is a console where you type commands and you run them and this is what you are going to do so let me go back to the readme first thing
we want to do. So, okay, one thing: commands in the CQL console should end with a semicolon, and if you get stuck, you can just Ctrl-C, as I'm doing now, to interrupt the previous command and start afresh — that's just to get you started. So, if you managed to get your CQL console running and connected, the first thing we are going to look at is the keyspaces that are there. So: desc, short for describe — desc keyspaces, that's a command, I end it with a semicolon, I hit enter, and look, I get an output. These are the keyspaces that are there. Now, what's a keyspace? It's just a way to logically group tables together. There are some system keyspaces you can see here, and one of them is our good old chatsandra. Now we have to tell the CQL console that we are starting to work in that keyspace, so I will type use chatsandra, like this, semicolon, enter. Now you notice the prompt has changed, telling me I'm working within that keyspace, which is just an empty container so far that will contain tables. Let's start creating some tables — back to the readme, where every command is there for you to examine later if you want. Now we want to create tables, and you do that with the CREATE TABLE command. Let me say one thing: CQL looks like SQL to some extent, I would say very much, but the underlying database architecture is different. Anyway, the syntax of a CREATE TABLE command is not so different from what you would see in SQL. So this is your CREATE TABLE statement. We are now creating a table called users, because a chat application will have some users around, and the columns of this table will be email, name, password and user id. The important thing is that you see this statement PRIMARY KEY ((email)): that's exactly what this line specifies, the partitioning of the table. We will be assigning every different value of email to a different partition in the table, which in principle implies that every single email value may end up on a different node — because that's how partitioning works.
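Put together, the console steps just described look roughly like this (the readme has the exact statements; the column types below are assumptions, since the session only names the columns):

```cql
-- list the keyspaces, then switch into the workshop one
DESC KEYSPACES;
USE chatsandra;

-- a table of users, partitioned by email
-- (column types are a guess; only the column names were mentioned)
CREATE TABLE IF NOT EXISTS users (
    email    TEXT,
    name     TEXT,
    password TEXT,
    user_id  UUID,
    PRIMARY KEY ((email))
);
```

The double parentheses around email mark it as the partition key; with a single-column key the inner pair is optional, but it becomes meaningful once clustering columns appear later.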
That's how partitioning works — you saw Alex has been talking about hashing and distribution of data. Oh, thank you mikeg. I'll try to go quickly because we don't have much time left. So please take this CREATE TABLE command, paste it into the console, and hit enter, and in a split second the table should have been created for you in this chatsandra keyspace. So we have a table, very nice — a table for users. Now we can ask the CQL console to tell us which tables are in this keyspace. Well, msloex73, the chat will not allow pasting long CQL snippets, but you find everything in the GitHub readme; I'm just following the readme, which is there for you to follow anytime, to re-watch and play with and do whatever. Okay, good. So, back to the CQL console: there is a table called users, the one we just created. Let me go forward. We created our table with users, but we need another couple of tables for a chat application: tables that will contain the posts by the users, so what they write in the chat. And that's a subtlety very much connected to how the data modeling process works in Cassandra. We want our application to be able to do two different queries on the posts: we want all posts by a given user, but at the same time we want all posts for a given chat room, because there are several chat rooms, surprise surprise. That implies that we will create two different tables with the same content, partitioned in different ways. You remember this partitioning point: the trick is that we never want a query to have to access several nodes, because that would be slow. So if we want all posts for a given chat room, we want to partition by room, and if we want all posts by a given user, we want to partition by user — and that means two different tables. Alex, I think I did a good job at summarizing data modeling in 22 seconds, so I will now
jump ahead. You will later have a closer look at these two CREATE TABLE statements: they have the same columns, just a different name of course and a different partitioning — one is partitioned by user id and the other by room id — and they will really be two different tables containing the same data. It will be our application's responsibility to ensure writes happen on both tables in a parallel, aligned way. So let me paste these two CREATE TABLE commands, and done. I think we now have three tables in the keyspace, so desc tables will tell me there are three tables. Very nice. Guess what, the tables are now empty. I see very nice, very interesting questions in the chat, but unfortunately I will not pick them up because we are short on time. I think we will all be delighted to carry over the conversation on Discord, even after the workshop has ended, in the following days, so please, we can go there. There is also a question about how the primary key syntax is structured: the partition key here is not composite, but there is also something within the partition called a clustering column, and the partition key can be composite — it can be, but it is not in this example, because the double bracket syntax specifies that only user id, or only room id, is the partition key. But we will come to that later. Okay, so Alex, do we want to stop here, now that we have all the tables, before starting to do the CRUD, or shall I just move on to the data insertion? As you wish, I'm fine with whatever works for you. Okay, so let's satisfy the curiosity and just insert a few rows and read them, as a small taste of what's to come. We have three tables — posts_by_user, users and posts_by_room — and of course they are all empty, so it's time to start putting some data in them, which is the set of operations that I would say all databases have to support, called CRUD: create, read, update and delete. And we start with create.
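The two post-table statements pasted here would look something like the following sketch — the names follow what was said in the session, while the column types (and the choice of post_id as the clustering column) are assumptions:

```cql
-- two tables holding the same chat posts, partitioned differently:
-- one answers "all posts by a user", the other "all posts in a room"
CREATE TABLE IF NOT EXISTS posts_by_user (
    user_id UUID,
    post_id TIMEUUID,
    room_id TEXT,
    text    TEXT,
    PRIMARY KEY ((user_id), post_id)
);

CREATE TABLE IF NOT EXISTS posts_by_room (
    room_id TEXT,
    post_id TIMEUUID,
    user_id UUID,
    text    TEXT,
    PRIMARY KEY ((room_id), post_id)
);
```

Same columns, same data, two partitionings — the denormalization that lets each query stay inside a single partition.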
Create: we insert data. Now, in this fictional chat application I will have three different users, and they are identified by three different UUIDs — a kind of universally unique identifier — and let's insert the first of them. It's written here in a way that makes it easy to read all of it; it seems a bit verbose, but anyway, how do we insert a row in the users table? INSERT INTO the table, then the field names, and then VALUES with a value for each field. So this is Mr. All-Ones — you see his UUID is all ones. That's absolutely a fictional thing, just for the sake of illustration; usually you would generate these IDs programmatically, or have CQL itself generate them, but let's make it easy. So with this INSERT statement we are ready to insert the first row of data into our users table. Nothing different to create user All-Fives and user All-Nines, so I'll just copy and paste them here. For most of you, paste with Ctrl-V will not work in the CQL console in the browser, so you have to right-click and click paste — maybe you noticed me doing that. Okay, so we have inserted some data. I invite you to later have a closer look at the insertions that come next, which are the insertions into the post tables. That's a longer set of insertions, but they all look the same: INSERT INTO blah blah blah VALUES blah blah blah. So let me just paste all of them — these are insertions into the posts_by_user table. But you remember we have two tables hosting the same pieces of data, just partitioned differently. Okay, so Ctrl-Shift-V seems to work for some of you — maybe I missed the Shift, or it might be a different operating system or browser, who knows. Thank you for the remark anyway. So, we inserted a bunch of chat items into the posts_by_user table; let's do the very same in the posts_by_room table, because it's our job to keep the two tables aligned.
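As a sketch, the insertions just pasted follow this pattern — only the all-ones UUID comes from the session, the other literal values are made up for illustration:

```cql
-- creating user "Mr. All-Ones"
INSERT INTO users (email, name, password, user_id)
VALUES ('ones@example.com', 'Mr. All-Ones', 'secret',
        11111111-1111-1111-1111-111111111111);

-- the post insertions all look the same, done once per table so the
-- two denormalized tables stay aligned
INSERT INTO posts_by_user (user_id, post_id, room_id, text)
VALUES (11111111-1111-1111-1111-111111111111, now(), 'room1', 'hello!');

INSERT INTO posts_by_room (room_id, post_id, user_id, text)
VALUES ('room1', now(), 11111111-1111-1111-1111-111111111111, 'hello!');
```

Note that the application (here: us, at the console) issues both post writes; Cassandra itself does not keep the two tables in sync.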
We are the application now — we pretend to be the application — and it is our responsibility to keep the two tables aligned. So I have pasted another bunch of inserts, and now my two post tables are filled with posts. By the way, the insertions are very similar; the only difference in this case is the name of the table, because they are really the same table, in a sense. Okay, I guess you might be very curious, so let's just start with the reading part, because we inserted a bunch of stuff, and I will show you how you read posts from a table. You do it with the SELECT command: SELECT star FROM table name, semicolon. And if you run this — surprise surprise, I didn't copy it, and Ctrl-Shift-V doesn't work for me, so I have to right-click and paste — okay, this is my SELECT result. You see my posts are there to be seen: user id, post id, room id and text. This is the table posts_by_user, and indeed you see the rows are grouped by user: all rows by user one are together, then all rows by user five are together, and the one row by user nine is there. This reflects the partitioning of the table. And Alex, before going forward, I'll jump back to you if it's okay, and we will resume here with more reading of data. Yeah, sure — give me a second to switch back to my screen, and... good. Okay, so let's go on and take a closer look at the data. It starts with tables and partitions. By the way, the most common mistake in the NoSQL world: if you ask developers, "are NoSQL databases schema-less?", most of them will answer yes, they are schema-less. But not you, because you are the best: not all NoSQL databases are schema-less. There are strict-schema databases, and Cassandra is one of them. Now let's make a deeper dive into the tables. We start with the cell data structure: a cell is a very simple thing, the intersection of a row and a column, and it stores some data value. Then, moving higher and higher, we get to a row: a row is a single structural data item in a table,
usually keeping multiple values. Why? Because there are some columns. But before we go to columns, again, there is a very important thing, which is the partition. A partition is a group of rows having the same partition token. The partition is the base unit of access in Cassandra; importantly, a partition is stored together — all its rows are guaranteed to be neighbors, and that matters in a distributed system. So when you create a table, you have to designate a partition key, which will identify the partitions. What's the typical mistake of people working with Cassandra who have a relational database background? They think in terms of database, table, column — and that's a common reason for mistakes when working with Cassandra: misunderstanding or underestimating the partition may lead to big problems. In Cassandra we work with a database (a cluster), then keyspace, then table, then partition, and only then rows, because the partition is the base unit of access. And then a table is a group of columns and rows storing partitions. So far so good — you've seen a little bit of that already in the lab with Stefano. Taking an overall look: we have a keyspace as a group of tables, then tables within the keyspace, and partitions within those tables. Let's take a look at creating a table — actually, I believe you did that already. There are a few things you have to specify: the keyspace the table will belong to, the name the table will have, some columns with names and types — and we have not only the text data type, of course, but much more — and the primary key, which will consist of a partition key and some clustering columns. And before getting to that: when we create a keyspace, we have to specify two things. We have to specify a keyspace name, of course, and there are some settings, most of them defined by default so you can skip them at creation time, but one thing you always have to specify is a replication strategy. The replication strategy must always
specify the class of the strategy and a replication factor per data center. With the classes it's easy: you have the NetworkTopologyStrategy class and the SimpleStrategy class. SimpleStrategy is, again, simple — it's good for your laptop and bad for anything else. NetworkTopologyStrategy is great because it understands the network topology, obviously (well, I'm quite obvious today), of your network, of your data centers. That means the replicas of the data — the replica servers for the token ranges — will be allocated so that each replica stays as far away from the others as possible. For example, you don't want all of your replicas to be within the same server rack, because servers in a single rack have the tendency to fail together, for example because of a power outage. If we speak about clouds and availability zones, you want to have your replicas in different availability zones, so that outages — like the recent outage at AWS — won't affect your data. In this example I specify NetworkTopologyStrategy, the primary class for all production deployments, with the us-west-1 data center at replication factor three and the us-east-2 data center at replication factor five, just to show that I can have more than three. Good, so now let's get to the keys — primary key, partition key, clustering columns — what does it all mean? First, the primary key is an identifier for a row; its primary and main responsibility is to ensure uniqueness. Take a look here: we create a table users_by_city storing data of our users — last name, first name, address and so on — and I have a primary key of city, last name, first name and email. What happens if I do not specify email as a part of the primary key, if it is only city, last name, first name, without email? Do you know that there are a couple of dozen John Smiths in New York? Pretty obviously, every next John Smith will be a very bright surprise for the previous one, because Cassandra does upserts: the data of the next John Smith will overwrite the data of the previous John Smith.
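Going back for a second to the keyspace creation just described — written out, the replication setup with two data centers would look like this (the keyspace name is reused from the hands-on part; the data center names and factors follow the example as spoken):

```cql
-- production-style keyspace: replicas spread across two data centers,
-- placed rack- and AZ-aware by NetworkTopologyStrategy
CREATE KEYSPACE IF NOT EXISTS chatsandra
  WITH replication = {
    'class'    : 'NetworkTopologyStrategy',
    'us-west-1': 3,
    'us-east-2': 5
  };
```

On Astra DB the replication settings are managed for you, so this statement is mostly relevant for self-managed clusters.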
The data of the next John Smith overwrites the data of the previous one. There is a tool to avoid that, called lightweight transactions, but I'm afraid we don't cover it today. Okay, so the responsibility of a primary key is to ensure uniqueness of the row — it defines the uniqueness of a row. Then there is the partition key, which is an essential part of the primary key. As you can guess, the partition key defines the partition: in this case, for the table users_by_city, we group our users by the city they live in. The partition key may consist of a single column or multiple columns, but it must always be present; you cannot omit the partition key. Then clustering columns are optional, and they are used for two purposes. The first purpose, as I discussed already: a primary key of city, last name, first name will not be unique and data can be overwritten; with a primary key of city, last name, first name, email, the data will be unique. That's quite clear. But there is one more responsibility: data in tables is sorted by default at write time, in memory, based on your clustering columns. It's a very cool feature: if you design your tables properly, then at select time you don't have to sort your data, because it was already sorted at write time, in memory, not on disk — that's much cheaper and much faster. In this case, storing comments for a video on YouTube, we specified in the primary key not only video id as the partition key and comment id, but we also specified a created_at time, so all the comments will be sorted by the created_at time, and when we want to pull them to show them to a client, we immediately get them sorted, with no additional effort required to sort them. There is a great question by Wan Lang Zhang: with no index, how does Cassandra optimize the query if the queried columns are not the partition or token key? You will see — we have a dedicated part on that. There are secondary indexes in Cassandra, but maybe we prefer primary indexes and denormalization, as you will see now.
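The video-comments example just described could be sketched like this — the table and column names are assumptions, but the key structure (video id as partition key, created_at and comment id as clustering columns) follows the slide:

```cql
-- comments partitioned by video, sorted newest-first already at write time
CREATE TABLE IF NOT EXISTS comments_by_video (
    video_id   UUID,
    created_at TIMESTAMP,
    comment_id TIMEUUID,
    author     TEXT,
    comment    TEXT,
    PRIMARY KEY ((video_id), created_at, comment_id)
) WITH CLUSTERING ORDER BY (created_at DESC, comment_id DESC);
```

Because rows are kept in clustering order inside the partition, reading the latest comments needs no sort step at query time.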
We prefer primary indexes and denormalization, as you will see now. Team, take a look: those next slides have the slide-of-the-year award. Why? Because those are the most important slides you have been seeing in the year 2021 — they are really, really very important. If you want to fail with Apache Cassandra, doing partitioning wrong is the easiest way to fail. But you know what, I don't want you to fail — we want you to succeed, so please pay attention now. Now, take a look — it got a little bit longer than expected, sorry, so please stay with us till the end of this rules-of-a-good-partition part; if you have to leave at the scheduled time, then please jump in and watch the recording later, or just join us for tomorrow's run of this workshop. So, the most important slides of this presentation: partitioning is everything. What are the rules of a good partition — how do you create a good partition design? First: store together what you retrieve together. What does that mean? We store in a single partition things that we usually retrieve from the database together. For example, on YouTube you open a video, you get a bunch of comments, right? So in this case we group the comments for the videos of our, let's say, YouTube competitor by video id. That means that comments for a video will be grouped together in a partition, and when you want to show a visitor the most recent of them, this query will do its job very easily, reaching the particular replica node responsible for this particular partition. It's very simple — the point is that Cassandra drivers are very smart: they know which server is responsible for which partition. When you make a query — give me the last 100 comments for video id one-two-three — as this video id is tokenized with the Murmur3 hasher, your driver will know which particular server to ask for this data. It doesn't matter if you have one server or 1000 servers, it will still be a very quick operation. Next: avoid big partitions.
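Assuming a comments-by-video table like the one described on the slides, the "last 100 comments" query is a single-partition read — one token, one replica set (the video id below is a made-up illustration):

```cql
-- one partition, one set of replicas: the driver routes this directly
-- to a node owning the token of this video_id
SELECT author, comment, created_at
  FROM comments_by_video
 WHERE video_id = 12312312-0000-0000-0000-000000000123
 LIMIT 100;
```

Because the rows are already stored newest-first within the partition, LIMIT 100 simply takes the head of the partition.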
Do you remember why we do partitioning after all? Because we don't want to put too much data on each server — we want to be able to scale out easily. And if you have big partitions, you basically get back to the same problem you were trying to avoid. Take a look: when we define comments partitioned by video id, you may say there are some videos on YouTube with millions of comments, which is technically right. But take a look: each comment is actually quite short — a text, a user id, a username, a timestamp, a video id, and that's it. What is it, a couple of bytes? Well, okay, a little bit more than a couple of bytes, but still not so much, so this partition may be more or less fine. But if you group your customers based on country, you know what, there will be some countries like San Marino or the Vatican, and there will be some countries like China or India with really a lot of people. You see, it's important for you to have an even data distribution in your cluster — and then you have some partitions like that, and some partitions that don't even fit on your monitor. That's not nice, that's bad for the cluster. So it's very important to keep partitions not too big. What does that mean? We do not recommend having more than 100,000 rows in a partition — but that's a recommendation. We do not recommend having more than 100 megabytes in a partition — that's a recommendation. We have seen clusters with partitions of a size of more than a terabyte, but it's definitely a bad idea; you don't want that on your production cluster. There is only one hard limit: not more than two billion cells per partition — a cell being the intersection of a column and a row. Quite clear. And one more thing: thinking ahead, we have to avoid not only big partitions but also constantly growing partitions. That means that in the beginning your partition may be small enough, but if you didn't think ahead, it may grow over time. Take a look: we have a table storing information from
our IoT sensor network, and the sensor network reports all the time: a sensor id — and you have billions of sensors — a timestamp, and some value: temperature maybe, humidity maybe, something else, door is open, lock is closed, things like that. IoT. Then we have the table designed like that, partitioned by sensor id. Will it work? In the beginning it will work, because you don't have too much data, but as a sensor reports its state every few seconds, it will grow over time, and within a few months you will have dozens of huge partitions. There are a couple of ways to handle that. By the way, you can think of time-to-live when storing data into Cassandra: you can specify a TTL per row, and even per column value within a row, and the data will be deleted after that time automatically — that's quite cool. But what if you don't want to delete the data after some months or years — what if you may be required to have this data available after many years? We told you already that partition keys may be compound, composite: they may consist of multiple fields. Take a look, there is a thing to save you — it may really save your production cluster — and it's called bucketing. Bucketing means that you may have multiple columns within your partition key, so the partition key will be defined not only by sensor id: as a result, our partition will be based on sensor and month of the year. When the month is over, the partition is closed and we will never write to it again, because a new partition will be created. The month-of-year in this case can be an integer or a string, something like the year plus the month. Okay, anyway, we are running short on time, so having explained how to keep partitions small, we will quickly run through the rest.
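The bucketing idea just described could be sketched like this — the table and column names and types are assumptions, but the composite partition key (sensor id plus a month bucket) follows the example:

```cql
-- bucketing: the partition key combines sensor id with a month bucket,
-- so each partition stops growing once its month is over
CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id   UUID,
    month_year  TEXT,       -- e.g. '2021-12'; an int like 202112 works too
    reported_at TIMESTAMP,
    value       DOUBLE,
    PRIMARY KEY ((sensor_id, month_year), reported_at)
) WITH CLUSTERING ORDER BY (reported_at DESC);

-- the TTL alternative: expire a reading automatically after 30 days
INSERT INTO sensor_readings (sensor_id, month_year, reported_at, value)
VALUES (uuid(), '2021-12', toTimestamp(now()), 21.5)
USING TTL 2592000;
```

Queries must then supply both partition-key columns (sensor and month), which is usually natural for time-windowed reads.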
If you want to learn more, subscribe to our YouTube channel and Twitch channel, go to datastax.com — we have a lot of interesting scenarios on datastax.com/learn, and so on. Now we are getting to the sweetest part — that's my favorite slide — certification. There were questions on this point right at the beginning, so it's good that we come to this. I hope the one who asked the question survived the YouTube outage. So, did you know DataStax sponsors your education and certification? I believe you do already, so get your voucher and become a certified NoSQL developer for free. Now, take a look at what you get: a voucher doesn't mean you are already a certified developer. We want those certificates to be really valuable, so that when you show it at a job interview — "I am a certified Apache Cassandra developer" — people take it as a real value. That means there is a real exam, with proctoring, with quite hard questions, and to pass this exam you have to learn for it; one single two-hour workshop is not enough to pass the exam. So you go to the academy,
you choose your path — most probably that's going to be the developer certification, or maybe you're more interested in administration, Apache Cassandra operations — and you go through this path. It's going to be free courses: 101, 201 and 220 for developers, or 101, 201 and 210 for administrators, and after that you will be ready to take the exam and pass it. But before that you have to grab your certification voucher. How do you do that? You go to datastax.io/workshop-voucher — and do it right now, because the form will be closed soon; okay, you have a few minutes, but don't wait for too long. Take a course, get the voucher, sign up for the certification — there is more information about that at dev.datastax.com/certifications — pass the certification, write about it on LinkedIn, tag Stefano and me, and we will be happy to answer your questions. Totally, and we will be happy to congratulate you, and that's going to be quite a cool experience. I'm trying to jump in on Twitch, I want to paste you links to our... okay, I cannot sign in on Twitch, so find us on Discord — it's complicated, but please do reach us on Discord. Good, and a few last words: next week we have a workshop on building an e-commerce app with Apache Cassandra. You know the Cassandra basics already, but it never hurts to learn one more page and get some practical experience building an application with Apache Cassandra. So that's on the 21st of December. We do one workshop next week, right, or two? I think it's only one, because of the holidays coming. Yep. I hope — it's European time, 5 p.m.; for you it may be different, but anyway, check it at datastax.com/workshops. Next month we run a bootcamp, and that will be much bigger: the first workshop will basically be the same introduction to Apache Cassandra, then we go for data modeling, then we go for building
a full-stack back-end application, and then finally we discuss microservices in the last workshop of this series. So for January, plan to spend some time with DataStax developers — it's going to be a very interesting show. We do workshops weekly; don't forget to like and subscribe. It's a hard feeling to ask you to like our YouTube stream when the YouTube stream has died, but normally it works — that's the first time in the last year or so. Yeah. Join our Discord community: it has more than 17,000 developers talking about different IT things, not only Cassandra stuff. Yes — that was Stefano Lottini and Aleks Volochnev, thank you so much for joining, and sorry for all the troubles and inconsistencies. Finally we are done, and you rock — you guys are just the best, you are heroes, just because you stuck with us so far. Thank you, it was great. See you next time. A little bit of music for you at the end, and we are done for today. Thank you for coming, and see you at the next workshop, or see you on Discord soon.
Info
Channel: DataStax Developers
Views: 839
Keywords: DataStax Enterprise, Data Stax, datastax, DSE6, DSE 6, 6.0, cloud, cloud database, databases, nosql, no sql, data modeling, software development, Apache Cassandra, cassandra, spark, Apache Spark, Solr, Apache Solr, graph, Gremlin, TinkerPop, Apache TinkerPop, real-time engineering, software architecture, DBaaS, customer experience, help, academy tutorial recipes, how-to, step by step guide
Id: YlVNrgFz4NM
Length: 129min 25sec (7765 seconds)
Published: Wed Dec 15 2021