Advanced Data Modeling in Apache Cassandra™

Captions
Hello everyone! If you can see us, if you can hear us, give us a thumbs up, because we are about to start one of the most incredible workshops in our history: Advanced Data Modeling. I'm Alex Volochnev, Developer Advocate at DataStax, and with me today is a very special guest, Mr. Artem Chebotko — sorry, I'm making mistakes here as well. It looks like we are good. "Hello everyone, nice to be here, thank you for coming." Okay, so we are all set and the sound quality should be fine. I see some people from yesterday's meetup at Polska Data Ops — hi, Mr. Ribeiro. About 120 people are watching us right now, which is great, so I think it's a good time to start.

First things first: our speakers for today. Our main speaker is Artem, Solution Architect at DataStax, author of the data modeling methodology for Apache Cassandra, one of the best data modeling experts in the world, with many years in database research and many publications in international venues on all things data. And I guess you know me already: I'm a Developer Advocate at DataStax, running a lot of workshops and educational events for you.

Before we start, a little bit of housekeeping. As we usually do, we stream on YouTube, and most probably you are watching us live on YouTube, but if you have any problems with the YouTube stream — delays or whatever — you can switch to Twitch, our plan B, where you can get another stream. One important thing: we don't monitor the Twitch comments, so if you want to ask questions, that brings us to the second point: questions. The easiest way to ask questions is the YouTube chat, but it's not the best one, because as soon as the YouTube stream is over, the chat is gone. We recommend you use Discord: we have our own Discord server with around eight or nine thousand people, where you can ask questions and talk to DataStax and Cassandra experts from all over the world. To do the homework — notice I said homework — and all the exercises on your own, we suggest using Katacoda at katacoda.com/datastax; we will send a more precise link later on.

So the first thing that matters to us is how Cassandra organizes data. This is one of the most important workshops we run, because you cannot deliver a successful and efficient application on Cassandra if your data model is wrong. Cassandra offers great features — the highest possible availability, incredible performance — but if your data model is bad, it will not work. Efficiency here is a shared responsibility: you need to not only install and run Cassandra, but also design your data model properly. That's what this workshop is about.

So, how does Cassandra organize data? The first thing we start with is the keyspace. A keyspace is a grouping of tables with shared configuration settings. In this example we create a keyspace called library, and when you create a keyspace you must define at least one configuration property: the replication strategy. Here the replication is going to use NetworkTopologyStrategy. There are two kinds of strategies: SimpleStrategy, which is good for your laptop and bad for anything else, and NetworkTopologyStrategy, which is good for production or staging environments. It's called NetworkTopologyStrategy because it operates at the level of the network topology.
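As a rough sketch of the keyspace being described (the data-center identifiers here are placeholders; the slide may use different names):

  CREATE KEYSPACE library
    WITH REPLICATION = {
      'class'  : 'NetworkTopologyStrategy',
      'DC-West': 3,   -- replication factor 3 in the "west" data center
      'DC-East': 5    -- replication factor 5 in the "east" data center
    };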
So it understands your servers, server racks, and data centers. In this case we create the keyspace — the group of tables — with data center west at replication factor 3 and data center east at replication factor 5.

For every keyspace we may of course have multiple tables, and we store data in those tables. In this case, in the keyspace library, we have the tables artifacts and venues_by_year. That's our keyspace, described above as a group of tables, and here is an example of how we create a table: CREATE TABLE keyspace.table_name — venues_by_year — followed by the description of the table, how it is structured and organized. We have year as an integer, name as text, country as text, homepage as text, and a primary key consisting of year and name. Name is a clustering column and year is the partition key; therefore, for our table venues_by_year, data will be partitioned by year.

The partition is a very important idea: the partition is the base unit of access. Everything in Cassandra is organized around partitions, and if you don't understand partitions there is very little chance of building an efficient data model, so we will talk more about that today. In this example we created the table venues_by_year with year as the partition key, so we will have as many partitions for this table as there are distinct years in it. Here we have rows for year 2019 and year 2015, so we have two partitions.

How are those partitions distributed? Remember our keyspace definition: DC west with replication factor 3, DC east with 5. We may have more data centers in the cluster, but it doesn't matter: since they are not mentioned in the keyspace definition, this keyspace's tables will not be replicated there — we essentially ignore them — and we only work with those two. Then the data is distributed: in the west data center, every partition of every table in this keyspace is replicated three times, according to the replication factor of 3. So for year 2015 we have copies one, two, three, and for year 2019 we have copies one, two, three. The same idea applies to the east data center, but since it has replication factor 5 we get five copies of the first partition and five copies of the second partition, and obviously some nodes will end up storing both partitions.

Good. How do we operate on data using the Cassandra Query Language, CQL? There are two main parts: the data definition language and the data manipulation language. The data definition language is everything we use to manage the schema, how we organize and structure our data: CREATE KEYSPACE and CREATE TABLE, which should be pretty obvious by now; CREATE INDEX and CREATE CUSTOM INDEX, for creating secondary indexes; and CREATE MATERIALIZED VIEW, for creating materialized views. For data manipulation we have all the standard operations you would expect on the data itself: SELECT, INSERT, UPDATE, DELETE. So no big surprises — it's still create, read, update, delete, typical of every database.

When we talk about CREATE TABLE in CQL, we have to discuss the following things. First, when we create a table we obviously give it a table name — that's pretty clear. Then we have to define the columns, what will go into this table: Cassandra is not schemaless, so we need clear table definitions.
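To make that concrete, here is a minimal sketch of the venues_by_year table as just described (assuming it lives in the library keyspace):

  CREATE TABLE library.venues_by_year (
    year     INT,
    name     TEXT,
    country  TEXT,
    homepage TEXT,
    PRIMARY KEY ((year), name)   -- year = partition key, name = clustering column
  );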
For each column we have a column name and a column type — int, blob, or whatever fits your situation. Then a column may be static. Static raises a lot of questions; it's not a very widely used option. It allows you to share the same value across the whole partition: you set it once and it applies to the whole partition. Then comes the primary key: the first part is the partition key and the second part is the clustering columns. Finally, we can define the clustering order.

I hope you remember clustering columns. We use clustering columns for two purposes. The first is to ensure uniqueness: if the partition key alone is not enough, we have to add columns to make the primary key unique, so that we don't overwrite data with our upserts. The second case is when we want to establish a sorting order. For example, imagine we are building YouTube: we have a video and comments on that video. When I open a video I want to see the comments — well, usually I don't want to see the comments, they are mostly awful, but let's imagine I do — and I want those comments sorted, most recent first. With a clustering column I establish not only uniqueness but also the sorting order, the row ordering, within the partition.

Since the partition is the base unit of access, basically all of data modeling is about designing good partitions — let's say 80 percent of the work is designing good partitions. We can have two kinds of partitions: single-row partitions and multi-row partitions. What is a single-row partition? It's what I get if my partition key is a unique value which never repeats itself. For example, here I have a table users where id is a UUID — have you seen those very, very long strings before? They are called universally unique identifiers, and they are used all over Cassandra and Cassandra-based applications — plus name text and email text. With this design — CREATE TABLE users (id, name, email, PRIMARY KEY (id)) — id is the partition key. What does that mean? I will have as many partitions as I have users, and that's perfectly fine. I don't know why many people worry about this — "oh, I'm going to have so many partitions, that's bad". No, no, no, that's not bad; it's okay to have a lot of partitions, don't worry. Even if you have billions and billions of users, you may have billions and billions of partitions; that's fine.

The second option is multi-row partitions. Multi-row partitions mean that some of my rows will share the same partition key value. In this example I am creating a table of artifacts by venue, with venue text, year int, artifact text, title text, country text static, and the primary key ((venue, year), artifact). Here I'm using a great technique called a composite partition key: the partition is defined by venue and year combined. So artifacts from the same venue but from different years will be allocated to different partitions, because I want to keep my partitions smaller.
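Sketched out, the two table shapes being contrasted look roughly like this (column names follow the slides; the keyspace name is an assumption):

  -- Single-row partitions: the partition key alone is the primary key,
  -- so each partition holds exactly one row
  CREATE TABLE library.users (
    id    UUID,
    name  TEXT,
    email TEXT,
    PRIMARY KEY ((id))
  );

  -- Multi-row partitions: composite partition key (venue, year),
  -- artifact as a clustering column, and a STATIC column shared by
  -- every row in the partition
  CREATE TABLE library.artifacts_by_venue (
    venue    TEXT,
    year     INT,
    artifact TEXT,
    title    TEXT,
    country  TEXT STATIC,
    PRIMARY KEY ((venue, year), artifact)
  );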
In this case we have DataStax Accelerate, year 2019, with a list of its artifacts and the country United States, and all of those rows share one partition. The same venue in a different year would be a different partition, because of the composite key, and we have multiple artifacts within the same partition. That's very important to understand, because all the rows within the same partition are stored together as a single group; therefore we store together what we want to retrieve together. If I want to load all artifacts for a venue and year, it is one request, one simple query, instead of trying to pull data from all of the servers in my cluster — and I may have a lot of them. So it's very important to understand these things.

There is a great question: "Will there be a performance problem if we have a billion partitions for users?" In terms of partitions, we can have as many as we like; essentially we just have to scale the cluster — add more nodes — so they can handle those billions of partitions. But in terms of rows inside a partition there are some limitations: you usually don't want more than 100,000 rows in one partition. It also depends on the size of each row — do you have just two columns, or 200 columns — so the situation changes depending on that, but in terms of the number of partitions you can have as many as you like. Right — the number of partitions does not in any way decrease performance, and you can have as many partitions as you want; but if your partitions are too big, that's bad, because at some point it becomes hard for the cluster to handle them. So you want smaller partitions rather than bigger ones.

One more question was about static columns. Artem, could you take that one? Yes — it's important to take away a couple of concepts from this introduction, and if I had to name just two, I would say tables with single-row partitions and tables with multi-row partitions. The difference is this: a Cassandra table looks like a relational database table, but the primary key definition is different, because it consists of a partition key and a clustering key. If the clustering key is not there, you have single-row partitions, meaning each partition can only store one row, because the partition key uniquely identifies that row — the partition key is the primary key. For tables with multi-row partitions, the clustering key is present, which means a partition can hold multiple rows. Static columns can only be defined for tables with multi-row partitions, and what they mean is that the static column describes the partition — the whole partition — not the individual rows in it. On this slide, venue and year are the partition key, so that's what defines a partition. Essentially this is a conference — DataStax Accelerate is a conference — identified by the venue name and year, and in this case the country is a descriptor of that conference, not of the individual artifacts stored in the partition. If we don't make country static, then each artifact will have its own country value: artifact a, b, c, and so on, each with a country value. That's not a big deal, but if you need to update that value you will have to update all of them, and it takes a little more space. But in this case it's a static column, so if you want to update
the country, you update it once and it applies to the whole partition, to all artifacts. So that is what a static column is, and we will see more examples. Yep, thank you. And don't be afraid to ask questions — this is a rather unique chance to communicate.

I see more questions about partition size. In terms of partition size, the recommendation sometimes depends on the version of Cassandra. For older versions, the common recommendation is at most 100,000 rows per partition — that's the upper bound — or 100 megabytes; rows can have different sizes, so it's one or the other: no more than 100 megabytes or 100,000 rows. Newer versions bring a lot of improvements, so it can sometimes be several hundred thousand rows, or several hundred megabytes, per partition, but it's easy to remember the rule of 100: 100,000 rows, 100 megabytes. There is also a question: "Do you use S3 (Simple Storage Service) with Cassandra?" My answer: only for backups, maybe — it isn't really related here.

Okay, let's proceed; we have limited time and still a lot to discuss. Next point: as we talk about the data manipulation language, we have to discuss some important points and some limitations you have to consider. Let's take a very simple example, the CQL SELECT, one of the most widely used operations on data. SELECT some selectors — columns, aggregates, or functions, including user-defined functions — FROM table name: that must be familiar to you. And here comes the first very important limitation: WHERE is restricted to primary key columns only, that is, to partition key columns and clustering columns. In this table, venue and year are the partition key parts and artifact is the clustering column, which means we can use them in the WHERE clause: we can filter and search by them. But there is also title, which is text, a so-called data column, and we cannot search by data columns as long as we don't have secondary indexes on them — and secondary indexes are dangerous ground which I believe we don't cover today; maybe Artem will say something about indexes later, but first I want to finish the SELECT topic.

So the WHERE clause is limited to primary key columns; you cannot use data columns there, and that's very important. You must always provide a partition key. Why? When your SELECT statement includes the partition key, it is hashed into a token on the client side, and since the Cassandra driver is smart — it knows the schema and the data allocation across the cluster — it knows which particular servers to ask. So your client, the Cassandra driver, goes directly to a replica node to get the data. That's the fastest possible way: it excludes the extra hops through a remote coordinator node that doesn't have the data stored, and it decreases query time. But if you don't specify the partition key, the driver cannot calculate a token and doesn't know which node to ask. With three nodes that may be acceptable; with three hundred or three thousand nodes you are in deep trouble, because in that case your SELECT statement must go
literally to every node holding this table and ask, "hey, maybe you have data for this query?" That's called a full cluster scan, and it's a very, very bad practice — so bad that by default Cassandra refuses the request: you didn't specify a partition key, so it won't do it; it doesn't work that way. There is a statement modifier, ALLOW FILTERING, with which you can force Cassandra to do it anyway, but it carries a big red danger label: it's a bad practice. If you are responsible for continuous integration, I suggest you include a check for ALLOW FILTERING in your application's code base, because otherwise you may one day discover that the bad performance of your cluster was caused by some junior developer who carelessly added ALLOW FILTERING to their code. Watch out for this kind of thing; it's important to avoid it. So ALLOW FILTERING lets you filter by data columns, or skip partition key columns, in a statement, but it's a very bad practice, and if you have to use ALLOW FILTERING it almost always means your data model is wrong.

Then we can GROUP BY — again, only by primary key columns — ORDER BY clustering key columns, ascending or descending, and LIMIT. About ALLOW FILTERING I've already said everything; maybe you want to add something here? "I think keep going; I can give a short summary or answer some other questions afterwards."

Okay, good. Now some sample CQL queries against the table we mentioned, artifacts_by_venue. SELECT * — in general SELECT * is not the best practice; we mostly recommend specifying the particular columns you want to retrieve, but for demo purposes it's fine. We can ask WHERE venue equals something AND year equals something: we get a whole partition, since we specified all parts of the partition key. Does it work? Yes, perfectly. We can also do venue equals something AND year equals something AND artifact equals something — will it work? Yes, because artifact is a clustering column, so that's fine. And we can do venue equals, year equals, and artifact greater than / artifact less than: again fine, because it's a clustering column, so it's sortable; ORDER BY artifact works for the same reason.

Now let's look at the bad queries — what's wrong with them and why you should avoid them. SELECT * FROM artifacts_by_venue WHERE venue = ... — why is this query bad? The answer is simple: our partition key consists of two parts, venue and year, and we specified only half of the key, so the token cannot be calculated; that's effectively the same as asking without any venue at all — we have no partition, we would have to do a full cluster scan, and that's exactly what you want to avoid. WHERE venue and artifact: the situation is more or less the same — we have venue, we have artifact, but we still don't have the year, so the token still cannot be calculated. WHERE artifact greater than ... AND artifact less than ...: we can range-search by artifact, but we still have no venue and no year, so it will not work. And venue and year and title — why does this one not work? Venue is given, year is given, we have our partition key; what's wrong? Title is a data column, and we cannot search by data columns. And WHERE country = ...? It's the same: country is a static column, and for searching it behaves like a data column — it will not work; we still need to be able to locate the partition.
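As a rough sketch against the artifacts_by_venue table above (the literal values are just placeholders), the contrast between queries that work and queries Cassandra rejects looks like this:

  -- Works: full partition key, optionally narrowed by the clustering column
  SELECT * FROM library.artifacts_by_venue
    WHERE venue = 'DataStax Accelerate' AND year = 2019;

  SELECT * FROM library.artifacts_by_venue
    WHERE venue = 'DataStax Accelerate' AND year = 2019
      AND artifact > 'a' AND artifact < 'm';

  -- Rejected: only half of the partition key, so no token can be computed
  SELECT * FROM library.artifacts_by_venue
    WHERE venue = 'DataStax Accelerate';

  -- Rejected: title is a regular data column (and country is static);
  -- neither can appear in WHERE without an index or ALLOW FILTERING
  SELECT * FROM library.artifacts_by_venue
    WHERE venue = 'DataStax Accelerate' AND year = 2019 AND title = '...';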
So, the most important implications for data modeling that you have to consider. First, primary keys define data uniqueness. If you use something like (city, first_name, last_name) for a list of users, it will only work well as long as there is at most one unique person per city — but I guess there are dozens of John Smiths in New York, and in that case it behaves as an upsert: every next John Smith will overwrite the data of the previous John Smith, which will make them both angry, obviously. We have to avoid that. (There is a way to counteract it with lightweight transactions, but I don't think we cover them today.) Partition keys define data distribution — how data is split into groups and therefore distributed over your cluster — and partition keys directly affect partition sizes: you do not want partitions that are too big or too hot. Too big should be clear; by hot partitions I mean the situation where one partition is not used at all and another is hit all the time — that's also something we try to avoid; we want partitions more or less equally loaded. Clustering keys define row ordering, as has been said already: if you need ordering within your partitions, clustering keys are the easiest way to get it.

Speaking of queries: primary keys define how data is retrieved; we use the components of the primary key to retrieve data from the tables. Partition keys allow only, and only, equality predicates: if your partition key is, say, a year, you can query everything in the partition for year 2020, but you cannot ask for the partition key to be smaller than 2021 — inequality is not supported for partition keys. For clustering keys, inequality predicates and ordering are allowed. And only one table per query: no joins.

Now, that looks like a huge set of limitations which you are not used to in traditional multi-purpose relational databases like, I don't know, PostgreSQL. There is a solid reason for that: Cassandra is designed to be globally available, with multiple data centers, handling petabytes of data and still delivering answers within milliseconds — and it works perfectly in those circumstances if your data model is fine. But there is a price to pay: you cannot have everything at once and for free, and the price is these limitations. If you work with petabytes distributed globally over the United States, Europe, Australia, and India, and still get answers within milliseconds, you have to pay something for that.

Okay, so to close up the first part — maybe you want to add something? Yes, there were quite a few questions on different things; I don't know if I can answer all of them. One was about when to use LIMIT and when not to — can you share my screen? "Sure, give me a moment." LIMIT returns a partial result of a query, so if that's okay for you, then by all means use it and you will speed up processing; but most of the time you want the complete result, so in those cases you don't use LIMIT in production. One very valid use case for LIMIT, though, is when you want to retrieve, for example, the latest comment of a user, or the most recent order, something like that. In that case your orders are sorted within the partition by a clustering column, by timestamp, and if you retrieve from that partition and say LIMIT 1, you retrieve only one row — the most recent one. That is a very valid use case for LIMIT in production.
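A minimal sketch of that pattern, assuming a hypothetical comments_by_user table clustered by a timestamp in descending order (table and column names are illustrative, not from the slides):

  -- Hypothetical table: one partition per user, newest comments first
  CREATE TABLE library.comments_by_user (
    user_id    UUID,
    created_at TIMESTAMP,
    comment    TEXT,
    PRIMARY KEY ((user_id), created_at)
  ) WITH CLUSTERING ORDER BY (created_at DESC);

  -- "Latest comment of a user": rows are already sorted newest-first,
  -- so LIMIT 1 returns just the most recent row from one partition
  SELECT comment, created_at
    FROM library.comments_by_user
   WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
   LIMIT 1;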
Also, while I'm on this slide: there was one more question, again about static columns — what if my venue has multiple countries? In that case — say DataStax Accelerate 2019 USA and DataStax Accelerate 2019 Europe — the partition key can no longer be just venue and year; the country would have to become part of the partition key, because "DataStax Accelerate 2019" no longer uniquely identifies a venue: there are two different venues in different countries. So you have to revise the design, but country is still a descriptor of the venue, not of the artifact — that's what I wanted you to understand.

In terms of other things: secondary indexes — rarely, but there are use cases for them. Even ALLOW FILTERING sometimes has valid use cases: when you know the partition key, you go directly to that partition, you know the partition is small, very few rows, and you can afford to filter through those rows. And we do support timestamps and all of that. You can find a lot more information about these concepts in our Cassandra Fundamentals course on datastax.com. But now I would like to continue.

Sorry, may I answer one very good question before we proceed? Santosh asks: "So Cassandra alone won't be ideal for OLAP or data analytics?" My answer is yes — Cassandra is not a multi-purpose database; Cassandra is OLTP-specific, online transaction processing. So how do you do analytics or OLAP queries over Cassandra? Cassandra alone is indeed not the best choice for OLAP, but there are multiple solutions, and one of the most popular is to use Apache Spark alongside Apache Cassandra. Spark works with the data stored in Cassandra and lets you do any kind of analytics you want, any kind of machine learning via Spark's machine-learning capabilities to train your models, and you can even use a structured query language over Cassandra, with joins and so on — that's Spark SQL. This works very well, and you can run analytics over data stored in Cassandra. Two things to keep in mind: first, you need Spark, of course, so that's one more thing to run; and second, in this case you obviously cannot count on millisecond-level answers — it will take longer to get your data, as it almost always requires a full cluster scan. Okay, so now it's your turn.

Yes, and there are more questions about LIMIT. LIMIT is never bad; use LIMIT if it works for your query. The question was whether using LIMIT to retrieve data from a large partition is a bad idea: it's not a bad idea — the bad idea there is to have a large partition. Just make sure a partition doesn't have millions of rows or a couple of gigabytes of data; a large partition is something you don't want to have.

Okay, so let's talk about data modeling itself. The review we just did gives us some insight into how data is stored, but data modeling is more than just that.
It's more than just schema design and understanding how those tables work. Data modeling is about collecting business and data requirements and documenting them somehow; identifying entities, relationships, objects and how they interact; defining access patterns, which we usually think of as queries but which can also be updates, searches, transactions, batch statements and so on; and finally organizing and structuring your data for a particular database like Cassandra and specifying the database schema. So far in this introduction we have mostly talked about the database schema, how to retrieve data from those tables, some optimizations and indexing techniques. All of that is very important because it affects data quality: it affects the completeness of the data you are actually storing — that everything needed is there — and it affects the consistency and accuracy of your database; you don't want one value in one table and another value in another table, and things like that. It also affects data access: efficiency and scalability are fairly straightforward, but also queryability — if you don't have a specific table to support a query, and you don't have a materialized view or a secondary index, you will not be able to retrieve that data at all. So it's very important.

In a nutshell, there are four objectives: you want to understand the data; you want to identify and understand the access patterns; you want to apply the query-first approach — this is how we commonly say tables should be designed, based on queries, so you need to understand that query-first approach; and you need to know about optimizations and how to implement all of that. Those four objectives make it easier to think of data modeling as four steps, but they also need to be documented: how are you going to understand the data, and once you understand it, how are you going to share it with somebody else? So we have four models that support those objectives: the conceptual data model, the application workflow model, the logical data model, and the physical data model. Those models have specific tools that can be used to design, represent, and visualize them: the entity-relationship diagram for conceptual data modeling, the application workflow diagram for the application workflow model, the Chebotko diagram for the logical and physical data models, and of course CQL, which lets us implement the database schema at the end. Besides these four objectives, models, and tools, there are two transitions, and those two transitions are usually the most difficult part: how you go from one model to another. We are going to briefly touch on some of those and see some examples, but this can be a large topic and you may want to take a course to learn more about it.

So, we're going to do this by example, and Alex, if there are any questions, please let me know. "Not yet — well, there were a couple of questions, but they have been answered already, so let's proceed; we are at the end of the first hour and we still have a lot to do, so let's go on." Yes. So here is the first example: sensor network, or IoT, data — smart-home type of data — where we have
sensors that are grouped together into a network. We collect the data from those sensors and need to store it in Cassandra — that's the OLTP part — and then later maybe we analyze it, and that part we do with Spark, which relates to the questions from before. You can find all the diagrams for this use case at the link shown; you don't need to go there right now, but you know where to find them.

For the conceptual data model we are using an entity-relationship diagram. It's independent of the database — it's used for relational databases as well, so it's technology independent — and I'm simply going to read it for you and explain a little what the elements mean; just follow along and see how you can read it yourself. The rectangles represent entity types. An entity type is a collection of entities: an entity is an object, like a specific sensor that I can physically hold in my hand, and the collection of similar sensors is called an entity type. So we have entity types network, sensor, and temperature, and we have relationships between them, drawn as diamonds. Now I'm going to read the diagram. A sensor network can have many sensors — that's what the "n" stands for, many — and, reading in the other direction, a sensor belongs to at most one network — that's what the "1" is for; the 1 and n are called cardinalities. Reading the other part: a sensor can take many temperature measurements, and each temperature measurement is taken by exactly one sensor. This already tells us quite a bit about how the data is organized and modeled.

Now, a network has attributes, or descriptors: name, description, region, and number of sensors. Name is underlined, meaning it's a key: it uniquely identifies a network. For a sensor we have id, location, and characteristics. Id is a unique key. Location, on the other hand, is a composite attribute consisting of latitude and longitude — that's something we can translate into Cassandra, and you will see how. Characteristics is drawn with a double line, meaning it's a multi-valued attribute: there can be multiple characteristics for a sensor, maybe the precision of the sensor, maybe the range of measurements it can take; when we translate it to Cassandra it's going to become a set, or rather a map, and maybe we could use user-defined types as well. A temperature has a timestamp and a value, but the timestamp is underlined with a dotted line, which means it's a partial key: the temperature entity type is a weak entity type, and "records" (sensor records temperature) is an identifying relationship. What that means, basically, is that temperature doesn't have a key of its own — its attributes do not form a key — so the key has to be formed from the timestamp plus the key of the strong entity type, which is the sensor id. One temperature measurement is identified by the sensor id and the timestamp together; that's what uniquely identifies one temperature value. Relationship types also have keys; they are not shown here explicitly, but they are easy to deduce.
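As a small illustrative sketch of how those two attribute kinds can land in CQL (this exact mapping is not on the slides; a map is one option for characteristics and a user-defined type is one option for the composite location):

  -- Hypothetical user-defined type for the composite "location" attribute
  CREATE TYPE sensor_data.location (
    latitude  DOUBLE,
    longitude DOUBLE
  );

  -- Multi-valued "characteristics" modeled as a map of name -> value
  CREATE TABLE sensor_data.sensors (
    id              UUID,
    location        FROZEN<location>,
    characteristics MAP<TEXT, TEXT>,
    PRIMARY KEY ((id))
  );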
Basically, this gives us an unambiguous picture of what kind of data we have and what kind of relationships we have. The next step in the methodology is to understand how that data will be used in an application, and this is done with the application workflow. It's a very simple notation: we have tasks — you can think of them as microservices or some other kind of computational process — and, because it's a data-driven application, each one is supported by a query or access pattern; in this case all of them are queries, but they can also be other types, like updates. We have the task "show all sensor networks", and the query it needs finds information about all networks, ordered by network name. From there, based on the workflow, we can go to the task "display heat map for a network", or to "display all sensors in a network", followed by "show raw temperature values for a sensor". Again, there are specific queries, and those queries are important because they will define what our tables look like; and because we have this diagram, it's organized, we can share it, we document it — it's a repeatable process.

Let's look at those queries, because we will use them to design tables. The second query: given a network and a date range, retrieve the hourly average temperatures. The third query: given a network, get all its sensors. The fourth query: given a sensor and a date range, retrieve all raw measurements. Why does this matter? "Given the sensor" is an equality search; a specific date can also be an equality search, but if you need a date range, it becomes an inequality — a range search — and then you will have to use a clustering key to represent the date.

So, based on the data we have in the conceptual data model and on the application workflow, we essentially replace those tasks with tables that support those queries, and this is what the Chebotko diagram shows. We have the networks table, which has the network name as the partition key and all the other columns corresponding to the attributes from the ER diagram. We have temperatures_by_network: network is the partition key, date is a clustering column in descending order, hour is another clustering column, and sensor is an ascending clustering column. There are two more tables, and it should be easy to read what they are: the table name, the columns — the characteristics column is a map — a partition key, network and sensor as partition and clustering keys, and a couple of regular columns. These tables are designed based on mapping rules and mapping patterns which we are not going to cover here — we don't have enough time — but you are seeing an example of the final result of mapping the conceptual model and application workflow to this logical data model.

The physical data model is the second transition, from logical to physical, and it does two things. It defines data types, and it defines whether you need, say, a secondary index or a materialized view — in this case we are not using any, but we do define data types for all columns; for example, characteristics is now defined as a map of text to text. And it applies some optimizations: in this case we use three optimizations
that are fairly common, so let's go back to the logical model and understand why those optimizations are even needed and how we can come up with them. First of all, when you look at each table you came up with to support a query, you look at a couple of things. First: how many partitions will the query retrieve? Our query 1 is "get all the networks", and this table has the name of the network as the partition key, so to retrieve all the networks we essentially retrieve all rows, all partitions, in this table — it's a table with single-row partitions. Whenever you see a situation where you retrieve many partitions per query, you try to optimize: you want to retrieve as few partitions as possible, preferably just one partition per query. So this is where the optimization comes from: what would the new partition key be, how do I put all the networks into the same partition so I can retrieve them all at once? We come up with this bucket — basically a dummy column that becomes the new partition key — and all the networks become separate rows in that one partition. We turned the single-row-partition table into a table with multi-row partitions; that's the first optimization. The second one is simple: we can take date and hour and combine them into a single timestamp column, so we don't need two columns.

Another, less trivial, type of analysis is for large partitions. In this case, if we look at this table: how large will the partition be, how many rows will it have? I need some assumptions about my data. Each partition is essentially identified by the network, so in one individual network, how many sensors will I have? Say 100 is my maximum — or average, but usually you look at the worst case, the maximum. If I have 100 sensors, how many temperature values will that network generate? Every hour I generate 100 rows; one day has 24 hours, so I get 2,400 rows per day in that one partition. And because my network is deployed somewhere and keeps collecting data — maybe for many years — the partition keeps growing, more and more rows, and at some point it becomes too large. So I want to limit the growth. How do I split the partition into smaller partitions? I either promote one of the existing columns into the partition key or I introduce a new one; in this case I introduced week. The week is now part of the partition key — a composite partition key consisting of network and week — so for every new week I start a new partition for that network, and that caps how big the partition can be: it's just seven days, and each day I'm collecting 2,400 rows.
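A sketch of that bucketing optimization, with the date-and-hour columns merged into one timestamp as described (date_hour and avg_temperature are my names for those columns, not necessarily the ones on the slide):

  -- Composite partition key (network, week) caps partition growth:
  -- each network/week pair starts a fresh partition
  CREATE TABLE sensor_data.temperatures_by_network (
    network         TEXT,
    week            DATE,
    date_hour       TIMESTAMP,
    sensor          TEXT,
    avg_temperature FLOAT,      -- assumed name for the hourly average value
    PRIMARY KEY ((network, week), date_hour, sensor)
  ) WITH CLUSTERING ORDER BY (date_hour DESC, sensor ASC);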
I want to step in here for a second with a real-life example. We had a very similar story with one of our clients — I won't name names — where for a similar task we didn't do bucketing and had only the sensor and timestamp in the data storage. In the beginning it was all very good, but as every sensor reports its state every few seconds, over months the partitions became too big. With this bucketing by week number, or by date, it's perfectly fine: your partitions will never get too big, because as soon as a week is over — boom — you have a new partition and you are good. It's a kind of automatic regulation of partition size, which is pretty cool.

Yes — and actually, for the temperatures_by_sensor table, if the date were not part of the partition key, then a sensor would collect measurements indefinitely and the partition would keep growing and growing; that's why the date is part of the partition key, but also because the query specifically wanted the date as an equality search. There are other types of analysis, which may be less common — for example duplication: we are duplicating data here, the latitude and longitude appear in this table but also in that table.

So, we have a hands-on for you, which essentially has these tables created in Cassandra, those queries expressed in CQL, and a sample data set. Do you have any questions before we proceed? "Not really — mostly general questions about other databases and guidelines for when a partition is getting too large." Okay. This is where you go: the Katacoda page, katacoda.com/datastax — slash 2021; somebody please post the exact link in the chat. This is where we have the three examples we plan to cover today, if we manage it time-wise. Let's look at the first one. If you don't want to do it yourself, that's fine — you can just follow the short demo I'm going to do — but you can also start your own instance if you want.

First thing: as we said in the beginning, all the schema lives in a keyspace, all the tables will be in some keyspace, so we define a keyspace named sensor_data. It says it already exists — did I run it previously? Are we sharing the same keyspace? Let me check. Okay. Now I'm going to create the tables that we defined on the diagram, and you can see that going from the diagram to the tables is quite easy: it's basically the same column names and data types, and the primary key definition contains the partition key and the clustering key. Let's load some data with a script and then select it to see how it looks. This is the networks table: we have all the networks in the same bucket, a dummy bucket with the identifier 'all' (it could be something else), and in this data set we have a forest network, a forest fire detection network, and a volcano monitoring network. The second table, temperatures_by_network, is more interesting because we introduced the week as part of the partition key — introducing something into the partition key also means the queries become a little more complex, because they may sometimes need to access multiple partitions. And sensors_by_network: this is how the sensors look, with that map holding accuracy and sensitivity as key-value pairs.
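Following the diagram as narrated, the remaining tables look roughly like this (types and clustering orders are my best reading of the discussion, so treat this as a sketch rather than the exact hands-on schema):

  -- Q1: all networks in one "dummy" bucket partition, ordered by name
  CREATE TABLE sensor_data.networks (
    bucket      TEXT,
    name        TEXT,
    description TEXT,
    region      TEXT,
    num_sensors INT,
    PRIMARY KEY ((bucket), name)
  );

  -- Q3: all sensors of a network; characteristics is the map
  -- (e.g. accuracy, sensitivity)
  CREATE TABLE sensor_data.sensors_by_network (
    network         TEXT,
    sensor          TEXT,
    latitude        DOUBLE,
    longitude       DOUBLE,
    characteristics MAP<TEXT, TEXT>,
    PRIMARY KEY ((network), sensor)
  );

  -- Q4: raw measurements of a sensor; date in the partition key
  -- keeps the partition from growing forever
  CREATE TABLE sensor_data.temperatures_by_sensor (
    sensor    TEXT,
    date      DATE,
    timestamp TIMESTAMP,
    value     FLOAT,
    PRIMARY KEY ((sensor, date), timestamp)
  ) WITH CLUSTERING ORDER BY (timestamp DESC);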
Then temperatures_by_sensor: given a particular sensor, we can find all the dates, timestamps, and values. Now, the first query is simple: we retrieve all the columns from the networks table where bucket = 'all', and we get the results; we're going to work with forest-net, which has three sensors. This query retrieves one partition — which also answers the question we had a moment ago. The next one is more interesting: we find the hourly average temperatures for every sensor in the network forest-net for a specific date range, which falls within one particular week. This is how the query looks: we specify the network and we specify the week — that's the partition key — and then we use a range search, inequality predicates, on the date and hour, because those are clustering columns and we can do that; we cannot use inequality on week or network, because those are partition key columns.

But what happens if the date range I choose does not fall within a single week but spans two weeks? That's exactly the second case here: one date range, but it falls within two weeks that start on different days. You can run two separate queries where you specify the two different week identifiers — the identifier is just the date of the day that starts the week — or you can combine them into one query, and in that one query you simply use the IN operator with week: the week has to be either this one or the other one, which is less work. This is an example where you can use IN. Sometimes we say "don't use IN" because it makes a multi-partition query, but this is a valid example where you would use it: we used bucketing, the week is the bucket, and now we're retrieving data from two buckets. It's a nice query.

The next query — get all the sensors in network forest-net — is quite easy; if you design your tables correctly, the queries are very straightforward. And the last one: get all the raw measurements for a specific date and a specific sensor. For this sensor we get this result — of course this is toy data, so it's a small result — but all those rows again come from one partition. Every time we retrieve data we want to access as few partitions as possible, and most of the time it's just one partition. Okay, and with that we've finished the first scenario.
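The two temperature queries just described, sketched in CQL using the column names from the earlier sketch (the network name and week dates are placeholder values in the spirit of the demo data):

  -- Date range within a single week: one partition
  SELECT date_hour, sensor, avg_temperature
    FROM sensor_data.temperatures_by_network
   WHERE network = 'forest-net'
     AND week = '2020-07-05'
     AND date_hour >= '2020-07-05'
     AND date_hour <  '2020-07-07';

  -- Date range spanning two weeks: IN on the week bucket hits exactly
  -- the two partitions that hold the data
  SELECT date_hour, sensor, avg_temperature
    FROM sensor_data.temperatures_by_network
   WHERE network = 'forest-net'
     AND week IN ('2020-06-28', '2020-07-05')
     AND date_hour >= '2020-07-02'
     AND date_hour <  '2020-07-07';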
We can start the next scenario, but that's going to be something Alex drives. Do you have any questions about this example? "Not exactly — there is one question, a pretty painful one. First: what is bucketing? And a better question: if there is a wide partition, is there any alternative other than a redesign — in short, if you found out that your partitions are already too big? In general, the best way to avoid big partitions is to think ahead and design your data model properly, but that's not always possible: access patterns change, the data changes, the business changes, and at some point you find yourself with partitions that are too big and you want to do something about it. Artem, what would you suggest in this case?"

Well — large partitions, not wide, that's a different term — a large partition is a partition with over 100,000 rows, or over 100 megabytes. Think about this: we are designing an OLTP database, we are going to serve queries, and the result probably goes to somebody, maybe an end user, maybe somebody internal, but most of the time we design a user-facing application. Are we really going to retrieve 100 megabytes of data, or 100,000 rows, and hand it to the user? That's not an OLTP kind of workload. So you think about those queries in advance, and if you need that many rows it's probably not an OLTP application. If you have already got into that situation — there was one worst case you didn't consider: you're storing all user comments in one partition, and some user made it their goal to post 100,000 comments on the website and achieved it, and now you have a large partition — then you basically have to split that partition for that user. Splitting a partition means that where you had one partition you now create several, and those smaller partitions are sometimes called buckets; that's what bucketing is. How do you split? You make the partition key more selective: it's not just the user id anymore, but maybe user and video id, or user and date — or month, or week; the time component is used frequently for bucketing — and in some cases it can even be an artificial value: one, two, three, four, five, or a UUID. But if you use an artificial value, you will need a separate table that lists those values — this user has these buckets — so that you can later say: for this user I have these hundred buckets and I'm going to retrieve just the latest five, or something like that. But yes, you have no choice: you have to split partitions, and that's what bucketing is.

Perfect. So I think we are good to go to the next step; the questions are answered — many thanks to our helpers on YouTube and Discord — and we can go on. Okay, the second example we work on today is investment portfolio data. First, about the domain: I guess nowadays everyone has heard the stories of GameStop and other stocks that went very high and then very low — doesn't matter. We are going to design a data model for an investment portfolio, the set of stocks and other instruments a person can buy, sell, and own; essentially, we are designing a data model for a broker. Here is an example of a typical view you might see with such a broker. We operate with four main concepts. A user: very simple, someone who opens an account, and one user can open multiple accounts. An account has some trades: any operation where you buy or sell something is a trade, it can be either a buy or a sell, and every trade involves one and only one instrument. An instrument can be a stock, a mutual fund, or an ETF share
— an exchange-traded fund share. In short: a user can open an account, and using this account he or she can buy or sell instruments, and an instrument can be a stock, a mutual fund, or an ETF. The typical attributes for a user are a username and a name; we won't cover anything more for this example, that should be enough. A user opens an account; an account has an account id and a cash balance, because you may have some uninvested cash on the account, and it may also have an investment value and a total value — those are ephemeral and tend to change very quickly, so we will not store them in Cassandra; they are mentioned just for clarity, meaning the overall value of my account with all the instruments bought plus the cash balance left. A trade is a buy or sell operation, and every trade consists of an id, shares, price, amount, date, and the instrument. Something I own as a result of a buy is a position: a symbol and the amount of it that I hold. As said, it can be a stock, a mutual fund, or an ETF; I can hold multiple instruments in an account, and I can have multiple accounts.

Then we have to cover the workflows, the ways we are going to use this data. First: show the accounts of a user — I'm a user, I log in, I need to see the list of my accounts if I have several; so, find information about all investment accounts of a user. Very clear. Second: when I select one of the accounts by clicking on it, we display all positions, everything I own in this account — all the instruments I've bought and not yet sold, all my assets. And the last one: display the account trade history — as a customer I want to review my recent operations. The first two cases are very simple: I doubt there are users with hundreds or thousands of accounts — usually you have one, maybe two or three, not dozens. Displaying positions in an account is also clear: it can be a longer or a shorter list — for those who prefer individual stocks it can be a long list; if you use ETFs it may be 5 or 10 depending on your strategy — but it's bounded, there aren't billions of them.

Displaying the account trade history is a longer and more complicated use case, because I want to see the history under different scenarios. The first scenario: find all trades for an account, ordered by trade date, most recent first — simple, I just want to see all my trades, most recent first. But that's not the only one. We also need to find all trades for an account and a date range, ordered by trade date — I want to filter operations for a specific date range, for example show me all my operations for December 2020. Then: find all trades for an account, date range, and transaction type (buys or sells), ordered by trade date — the same, but now with a filter on the operation type. Then: find all trades for an account, date range, transaction type, and instrument symbol — now I want to see all operations regarding one particular stock or one particular
And finally scenario 3.5: find all trades for an account, date range and instrument symbol, ordered by trade date. So there are quite a lot of data access patterns, but they can be grouped; let's take a look at what we can do with that. For the logical data model we start thinking about the following entities: accounts, trades, positions, and here is how it looks as a Chebotko diagram. For the first use case, accounts_by_user seems very simple: username as the partition key really makes sense, because my partitions will most probably never get too big; I doubt a customer will open more than 100,000 accounts, and even if he or she tried, we could put an application-level limit on that, for example no more than 100 accounts. Then the account number is a clustering column, the cash balance is there (its exact data type is more a physical data model concern; at the logical level we don't think about that), and a name, just so a person can label the account. For the positions we have a positions_by_account table: the account as the partition key, then the symbol, which particular stock or ETF is held, and the quantity, how many of them we have. And finally, for the trade history, we have four tables: trades by account and date, trades by account, type and date, trades by account, symbol, type and date, and so on. They are all very similar; all those tables exist just to show the history of operations very, very quickly. And that is again how denormalization works: when we store data we are going to put it into multiple tables, which makes the write side a little harder, but that is not a big deal because we execute those writes asynchronously anyway. In return, when a customer opens an account, we can show the trade history immediately, dispatching the response right away, and the rows are already pre-sorted. To understand exactly how they are pre-sorted you need the write path and the read path of Cassandra; I won't go into that now, it is closer to the operations part, but it is good for everyone using Cassandra to know: data is sorted at write time according to the clustering column sort order, so at read time no time is wasted sorting.
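As a rough CQL sketch of the two simple tables described above (the concrete column types here are my assumptions, since the logical model deliberately leaves types to the physical step):

CREATE TABLE accounts_by_user (
  username       TEXT,
  account_number TEXT,
  name           TEXT,
  cash_balance   DECIMAL,
  PRIMARY KEY ((username), account_number)
);

CREATE TABLE positions_by_account (
  account  TEXT,
  symbol   TEXT,
  quantity INT,
  PRIMARY KEY ((account), symbol)
);

With this shape, both "accounts of a user" and "positions in an account" read exactly one partition each.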
As a next step, don't forget the flow: from entities and workflows we went to the logical data model, and now to the physical data model, using the same Chebotko diagram (it is a slightly strange feeling to explain how to use the Chebotko diagram with Artem Chebotko himself on the workshop). So now we start thinking at the physical data model level. What does that mean? The physical level of the data modeling methodology makes us think in terms of the particular data types we are going to use, as you can see we now use real types, and the second thing is that moving from logical to physical lets us apply improvements and optimizations. For example, every operation in the history needs the date and time when it happened, and also a trade id, and we need them both: every trade must have a unique id, but we also need to know when it happened, down to microseconds. So we apply a physical optimization: we keep only one column, trade_id, of the TIMEUUID data type. TimeUUID is a great thing: it is a universally unique id, that long string you have seen before, but it has the time integrated within it. You can put a time into a TimeUUID, you can extract the time back out of it, and you can sort by it, and it will be sorted by time; so it is an id with the time built in, a very convenient tool. For the account I don't need that: the account number is just text and I don't need to sort accounts by time, it makes no sense. But for trades I definitely must be able to sort by time, so here I have account (text) as the partition key, to group trades by account, and trade_id as a TimeUUID clustering column. Together they form the primary key, which gives uniqueness, and the TimeUUID also gives me sortability by time, which is essential for my reporting, because I can jump in and see all my operations sorted by time, already pre-sorted. Mostly the same idea goes for the other reporting tables: the partition key is always the account, but the clustering columns differ, so for the table by type and date the clustering columns are the type and the trade id (which also carries the time), for the table by symbol, type and date the clustering columns are the symbol, type and trade id, and so on. That is the general idea. Here we have used all the queries we identified before, all our workflow units: we want to show the investment accounts of a user, display the positions in an account, and display the account trade history, Q1, Q2 and Q3, and on this slide you see Q1, the table to display accounts, Q2 to display the positions by account, and Q3 and its variants to display the history of operations. Speaking of the hands-on for the investment portfolio, it is going to be pretty easy.
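Here is my reading of the main history table as CQL (the column names beyond account and trade_id are assumptions based on the trade attributes mentioned earlier); the descending clustering order is what gives "most recent first" for free, and the trade time can be read back out of the TimeUUID:

CREATE TABLE trades_by_a_d (
  account  TEXT,
  trade_id TIMEUUID,
  type     TEXT,
  symbol   TEXT,
  shares   INT,
  price    DECIMAL,
  amount   DECIMAL,
  PRIMARY KEY ((account), trade_id)
) WITH CLUSTERING ORDER BY (trade_id DESC);

-- Q3.1: all trades for an account, most recent first, with the trade date
-- extracted from the TimeUUID:
SELECT toTimestamp(trade_id) AS trade_date, type, symbol, shares, price, amount
FROM trades_by_a_d
WHERE account = 'joe001';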
There was another question, Alex, that is very interesting, from Chris Kapoor: what happens if my access pattern, my query, is to find all trades for a specific instrument, a specific symbol, across all accounts? If we go back to the diagram for a second: we just have a symbol and we need to get all the trades, what do we do in that case? This is clearly for analysis; you are retrieving that kind of data because you are going to retrieve a lot, there will be thousands or millions of trades per day, and we are not even limiting the date. So how do we approach that query? The simplest idea would be: let's create a new table where we organize trades by symbol, with symbol as the partition key and then the account or the trade id as clustering columns, something like that. But the point is that in that partition you will have a huge number of rows, and then you decide maybe you can split it, and so on. That is the straightforward idea, but it is not the right approach, because you are trying to retrieve a lot of data for analysis and we are designing an OLTP database here. The right approach is to have two data centers: one responsible for the OLTP transactions, the other responsible for analytics. Why two data centers? Because we do not want the analytical query to affect the performance of our OLTP transactions. So we replicate the data, exactly the same tables, into both data centers; when we define the keyspace we specify the replication for the operational data center and for the analytical data center. Then, in the analytical data center we will also have Spark, and to retrieve all trades for a specific symbol we can do one of two things. One, simply take one of those four tables whose names start with trades (they all have the same data, just organized differently) and use Spark to read and filter all the data from that table by symbol. The other approach is to define a secondary index, and this is a good use case for a secondary index, on the symbol column of, for example, the first table, trades_by_a_d, and use that secondary index only in the analytical data center. When you retrieve the data with Spark, Spark will use that index automatically for you, because the Spark Cassandra connector knows about the schema, the indexes and so on. I hope that answered your question; maybe it was longer than you expected, but it was a loaded question. Yeah, it takes some time, but it is a very good question, and good questions deserve good answers, so thank you so much.
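In CQL that setup would look roughly like this; the data center names and replication factors are assumptions, the point is only that the analytical data center gets its own replicas and that the index is queried from there via Spark:

-- 'dc_oltp' serves the user-facing application, 'dc_analytics' serves Spark
-- jobs and heavy scans:
CREATE KEYSPACE investment_portfolio
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'dc_oltp': 3,
  'dc_analytics': 3
};

-- Secondary index on the symbol column of one of the trades tables,
-- used only from the analytical data center:
CREATE INDEX trades_symbol_idx ON trades_by_a_d (symbol);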
Now, a suggestion: we have only 20 minutes left, and I would suggest skipping the investment portfolio hands-on on Katacoda for now, for a very simple reason: we have a much more important example still to do. Please practice the investment portfolio on your own using the link provided, and if you have any questions, find us at bit.ly/cassandra-workshop, it is in the description of the video, and we will answer them there. I think it makes sense to switch to the next one, what do you think? Yes, that is going to be awesome. The only thing I want to mention while you are sharing my slides: the second hands-on link we are using here is for workshops specifically, which allows us to have 100 concurrent users; the first link, the one that refers to the specific course, is the link you can use after the workshop, since most likely you will not all use it at the same time and there will be enough infrastructure to cover all of you. So the next example is the order management use case. Essentially we are going to track orders and the statuses of those orders. Why is that even interesting? You have statuses like ordered, shipped, delivered, but you also have cancelled and a bunch of other statuses in between (we will keep it simple, we do not need dozens of statuses). The whole point is that you cannot, for example, cancel an order that has already been delivered; at that point it is a return rather than a cancel. That means you have to do conditional updates, which are called compare-and-set operations; in Cassandra they are called lightweight transactions, and we are going to see an example of how and when to use them. This is the use case we are dealing with, and the diagram is actually a little larger than what we need for this particular example, because we basically have multiple systems here. A user may have multiple shopping carts, and those shopping carts contain items. A user has payment methods; there can be many of them, but each payment method belongs to one specific user, so a payment method is identified by both the user and some kind of summary (for simplicity we just use summary and details). A user also has addresses, again possibly several, and each address belongs to exactly one user and is identified by its summary and the id of the user. There is also a delivery option. And then how does an order happen? It happens during the checkout process, and that is a more complex relationship type, with multiple entity types involved: the shopping cart is used to place the order during checkout, a payment method has to be specified, and an address has to be specified, actually two of them, one for shipping and one for billing. They can be the same or different, which is why you see two lines, labelled shipping and billing; those labels are called roles, the role of an address in this relationship type, and they simply make it easier to understand what the two addresses are for. Then the delivery option, and finally the order itself. An order has an id, a status, a timestamp, and the attribute types marked with a dash: subtotal, shipping and handling, tax, and total. The dash means a derived attribute: the subtotal is computed from the prices of the items and their quantities, shipping and handling depend on the address and the weight, tax depends on the location of the warehouse and so on, and the total adds them up. Derived attributes on the conceptual data model are important because sometimes you decide to compute them dynamically in your application and never store them in your database, and sometimes you decide to store them; that depends on how the data is used. Then we have the order status history, which of course does not exist without an order: each history entry has a timestamp and a status, and is identified by the timestamp of the entry and the id of the order. So in quick terms that is the diagram, and you can read more via the link we provided at the beginning of this use case. I also wanted to mention that we are not going to use the shopping cart here, but there is another example that models exactly that part of the diagram. And as you can imagine, the payment system and the shipping or delivery system can be separate systems, maybe even external ones, with separate databases or separate teams responsible for them. The application workflow we will keep simpler than the diagram: we need to show the orders placed by a user; we need to display information about one specific order, with its items sorted by name; we need to show the orders placed by a user that contain a specific item, so for a given user, which orders placed over a long period of time contain, for example, a cell phone (in our case there will be different types of products), which is interesting even just to see when I ordered this particular item, whether I needed to reorder it, and what my ordering frequency is; then show an order's status history; and then cancel an order. This cancel-an-order case is really cool because it is different from anything we have seen before: it is an update access pattern, and you will see very shortly why it is interesting.
To support those access patterns we define these tables on the diagram. The first one is orders_by_user: user_id is the partition key, the order timestamp is there for ordering, and the order id is for uniqueness; order timestamp and order id together form a composite clustering key. Orders_by_id is quite straightforward: we have the order id, and each row is a separate item, so it has the item name (for sorting, that was the requirement), the item id, and then the item description, price, quantity and so on. But then we have many static columns, and static columns tell us that in this table we are describing the order, not an individual item: the order status, the order timestamp, the subtotal, shipping, tax, total. Those are the derived attribute types we decided to store, because once they are computed they will not change, we never need to update those values later; this is like a history of the order. And you can see that we also store the billing summary and details, the shipping information, the payment method and the delivery option; they all describe the order, not a specific item within it. Then orders_by_user_item: we use the user id and the item id together as the partition key, with the order timestamp and the order id as clustering columns. So for a given user and a given item we get a partition where each row carries the timestamp of the order, for sorting, and the order id, which tells us which order that specific user placed for that item. Notice that this table has no regular columns at all; every column is part of the primary key, either a partition key or a clustering key column. And then the interesting part, the most interesting part of this example, I think: the update. We need to cancel an order, but the order status has to be updated to cancelled in three tables: there is an order status here in orders_by_user, one here in order_status_history_by_id, and one here in orders_by_id. How do we do that, and how do we do it correctly? We will see. In terms of physical optimizations there are none; we do not need to optimize everything. If you do the careful analysis (we do not have time to discuss each table): a user is not going to place an order with a hundred thousand items, a user is also unlikely to place a hundred thousand orders in a lifetime, the number of statuses for an order is limited, and the number of times one specific user orders the same item is also limited, certainly less than that user's total number of orders.
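Here is my reading of orders_by_id as CQL (the exact column names and types are assumptions based on the description); the STATIC columns are stored once per partition, that is once per order, while the regular columns repeat for every item row:

CREATE TABLE orders_by_id (
  order_id         UUID,
  item_name        TEXT,
  item_id          TEXT,
  item_description TEXT,
  item_price       DECIMAL,
  item_quantity    INT,
  order_status     TEXT      STATIC,
  order_timestamp  TIMESTAMP STATIC,
  order_subtotal   DECIMAL   STATIC,
  order_shipping   DECIMAL   STATIC,
  order_tax        DECIMAL   STATIC,
  order_total      DECIMAL   STATIC,
  payment_summary  TEXT      STATIC,
  delivery_option  TEXT      STATIC,
  PRIMARY KEY ((order_id), item_name, item_id)
);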
So this update is the most interesting part, and I am going to show you quickly how it is done. I create my tables, and they look exactly like the tables on the diagram, although the diagram is much easier to read and explain to somebody than the CQL code itself, and I load some data. In orders_by_user you can see the different statuses: delivered, pending. Here is something interesting, by the way: this is orders_by_id, and there are many columns, so instead of displaying those columns horizontally I display them vertically using EXPAND ON (and I will turn it off afterwards). This is essentially one row, row 12 for example, and these are its columns: the row id, the item name (hazelnuts), the item id, the description, the price, and all those static columns that carry information about the order. You can use this trick to display rows of a table that has many columns. In orders_by_user_item you can see, for example, that Joe ordered the same item three times, in three different orders. And the last one is the history table: it has all the information, and it never updates a status, it simply inserts a new status with a new timestamp, whereas the other two tables keep only the current status, so in those we update the status column. Given the time limit, let me jump to the most interesting part, the update itself. We are going to cancel an order with a specific id, placed by a specific user at a specific date and time (the timestamp of the order), by updating its status from pending to cancelled. It is important that we are only allowed to update it from pending to cancelled; we cannot update it from delivered to cancelled, or from shipped to cancelled, because cancelling at that point is not possible, you have to do a return instead. To enforce that we use a lightweight transaction: change the order status to cancelled, where this is the id of the order, if the order status is currently pending. Why is that important? Because I decided to cancel my order while the shipping department decided to ship it, and we are doing it at the same time: concurrent access to the same piece of data. Without the condition, I think I cancelled it, the shipping department did not know, and they overwrite my cancelled with shipped; now we have a problem, the state of the database does not reflect what really happened. With the condition, only one of us succeeds: if the shipping department updates it to shipped first, I will no longer be able to cancel it, because the order is being shipped; and if I update it from pending to cancelled first, the shipping department will not be able to ship it, because it is no longer pending. This is done with a simple UPDATE with an IF clause. So what happens when I run it? It says the transaction was applied, because the status was pending, and now it is cancelled. Note what happened here: I chose orders_by_id, the main table, as the source-of-truth table. I am not updating all three tables at the same time; I update only this one table, and only after I have succeeded with the source-of-truth table do I update the other two, with a regular update and a regular insert.
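The conditional update itself looks roughly like this (the order id is a made-up example value, and the column names follow my sketch above):

UPDATE orders_by_id
SET   order_status = 'cancelled'
WHERE order_id = 0a1b2c3d-4e5f-6789-abcd-ef0123456789
IF    order_status = 'pending';

-- cqlsh reports an [applied] column: true if the status was still 'pending',
-- false (together with the current value) if somebody changed it first.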
There are no lightweight transactions in those two; they are just regular statements, and afterwards I can check that all three tables were updated. Now, if I try to run the update again, again going first to the single source-of-truth table, it fails: I cannot change the status because it is no longer pending, it is already cancelled. In that case my application knows that it should not run the other two statements either. Now I anticipate a question: why not use a batch statement here? We are updating three tables, and we could do it inside a BEGIN BATCH ... APPLY BATCH type of CQL statement (if you have not heard about batches, there is more information in our Cassandra Fundamentals course). I cannot, and do not want to, do it with a batch, for two reasons. I cannot put all three statements into one batch, because the batch would span three different partitions in three different tables, and a lightweight transaction within a batch can only be used in a single-partition batch; so that is out. Then, how about using a batch just for the other two statements, the update and the insert? It is possible, but I usually prefer not to use batches for multi-partition updates and inserts, because unlike single-partition batches they are not isolated, and they are also much less efficient. A single-partition batch can actually be more efficient than separate statements, but in our case each statement updates a different partition, so I just run them as two separate statements, and if there is any problem with one of them I simply retry; there is no need for a batch here. There are valid use cases for batches, though; if you are interested, you will find them in the additional resources Alex is going to describe soon.
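For completeness, the two follow-up writes would look something like this (the table and column names are assumptions consistent with the sketches above); they are deliberately plain statements rather than a batch, and the application simply retries if one of them fails:

UPDATE orders_by_user
SET   order_status = 'cancelled'
WHERE user_id = 'joe'
  AND order_timestamp = '2021-02-01 10:15:00+0000'
  AND order_id = 0a1b2c3d-4e5f-6789-abcd-ef0123456789;

INSERT INTO order_status_history_by_id (order_id, status_timestamp, order_status)
VALUES (0a1b2c3d-4e5f-6789-abcd-ef0123456789, toTimestamp(now()), 'cancelled');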
Okay, do we have any questions? There is one left, I would say. First of all, people are excited about the EXPAND ON statement. As for the question: does it benefit the I/O cost to add LIMIT 1 to the query, even if we already filter on a specific order id? It is on YouTube. It does benefit. In a multi-row partition the clustering key that uniquely identifies a row is effectively used as an index, you can think of it that way, and it helps also because internally a partition will not necessarily be stored in one place; it may sit in different locations, consecutive but in different sectors or something like that. So yes, using LIMIT is going to help. I saw a related question: if you have a large partition and you only retrieve a subset of rows using an inequality search on the clustering key, yes, that may also cut costs. And with those sliced partitions, keep in mind that the number of rows is not everything; the size of each row is important as well. With Cassandra I have seen tests with over a million rows in a partition that still did fine for time series, where the individual values were not big. And yes, EXPAND ON is not a new feature, and it is Apache Cassandra, not DataStax Enterprise specific; correct. A couple more questions, thanks David: how do we pull OLTP data using extract-transform-load tools like Informatica? Well, I am not familiar with Informatica specifically, but the first step of ETL is extract, so you need some type of connector: is there a connector or instrumentation in Informatica that lets you connect to Cassandra, something like the Java driver or an ODBC/JDBC driver, so you can pull the data and then do the transformation within Informatica? If not, and it is a large amount of data, you can still use Spark: extract the data from Cassandra, do the transformation with Spark, and then either use Informatica-specific tools to load that data, or just use a bulk loader to unload the data from Cassandra into files and do the transformations with some other tool, maybe even a custom script. There are a lot of possibilities; it is definitely achievable, some ways are more complex and some less. One of the good tools for extracting data, DSBulk, used to be available only for DataStax Enterprise, but now it also works with Apache Cassandra, so you can use it to get your data out and then process it any way you want.
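As an illustration of that last option, a DSBulk unload to CSV files looks roughly like this (the keyspace, table and output directory are placeholder names):

dsbulk unload -k investment_portfolio -t trades_by_a_d -url ./trades_export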
The next question: is a materialized view still not recommended? First I will give my answer: materialized views are considered experimental, but you can turn them on in Cassandra version 4 and use them. It is a bit of dangerous ground; let's say I know some success stories with materialized views, and I know some not-so-successful stories, so it is definitely something to be careful about. What is the biggest problem, as far as I know, at this moment? When you store data with replication factor three and one of your replicas does not get the write for any reason, it does not matter what happened, there are a lot of tools to protect you from inconsistency: read repair, hinted handoff, scheduled repairs and so on. So even if you run into inconsistency, there are simple ways to find it and to recover. But between a base table and a materialized view built on it (you build the materialized view and Cassandra materializes it for you), there is no way to recover consistency other than dropping the materialized view and creating it again. Maybe I am wrong, at least that was the state some time ago; Artem, correct me, what do you think of materialized views? Well, like you said, they are experimental and there are limitations, and I am actually displaying those limitations here; you can find all this information on datastax.com, there is a Cassandra Fundamentals course there. Let me actually state the advantage first: you can always create a separate table instead of a materialized view, but then you have to manage it in your application on your own, you duplicate data, you denormalize, and you have to insert the same data into two tables, like we did with the trades tables in the investment example. We had many trades tables; could we have used a materialized view there? In that specific case, no: a materialized view has the limitation that you can add only one column to the primary key, so it did not work for that use case, it does not apply. But in cases where it does apply, for example I have a users_by_id table and I want a users_by_name table, like in this example, what is the danger, what is the limitation? The danger is that rarely, but it can happen, the materialized view becomes out of sync with the base table, and Cassandra does not give you any tool to check this or to fix it, so you have to fix it manually or in your application. Is that a problem or not? Well, you either choose to maintain your own separate table, and you do that in your application, or you have to periodically make sure that your materialized view and your base table are synchronized, that they have consistently the same data. That can be done, for example with Spark if it is available to you; it is easy, just retrieve the data from the table and from the view and compare. We used to have examples of how to do exactly that in our analytics course, maybe they are still there. Another suggestion: if you are not using consistency level ONE or LOCAL_ONE, but LOCAL_QUORUM or higher, then most likely you will never even hit the situation where they become out of sync. So I am leaning towards: I would not be afraid to use materialized views, that is my personal opinion, and DataStax Enterprise actually uses materialized views in DataStax Graph. I would not be afraid, because I know the limitations and I know what they can do for me. On the other hand, if I already have a bunch of tables that I have to maintain anyway, and materialized views are not applicable to some of them, then maybe it is not worth having some materialized views and some tables; I would just choose tables and have the same procedure for all of my data duplication.
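For the case where a view does apply, the users_by_id / users_by_name example would look roughly like this (the column set is assumed); the view adds name, a single non-key column, to the primary key, which is exactly the one-column limitation mentioned above:

CREATE TABLE users_by_id (
  id    UUID PRIMARY KEY,
  name  TEXT,
  email TEXT
);

CREATE MATERIALIZED VIEW users_by_name AS
  SELECT * FROM users_by_id
  WHERE name IS NOT NULL AND id IS NOT NULL
  PRIMARY KEY ((name), id);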
Okay, I am done, thank you. That is an incredible answer, actually, that is amazing. You know, it is always a big pleasure to do workshops like this one with you, and we definitely have to do it more often, on topics our attendees ask for or on your own ideas; that would be incredible. We are seven minutes over time already and we still have one thing to do, a very, very important one, so I am switching back to my screen. A very important place to know is datastax.com/dev: learn how to succeed with Apache Cassandra. Many great people at DataStax, and not only at DataStax, are working on the examples and educational materials there on how to use Cassandra. Cassandra is great and powerful, but it is not the easiest database to use, so you have to learn some things to be successful; that is fine, because we are getting well paid for it: big data brings some problems, but big data brings money. As a homework assignment, we suggest that today or tomorrow, preferably today, you use the link provided, datastax.com/learn, data modeling by example. We have many examples, and what we want you to cover is the messaging data example, time series data modeling (it is very close to sensor data modeling but expanded a little wider), and shopping cart data modeling, which can also be very interesting; but in general pick the topic you like the most, because that is what matters. And the one we skipped today to stay on time, the investment portfolio, you can walk through on your own. Next point: we run Cassandra-related workshops weekly, and some more events, so at datastax.com/workshops you can find more of our events, not only this one, with more coming. By the way, if you like them, you have this gentleman to the left of me, Artem, and directly under him the subscribe button and the like button; now it is your job to push them. Then, giveaways: as a thank-you for joining us live, we would like to offer everyone, not only a top three, a choice of giveaways. Option one: you can choose a DataStax-sponsored certification. Normally one certification exam costs almost 150 dollars; as an attendee of this workshop you can get a voucher valid for three months with two attempts included. If you pass, for example, the developer certification on the first attempt, you can still use the voucher for your second attempt at the administrator certification, so you can become an Apache Cassandra certified developer and certified administrator with one single voucher, which is a very solid thing to do, and we do recommend it. Notice that if you are new to Cassandra, this workshop alone may not be enough, because the exams are not so easy; we have a specific certification preparation workshop, and in general the most recommended way is to take the proper courses at academy.datastax.com. One more certification is coming soon, Apache Cassandra operations with Kubernetes, but it is not released yet; I hope we release it soon. Do you know any dates? I do not know for sure. Okay, so you cannot comment on that, sorry. We are working on it and we do not want to set a fixed date, because we prefer quality: a particular date will not give you much, but a good, balanced exam will give you a lot. The second option, if you have used a voucher already, if you are certified, or if you simply do not need certification: you can get 300 dollars of Astra credit. Astra is Cassandra as a service, managed by DataStax: all the headache of running Cassandra and keeping it flying is on us, and you just use it, without any need to handle servers or Cassandra nodes, to run nodetool commands and so on. So you can get the 300-dollar Astra credit and use it with DataStax Astra, Cassandra as a service. Thank you, we are done for today; it was a great day, thank you for so many great questions and thank you for being with us. We have more workshops coming, and I think we are done for today.
Yep, thank you so much. I do see a few more questions about time series and real-time trades. Cassandra is a great database for time series and is used a lot for them, and we have a data modeling use case specifically for time series, so do check it out. Sergio, I believe you had a lot of questions comparing Cassandra to a database that was specifically designed for time series: the only advantage the other database may have is specific operators for time series, such as automatically computed aggregates and things like that, but we are kind of doing that in the example, where we keep the data at different precisions. With respect to using Kafka, please do check how we use Kafka with Cassandra; we also have Pulsar, which was announced recently, we have a connector for Pulsar, and Pulsar has many advantages over Kafka. I hope this helps, and I am very happy to be part of this workshop and hope to do it again soon. Yep, I am sending our LinkedIn links to the YouTube chat, so feel free to add us: we spent two and a half hours together working on complex projects, so we are almost colleagues now and you can definitely add us on LinkedIn. I see a question that just jumped out: many attendees have very little or no knowledge of Cassandra, and for a workshop labelled advanced it would be nice to refer those attendees to other workshops. Yes, this is indeed an advanced-level workshop, and if you felt uncomfortable with this level, that is understandable. We have a lot of introduction-to-Cassandra workshops recorded and available on this channel; I will ask one of our colleagues to drop a link to our workshops on GitHub, or better, take our full eight-part course, and then we will be able to skip the basic questions and cover the actual workshop topic more in depth. That is a good suggestion, thank you for the feedback, we will find a way to do it. Great information, one workshop every week, keep up the good work: yes, we do, yes, we are working, thank you Tom, thank you everyone. We are done for today; it was an incredible time together with you, thank you and see you later. And now, at the end, our favourite music for you, to celebrate the end of the workshop, your vouchers, and the new knowledge of how to handle all of this. A good moment to say goodbye. Ciao, thank you, thank you so much, bye bye.
Info
Channel: DataStax Developers
Views: 1,962
Rating: 4.9523811 out of 5
Id: u6pKIrfJgkU
Length: 127min 48sec (7668 seconds)
Published: Wed Feb 10 2021