DS220.21 Read Techniques | Data Modeling with Apache Cassandra

Video Statistics and Information

Captions
In this video we'll be taking a look at some alternate ways to perform reads. Although the data modeling methodology we have shown stresses making sure that your tables are modeled so that your known queries work, there are times when it makes more sense to just make use of an index, or, if certain requirements or queries change, it may be possible to enable those changes without having to completely overhaul your data model.

The first way we can perform reads is through the use of a secondary index. This is something that needs to be manually defined by the user, but it allows DataStax Enterprise to create and maintain an index that can then be used to more quickly and efficiently retrieve data, especially for columns that are not part of the primary key. As this is a separate data structure, it does not have any effect on the structure of the table. There are some limitations to using secondary indexes, however: they can't be created for counter columns or for columns declared as static.

As mentioned, the secondary index is a separate data structure that exists alongside the actual data. Each node has its own local index, which collectively makes up a particular secondary index and contains index entries for the partitions stored locally on that node. Since the secondary index is divided up into local indexes, a query generally needs to go to every node in order to check each local index, making sure that the entire secondary index is consulted. This tends to be a relatively expensive query in terms of efficiency. The most efficient use case for a secondary index is when querying for rows inside a specific partition, meaning that the partition key is provided in the query. This allows direct access to the node that has the data, and only that node's local index needs to be checked to retrieve the appropriate rows.

As far as how it works: the client sends a query to a coordinator node. If the query involves a column that has a secondary index and no partition key is provided, then the coordinator has to send the request to all nodes. When each node receives the request, it searches its local index to find the required partitions and rows and returns the results to the coordinator. Once the coordinator has received a response from all nodes, it combines the results and returns them to the client. Now, when we mention all nodes here, that's just the general idea when there is no replication. When replication is being used, you don't need all the nodes to retrieve the entire result set, so only a subset of the nodes is contacted, whatever number is needed to cover the entire token range.

Here we have some guidelines on when to use and when not to use secondary indexes. Small data sets are fine, or cases where you can provide a partition key to make secondary index usage more efficient. Secondary indexes don't work with counter columns, and they don't make sense for columns that are frequently updated or deleted, since that means additional work maintaining the secondary index in the background. A more complex topic is deciding whether to use a secondary index depending on the cardinality of the column. This refers to the number of unique values that can be found for the column, with a small number of unique values being low cardinality and a large number of unique values being high cardinality. With low cardinality, having fewer possible values for the column generally means that more rows are returned, which makes it worth having to check all of the nodes to retrieve the query results. In the case of high cardinality, having a large number of unique values makes it more probable that there are fewer rows with any given value. With fewer rows, you may be checking all the nodes for a handful of rows that could have been found on just one or two, which makes it pretty inefficient, and there are other ways that this type of query could be done instead. A simple secondary index example is sketched below.
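To make this concrete, here is a rough CQL sketch of the idea, assuming a hypothetical videos_by_tag table with a category column that is not part of the primary key; the table, column, and index names are illustrative and not taken from the video.

    -- Hypothetical table: tag is the partition key, so category alone
    -- cannot normally be used as a query predicate.
    CREATE TABLE videos_by_tag (
        tag text,
        added_date timestamp,
        video_id uuid,
        category text,
        title text,
        PRIMARY KEY ((tag), added_date, video_id)
    );

    -- Secondary index on a non-primary-key column.
    CREATE INDEX videos_category_idx ON videos_by_tag (category);

    -- Less efficient: no partition key, so the coordinator must consult
    -- enough nodes to cover the whole token range.
    SELECT title FROM videos_by_tag WHERE category = 'tutorial';

    -- More efficient: the partition key narrows the query to the replicas
    -- for that partition, and only their local indexes are checked.
    SELECT title FROM videos_by_tag
    WHERE tag = 'cassandra' AND category = 'tutorial';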
This leads us to materialized views, a different type of index that is more suitable for high-cardinality data. Basically, a materialized view is a separate table built from the original one, with the same data but with a different primary key. The new primary key, particularly the partition key, would be set to the column that you want to search on. Since materialized views are created from a base table, there is a requirement that the primary key for the materialized view must include the primary key from the base table, so that there is a one-to-one correlation between the rows in the base table and the materialized view. Only one new column can be added to the primary key, with the restriction that it is not a static column.

In this example we are making a materialized view based on a users table, which has a primary key using the user id. In order to be able to search for users by their email address, we would need to create the materialized view with the email column as part of the new primary key. In the CREATE MATERIALIZED VIEW example, the AS SELECT part determines which columns we want to include in the materialized view, the FROM specifies the base table, and the WHERE must specify that all primary key columns are not null. Since it's possible that the base users table has rows without an email, those rows should not actually be included in the materialized view. Finally, with the primary key for the materialized view, we have to make sure that it includes the original primary key of the base table, which is user id, along with the column we want to search on, which is email. Once the materialized view is created, you can then treat it just like any other table in your queries. Here we are trying to retrieve the first name, last name, and email of the user in the materialized view users_by_email where their email is ilab bigdata datastax.com, and the result returns the row as normal, as can be seen below.

As far as things to watch out for with materialized views: you cannot actually write to a materialized view; it is updated only when the base table is updated. As well, the updates from the base table to the materialized view are asynchronous, meaning that there is always some sort of delay when updating. Since the partition key for the base table is different from the materialized view's, the node where the base table is updated is not necessarily the same node where the materialized view is updated, and in edge conditions the delay in updating may be more noticeable. Finally, note that read repairs are not a two-way street for materialized views. Triggering a read repair when reading from the base table will repair both the base table and the materialized view; however, a read repair triggered when reading the materialized view will only repair the materialized view itself. The statements described in this example are sketched below.
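The statements being narrated might look roughly like the following CQL; the exact view name (users_by_email) and the non-key column names (first_name, last_name) are assumptions based on the narration rather than copied from the slide, and the email address in the final query is a placeholder.

    -- Base table keyed by user id, as described in the video.
    CREATE TABLE users (
        user_id uuid,
        first_name text,
        last_name text,
        email text,
        PRIMARY KEY ((user_id))
    );

    -- Materialized view keyed by email so users can be looked up by address.
    -- The WHERE clause must assert that every primary key column of the view
    -- IS NOT NULL, so rows with no email are excluded from the view.
    CREATE MATERIALIZED VIEW users_by_email AS
        SELECT user_id, first_name, last_name, email
        FROM users
        WHERE email IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY ((email), user_id);

    -- The view can then be queried like any other table.
    SELECT first_name, last_name, email
    FROM users_by_email
    WHERE email = 'user@example.com';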
Data aggregation is another type of read that is commonly used in databases, and DataStax Enterprise does support some of the most common aggregates. In DSE 6 this means SUM, AVG, COUNT, MIN, and MAX are all supported; note that prior to DSE 5.0 only COUNT was available. Keep in mind that these aggregates are calculated at query time, meaning that DSE goes through the result set to calculate them instead of just pulling statistics from an index. This should be suitable for many use cases, but if you're looking to get the count of the number of rows in an entire table, this would not be the best way to do so.

Fortunately there are several techniques that can possibly get you the aggregates that you want. Some can be done purely through the Cassandra core part of DataStax Enterprise: one is using lightweight transactions to check and replace aggregates as rows get added or updated; another is to make use of the counter type to implement certain aggregates such as count or sum. Otherwise, aggregates can also be done outside of Cassandra, either as part of the application logic or by making use of DataStax Enterprise's analytics or search features.

Here's an example of using the counter type to implement aggregations for count and sum. Let's say that we want to be able to track the ratings given by users for a specific video. In our ratings_by_video table we create two counters, num_ratings and sum_ratings. Then, whenever a new rating is submitted, we increment the num_ratings counter by one and the sum_ratings counter by the rating given by the user. There are some caveats to this technique, though. One is that updating a counter is slower than a normal write, since it does a read before a write in the background. The other is that incrementing a counter is not an idempotent operation: if an update to a counter does not return a response, there is no way to safely retry the operation so that we know the counter reflects the true intended value. In this case the value stored is a very good approximation, but not necessarily accurate.

To perform an aggregate in an application, you are basically performing calculations on the client side. One common way is by retrieving the data needed for the aggregate calculation from the database, doing the calculation, and then storing the result back in the database. Using our counters example from the previous slide, we can make use of the num_ratings and sum_ratings counters to get the average: we just query the value of the num_ratings column and the sum_ratings column and perform the calculation on the application side. If we want, we can also store that average as an average rating column in the videos table. A sketch of this counter-based approach appears below.
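Here is a minimal CQL sketch of the counter-based approach being described; the exact table and column names (ratings_by_video, num_ratings, sum_ratings, avg_rating) follow the narration but may differ from the actual slide, the uuid and rating values are placeholders, and the division to get the average happens in the client application rather than in CQL.

    -- Counter columns must live in a table whose only other columns are
    -- the primary key; here the table is keyed by video.
    CREATE TABLE ratings_by_video (
        video_id uuid,
        num_ratings counter,
        sum_ratings counter,
        PRIMARY KEY ((video_id))
    );

    -- When a user submits a rating of 4, bump both counters in one statement.
    -- Counters can only be modified with UPDATE, never INSERT.
    UPDATE ratings_by_video
    SET num_ratings = num_ratings + 1,
        sum_ratings = sum_ratings + 4
    WHERE video_id = 99051fe9-6a9c-46c2-b949-38ef78858dd0;

    -- Read both counters back; the application computes
    -- sum_ratings / num_ratings to get the average.
    SELECT num_ratings, sum_ratings
    FROM ratings_by_video
    WHERE video_id = 99051fe9-6a9c-46c2-b949-38ef78858dd0;

    -- Optional write-back of the client-computed average; this assumes a
    -- videos table with an avg_rating column, as mentioned in the video.
    UPDATE videos
    SET avg_rating = 4.2
    WHERE video_id = 99051fe9-6a9c-46c2-b949-38ef78858dd0;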
Info
Channel: DataStax Developers
Views: 670
Rating: 4.5 out of 5
Keywords: DataStax Enterprise, Data Stax, datastax, DSE6, DSE 6, 6.0, Distributed Data Show, cloud, cloud database, databases, nosql, no sql, data modeling, software development, Apache Cassandra, cassandra, spark, Apache Spark, Solr, Apache Solr, graph, Gremlin, TinkerPop, Apache TinkerPop, real-time engineering, software architecture, DBaaS, customer experience, help, academy tutorial recipes, how-to, step by step guide
Id: 12A7Fe9o8I8
Length: 8min 43sec (523 seconds)
Published: Fri Aug 14 2020