Secondary indexes in Apache Cassandra

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
okay let's start let's try to understand the secondary index in cassandra because this is the most misused concept in cassandra uh see the reason right if you are from traditional rdbms creating index is very common practice uh let's say you hit with some performance issue what you do is you tend to create a index so which can resolve that issue quickly now in cassandra you have secondary indexes okay you can create indexes on the table but in cassandra secondary indexes are there to give you convenience not performance okay you need to keep that point uh unlike rdbms in cassandra secondary indexes will give you only convenience not performance okay we'll see how how that works so the agenda is uh we'll try to understand what is query first approach in cassandra and we also see what is allocating how to create secondary index and how cassandra stores this secondary index which is very important and we also try to understand our distributed index and local index what is the difference between these two and we'll see what is the right path and rate path of secondary index and we also see what are the best use cases to use secondary index what are the shanty patterns and we also see what are the problems and limitations of secondary indexes and we also try to discuss one more type of secondary index which is sassy little bit about me my full name is anil mithana i'm passionate about cassandra database admin administration and you know i'm a certified uh cassandra admin and certified cassandra developer you can find me on the linkedin link so let's start what is query first approach so in cassandra before you start thinking about data model you need to understand what is your application workflow so again unlike rtb message in rdbms what you do is you create a data model first then you think about application workflow whatever queries you want to run on the tables but in cassandra it is a reverse you need to know your queries first you need to know your application workflow first based on that you create your data model and you create your tables so let's see you know on the left side i want to get the all the ratings left for a movie so that is what my requirement i want to know at any given point of time this is what my requirement from my table i'll ask you give me ratings for a particular movie to do that you need to create a table this way you need to have a primary key configured in this event title is your partition key because based on title only you will look up later right and here email is what email is a clustering column which gives you uniqueness to your primary key and also it keeps the rows within the partition in a ascending order by default so pay attention to the title column which is partition key when you create a table like this and when you insert data into the table this query is very much possible and let's try to query it i already loaded the data i created created the table uh with the similar same create table statement and i inserted the data also now what i want to query what is a client requirement at any given point of time he'll ask me what and all that is given uh for a avatar movie who and i watched and what and all the ratings it got okay now to get that information what i do is i'll write a select query select start from ratings by movie where idle is equal to let's see now when i query that information it's a single discrete what i mean by single discrete how cassandra stores this data when you create a table like this by making title as a partition key so right here title is a primary index basically you call it as a primary index also and partition key now your primary index right is a distributed index meaning what happens here is when you make title as a partition key which is your primary index now cassandra will keep this entire avatar data in one single place in one single node so your table data will be distributed in terms of partitions so now here i i'll call my table as a collection of partitions instead of calling collection of rows i will call it as a collection of partitions right see my partition is having rows ultimately now what what cassandra do here is uh it will keep this partition in one node this partition in one node this partition in one node that that's way you know that way uh you you you can have entire uh information about our movie in one single place isn't it and whenever you query it's a single disk because it you have to go and grab it from one single node because your partition data will never get shared with other other nodes here here i have three nodes here okay now whenever you ask for inter shield or information it will go and grab it from one single node it's a single disk very powerful right now let's say you want to get the rose you want to know the ratings left by a user so that is what my query requirement in at any given point of time i'll ask you this give me what and all ratings given by a user for what are all movies you watched okay if that is whatever requirement this particular table will not work for that particular query okay so you have to create one more table you need to duplicate the data by making email as a partition key here title is doing a question column it is giving you uniqueness and keeping it in order within the partition now you need to duplicate the data you want you if you have a different uh select query where you want to keep a different column in your where condition right now when you create this table and let's see how data will get created i already created that readings by user table and i already loaded some data same data whatever you have it in readings my movie i loaded here but the way cassandra stores the data is different now because i made email as my partition key now what happens all the ratings given by ram will be will you call it as a one single partition and it will get stored in a single node same same with the sharma sharmasi every time richard watches one more movie that data will come and sit in this partition and that partition will get sit in one single node okay so your table data is getting distributed in terms of partitions right now if let's see if i want to query all the ratings given by a user ram it's a single discrete because it has to go and grab it from one single node all good right i hope you got to know what is query first approach by this time let's see uh what else we have now now allow filtering what is our log built in now let's go back to your ratings by movie call table now my business requirements got changed when i started i told you at any given time i'll i'll ask you only uh give me information look up my only title but now i want to know what are all ratings like how give me all the rows where reading is equal to 8 okay clear this now what i'm asking now today is give me all the rows where rating is equal to 8 now is this possible no till now whatever we discussed you know i can only look up by title right so i can only look up my title on this table i cannot look up my rating isn't it yeah because in cassandra you cannot query on non primary columns correct okay let's uh file this query now you know this query will fail because you cannot look up by rating now if you under if you read that query i mean that output what it is telling you is uh cannot execute this query as it might involve data filtering and thus may this may have unpredictable performance right if you want to execute this query despite of uh performance unpredictability use allow filtering now what kassandra suggesting me is it is telling me use allo filtering now let's do that now when i say this it worked so allo filtering is working well i got to know all the ratings where you know it is eight right so can i use all of it right now the answer is a very big no you should never ever ever ever use all of using on your select queries because what just happened here is when you run a low filtering the sender it is all the rows from the table and apply filters and returns the matching rows so let's say you have a 1 million records in your table and out of 1 million let's say 100 rows are having rating eight now when you query this using allo filtering basically what just happened is it retrieves all the 1 million rows and it applies the allo filtering and i mean it will apply the filters and it will give you those matching 100 records okay now which is completely bad uh to do in terms of the sand outage so that's the reason you should never use all of your things basically when you do a low filtering in terms of rdbms basically you are doing a full table step here in cassandra it's a full cluster level scan so it it might uh bring down your uh node also if you do a lot of reference if the table is too huge okay so try to avoid all of this and then how can i how can i get that rose where rating is equal to eight so there comes your secondary index okay so we were talking about secondary index right like let's what i do is i create a secondary index on column rating for this particular table now when you have a secondary index on a rating column basically you can now run that query without allow filtering now see what happened you got the result now without all of filtering so if the index works this well why not go ahead and create indexes on columns whenever you know whenever it is required why we why you know do we need to duplicate the data or why we need to create one more table to satisfy that query requirement right so let's go back to our very second slide here initially i was looking up by title when i want to look up by email because i cannot do it here i created a duplicate table now you might ask me why can't you create a secondary index on email column so that you can look up my email column also we which i just did on reading column right so now now this is the place where things will become little complicated to understand so pay attention now when i wanted to look up a rating now i have two options one is either go and create a secondary index or create one more duplicate table by keeping rating as a partition key two things right now you need to understand how cassandra stores the secondary index data and how our regulator you know we know already however regular table data will get stored right so we need to understand how secondary data will get stored so first thing first what i do is i'll create a duplicate table for our understanding so that i can compare with the secondary index table okay now before even doing that uh what i wanted to tell you is when you create a secondary index let's say you just create a secondary index on a rating column cassandra creates a hidden table you know it looks like this so it keeps the rating as a partition key and it will keep enter primary key of your base table as a part of the primary key here and it creates this hidden table and it will manage it internally okay now now when you create a secondary index you got a hidden table right now we need to know how the cdn table stores the data correct now before that now what we do is like let's create one duplicate table by making rating as a partition key and we'll try to see how that how our regular data will get stored before we try to understand secondary in this table data now what i was telling you is like i am asking you instead of creating a secondary index on rating column try to duplicate the table by making reading as a partition key and see how data is getting distributed then go back and see the second secondary index hidden table okay now i already created this table and i loaded this same data exactly what we have in other tables and let's see how it stores c what happened here when you make rating as a partition key what what it what it does here is now your table data will get distributed in terms of partition so whoever gives you rating eight will be will will be in a single partition isn't it now let's say someone watched the movie and a given rating that the rating that row will come and sit in this particular partition now if i want to query well who are all given rating 8 what i need to do is select start from ratings by rating where [Music] rating is equal to 8 now we already know when this is a single discrete right that's what you get it from a base table isn't it now you ran the same query here also but here rating is a secondary index now pay attention now we know how base table stores the data or it stores you know like all the ratings given to all the ratings of eight will be in one place all the ratings of nine will be in one place right that's how your duplicate table store the data so let's switch to secondary index now how secondary index stores data very important now let's yeah this is your regular table this is your hidden secondary index table now this is whatever it looks like now here when you create index your index data will be collocated with the source data on the same node what i mean what i mean is this is whatever uh data on the node one and for author movie three people given rating eight three people given rating nine and you know three people given seven so only that data will be collocated in the same node in that hidden table right so the index data of the index data of this partition will be stored in this node right so basically all the ratings of eight are are not in one place but that is how it is in the original base table right original duplicate table where you create a ratings by reading table but your secondary index hidden table will not show that way it will not distribute instead it keeps the index data local to that node so very important now if you run that query select start from ratings by movie where rating is equal to 8 to give you that output what just happened here is because it is not in one single place and it doesn't know where it is it has to scan every single node so it is it scan this node it will come to know okay there are three rows here and also it will come to know so here also there are data so it scans the so every node will get involved when you query that when you query based on secondary index when you say that the select start from ratings by moving where rating is equal to eight when you said that what happened is all the nodes within the cluster are involved in that query you have thousand nodes thousand nodes will get involved in that query so which is completely bad isn't it secondary index when you use second in index when you query by secondary index you are involving every single node but when you duplicate the same thing you are involving only one single node here here right when you have a readings by rating where rating is your primary index primary index is a distributed index so whenever you query only one single node will be involved and when you create a secondary index secondary index will create a hidden table whenever you run it run this query basically it's a whole every single node in the cluster will involve now here i'll tell you it also depends on your replication factor also so let's say you have 100 nodes with replication factor is equal to 3 in that case you know you need to hit at least at 33 nodes now because your replication factor is equal to 3 if your replication factor is equal to 1 you need to hit all thousand nodes i mean all 100 nodes okay so it also depends on your replication but factor but still you have to hit lots of nodes to give you this output so you got to know like what is happening with secondary index secondary index is a local index basically so your in index data will be collocated with the source data on the same node because of this reason whenever you query all the all the nodes will get involved now you might ask me why secondary index data is getting showed in this way why cannot it keep all the ratings eight belongs to in one single place right why it is keeping only level data why can't it uh go ahead and distribute letting eight into one place rating nine into one or breaking seven into one node why can't you do that because if cassandra behaves in that way you will end up with white partitions what i mean let's say you have uh you have like in your table you have two lakh regards okay out of two like records imagine ninety percent of the rows are having rating is equal to eight now if you think in that way right like you know why can't it is keeping all the eight related data in one node what happens 95 percent of the data will be in one single node so this node will have a wide partition and it is it will have additional extra load also in terms of data so to avoid that to avoid the white partitions what it does what it does is it will keep the data to local and one more thing uh second point is uh updating index also so you this way when you have a local index it is easy to update your english data whenever you make changes whether it's a update insert or whatever you're doing updating index data will be very very much easy when you have a local english data right when you when your index is distributed then it is tough because it doesn't know where to go and update the data right so two things why it is keeping uh data local image data local okay two things these are the two technical reasons why a cassandra will not distribute your secondary index data because one is to reduce index update latency one more to avoid arbitrary wide partitions okay now we already know what is distributed index and local index so your partition key is a distributed index which is in your base table your secondary index is what you call it as a local index now let's see how cassandra write index data how it reads the data right now writing when what happens you have a whenever you create a secondary index you have an index meme table and you have a base name table basically whenever you flush uh you know that in this mem table and basement table will go into the same asses table again it's not it is there is no guarantee also in this behavior once the data is written in this ss table your index data will have a different life cycle uh whether it's a compaction all that stuff it will not uh go with the what you have for your base table data okay it follows a different life cycle now read path we already understood whenever you are reading by secondary index it is not quite straightforward isn't it what it does is first it reads the index table to retrieve the primary key of all the matching rows then it reads the original table let me see and show you this right see whenever you are saying it is rating is equal to eight it reads the index if it does one look up on your index data and it will pick up all the uh you know primary key information and it will go and grab that information again so two discretes one it will look up into the index and it will get the primary keys and it will go and get that information from your base table right so again secondary index will have a real latency because it is doing good too because it is not straightforward right so again there are like we are loaded with so many reasons to not use secondary index because clearly secondary index is not there to give you performance isn't it then you might ask me then what is the point of having second index in cassandra right so if they have if they are there in this in the product meaning they must have some reason to be there now what what would be the reason right now the best case is i'll tell you what is the best use case c so you instead of asking give me all the rows for where reading is equal to 8 which will do a entire cluster scan right instead of asking that what you ask is give me ratings give me all the rows where rating is equal to 8 for a particular movie right so see if i say select star from ratings readings by movie [Music] where rating is equal to eight see if you stop your query at this moment it's an expensive query but what if if i keep title is equal to avatar now this is this is what you should use this is what a real transactional query right now when you run this query what happens is that it it just go and grab it from one single node isn't it so you are limiting your uh query to a single node and it is very fast way of getting the output isn't it so if you stop here it's a entire all the nodes will get involved when you stop when you run this query ratings now you got this information by querying only one single node isn't it right because you in you kept partition key also in your select query right now that that is what you need to do see you have to limit your uh you know node hit so one way right this is one way which will limit the select query to one single node or you can even say limit is equal to let's see three now what you are telling instead of scanning every single node whenever you reach whenever you get three right you stop looking and give me that output right so this way let's say you have 100 nodes and the moment it reaches that output it will let's say this query hit the three nodes to give me that output so that's it right you got the output by hitting only three nodes so use limit or use partition key along with the secondary index so that is the best way of using secondary index that's what i was telling secondary indexes are there to give you convenience not performance right now what are the best use cases obviously you have to restrict the query to a single server how you do it by introducing partition key along with the secondary index and also limiting the nodes to a lesser number maybe you can use limit to do that oh and one more thing second indexers are very helpful in analytics uh workload like when you are using spark sender connector where you know your sla doesn't measured in terms of milliseconds so these are some best use cases to if you want to use secondary index now what are the anti patterns right and worst antibiotics you should never create a secondary index on a very low cardinality for example gender right no because you have a only two options i mean uh very less options right like a male or female so you will end up with the white partitions when you have second index on that gender column and you should avoid creating secondary index on high quality column also for example best example would be your user email because you get to have lots of distinct values for your index data isn't it whenever you query you might get that output by hitting one single node or it has to hit every single node maybe you end up with zero rows right so you have to avoid these two options again uh obviously if you are doing lots of updates on the index column you shouldn't create index on that column at all right you have to avoid the index a column where you update very frequently now what are the problems and limitations with secondary index see again it's a shocking news if you are from rdbms you cannot you can query only using equality parameter you mean equality on your secondary index column you cannot put range queries inequalities on your uh where condition on the secondary index okay and some the one more the problem on second point is actually a problem what what it telling you is because you were you will not compact your secondary index data and the base table data together it because it will not follow same steps right it has a different life cycle you might uh end up with uh like that that's what right independently compacting as a stable and indexes means the location of the data and ns information are completely decoupled so maybe your you you index data is pointing your research tables but that is tables got compacted then obviously you the you know it will not point that location on the disk right which was it earlier it is pointing so it's a problem basically with secondary index now basically now what is chassis index so in general there are two types of in the secondary indexes in cassandra one is what you what you witnessed till now native secondary index right where you know the limitation is you can only use equality uh operator cassandra will also give you one more kind of a secondary index you call it as a ss table i mean sassy index which is stable attached secondary index when you have this basically you can actually do in inequalities all right everything is similar but it will give you more flexibility you can do range queries again sassy is experimental you shouldn't use it in production production cassandra clasher now uh let's try to do some practicals on sassy index now what i do is uh on the same table let's create sassy index so i'm creating a sassy index on readings column and we'll try to do inequalities uh queries right so i created a sas index it is giving you a warning sas indexes are experimental or not recommended for production use right make a note of it now once i have it obviously you need to use partition key along with the secondary index that's how you should use your secondary index and you can use now you can use inequalities also right previously you can only query rating is equal to 8 now you can ask less than or greater than equal to no operators also now what is expensive is querying directly on secondary index isn't it basically it will involve every single uh node so those are some many qualities and uh you can also create a let's create a secondary index on a sas index on user location and we'll try to use like parameter to query the data right so what would be the real-time transaction query you have to you keep your partition key that's how you should use secondary index and you can get this right and what would be the expensive obviously without putting partition gain right now you can use limit let's say limit five right so that you can avoid uh hitting number of nodes now now you you you will think like okay second indexes are okay sassy is much better uh compared to secondary index right again uh that you need to understand again sassy index basically sassy attaches to each assistant and you know uh whatever is trying to do i i understand you here is when you create a second you know index what it does is it creates uh one it's a stable for you for your right now uh you for your english data so to understand that point what i do is like i'll try to say no to flash so secondary index is a hidden table right you it never have a physical representation but your assassin will have a physical representation right okay we'll see what is that so what i do is i'll go to my data location uh that was that is my question p then let's take an ltr because i dropped my tables a lot of time so i see lots of folders in cassandra when you drop a table you have to go and clean up the physical location also now if you see here sassy attached to each assistable it's immutable index file and also attaches an index to each time table so what happens here is every time you come back during the compaction the sassy index file also compacted together to create a new index files okay so if you decide decided to use assassin production keep in mind that sassy impact your compaction repair and right throughput because it is expected because sas index file follows assistable life cycle right so secondary index doesn't have a hidden like uh it has an intel but here you you have to deal with the access tables also so again that is also consideration you have to deal with the compaction right throughput latencies every single bit whatever the asus table will go through your sassy immutable files also will go through okay so two two things you keep in mind because they are still experimental only so that's it we come to an end i hope you understood out of this something what i wanted to convey is secondary indexes are there to give you convenience but not performance okay and if you want to use them make sure you keep a primary like you make sure you keep a partition key also from the base table or else try to use limit i mean the point is you need to limit the query hitting less number of nodes and you should avoid creating secondary index on very high coordinating columns and very low coordinate columns okay and very general recommendation is to have at l at most only one secondary index per table and most of the time you should avoid a secondary index completely creating them okay yeah with that uh i would like to thank you if you are still here and listening to the whole piece stay home stay safe thank you and please subscribe and like this video if you want me to do more number of videos on cassandra database administration thank you
Info
Channel: Anil Mittana
Views: 370
Rating: 5 out of 5
Keywords: apache Cassandra, datastax, dba, nosql, oracle
Id: ABFZj9_BuOE
Channel Id: undefined
Length: 38min 4sec (2284 seconds)
Published: Tue Apr 27 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.