InstaBlinks: All Things Apache Cassandra, a Distributed Database Technology

Captions
Tim: Hey, this is Tim here at Instaclustr. Welcome to another episode of InstaBlinks. Today we are joined by Instaclustr senior software engineer Jordan Brayuka. Jordan, welcome.

Jordan: Thanks for having me.

Tim: Thanks for coming, mate. So today we're here to talk about distributed data technologies, and specifically Apache Cassandra. Jordan, you've spent a lot of time in battle with this technology. Can you tell me a little bit about why Apache Cassandra has become the go-to distributed database technology?

Jordan: Yep. To start with, it's quite easy. There's quite a low learning curve on Apache Cassandra, because when you start to interface with it through things like cqlsh, it looks very similar to a normal SQL database, so you can get your hands dirty quite easily. But the main reason people love it is that it's so scalable. It has what we call linear scalability, the idea being that three nodes will perform three times as well as one node, and nine nodes will perform three times as well as three nodes. So for all these massive web-scale companies, when we're talking millions to billions of writes per second, they're not having to sink more and more money as they get customers; it's much more of a linear input in terms of the running cost. It also has really excellent fault tolerance, where you can lose a large percentage of those nodes and still work, and it's partition tolerant as well, so even if you do have network issues it can handle that nicely. And of course it's open source and has such a massive community behind it. I think version 4.0 is due to drop this year, so it's been proven, it's been tried, and there are fewer and fewer bugs each release.

Tim: Okay. So when you talk about these massive advantages like linear scalability and cost efficiencies, how does that differ from the technologies of yesterday, more traditional tech like relational databases, SQL and those sorts of things?

Jordan: Absolutely. Traditionally, when you've got your traditional database and you need to scale, you generally scale up, and by that we mean you put more expensive hardware on: more cores, larger machines. Obviously there is an upper limit to this, though, and the larger the hardware, the more it costs. By running on Cassandra you can run on much more commodity, smaller-instance machines, and the cost is significantly reduced.

Tim: Okay. One of the big things we've seen over the last 10 years is a transition from on-premise legacy technologies that use these relational data models to highly distributed systems that are based on cloud. So how does tech like Cassandra help organizations become more cloud friendly in the way they deliver IT?

Jordan: It's just significantly easier to perform operations like scaling, replacements, upgrades, maintenance, all those sorts of things. Instead of having single-instance databases or active-passive setups, with this fault tolerance you can slowly upgrade nodes or upgrade node sizes. All those operations become a lot easier because you can have downtime of instances without downtime of clusters.

Tim: I guess one of the big things we talk about at Instaclustr, when we're talking to a customer that's either currently using Cassandra or looking to adopt Cassandra because they're reaching the upper limits of what the relational database can give them in terms of scalability, is the data model. If you don't get the data model right, you're setting yourself up for failure down the track when you hit true production scale. So can you talk to me a little bit about the fundamental differences people need to understand when writing to a Cassandra database versus writing to a relational database?

Jordan: Absolutely. In typical relational database modeling, you design your tables around your objects. So you might be Tim; you have a certain [unclear], a first name, whatever. We would put all those objects in a single table, and then when it comes to query time you basically query from different tables, pull the information together how you need it, and then return a result. Cassandra is the exact opposite. You don't design your tables around the real-world objects; you design your tables around how you want to query your data. So if you want to get Tim's [name] and address in a query, you would put all of that information in a single table. You wouldn't leave it as two different objects; you'd bring it together and return it. So your data model needs to be designed based on your queries, not based on the real world.

Tim: It sounds like quite a big step change from what developers are used to.

Jordan: Yeah, exactly right. It's different, but once you get your head around it, it is quite simple. An idea in Cassandra is that it's cheaper to store something twice than to process it twice. So if you need the same portion of data to return in two different results, you store it in two ways. Sure, you might pay a little bit for another replica of the data, but you're saving so much cost in processing power when it comes to the joins that it's really offset.

Tim: Okay. So when you're going through that journey, it seems like you've got to learn another language, right? Is it as big a transition as, say, going from speaking English to speaking Spanish, or is it a bit more subtle?

Jordan: It's significantly more subtle. A lot of the nuances are SQL-like, so you're still doing your SELECT statements, you've got your WHEREs, INSERTs, that sort of stuff. It's more just that thought process of "I need to get my requests down first and then do my data model." Once you accept that, it's not too bad.

Tim: Jordan, the benefits of Apache Cassandra all sound pretty amazing, but we've also seen a lot of the other side of the coin, where users of Apache Cassandra have gone down a rabbit hole and introduced a Cassandra antipattern or got the data model wrong. Are there any pitfalls around Apache Cassandra deployments that you think would be worth highlighting?

Jordan: Yeah, absolutely. We've already touched on it, but the data model's the biggest one. With Cassandra databases you can get them wrong and they will work at small scale, but then you'll scale up to a point and it will just fall over. So it's understanding your data model, and understanding your queries as well. If anyone's watching this and they've got ALLOW FILTERING at the end of some production queries, that's a giant red flag. You need to go back and think about that, because you're querying in a SQL way when we want to be querying in a Cassandra way, taking advantage of the distribution and partitioning that enables you to scale. We've seen customers start at small scale and their application works quite well, and then all of a sudden at 2 a.m. it falls over because it has reached critical mass and their latencies are through the roof, and it usually boils down to a bad filter or a bad data [model].

Tim: What are some of the things that relational data models do really well versus Cassandra? Typically, if you were to describe a Cassandra-friendly workload, what does it look like from your perspective?

Jordan: Large scale, so we're talking lots and lots of writes per second; we're not talking about, say, reads. If you do have a significantly smaller workload, you might not consider that the benefits of Cassandra outweigh the cons, because they really come to fruition at high scale. On data consistency, Cassandra can do high data consistency, but generally if you're talking much more consistent data, people will put it in a relational database, and again only because it's smaller scale and it's worth the trade-off. We find lots of IoT applications for Cassandra, where there's massive sensor data and we're just pumping heaps of information into it at once. Cassandra has some really good things based around what we call compactions, and it handles time series data exceptionally well. So things like that are what we really find Cassandra is commonly used for.

Tim: One thing I commonly get when having conversations with people who are looking at open source Cassandra versus other cloud native technologies like DynamoDB or Cosmos DB, and one thing we find time and time again, is that its multi-geo, multi-data-center availability capabilities are second to none. What that means is that its ability to deliver always-on uptime of the database itself is second to none and hard to compete against. From your perspective, is resiliency a big drawcard of Cassandra in general?

Jordan: Absolutely. I can't remember the company, but there was someone in the US who basically lost an entire data center during a natural disaster, and by that we mean it was just gone off the grid, and I think their application had sub-second outages because they had a second data center that all the data was replicated to. Through the replication you can set up your data to be fault tolerant to any number of things, so entire cities or entire coastlines can lose power and your application can still persist through.

Tim: Yes. And again, when you're comparing that capability to traditional relational data technologies, you're also then talking about multiple license costs. The old adage is that if you want five nines of availability, you need to add five zeros to the price tag. Which is pretty amazing when you contrast that to Cassandra, right? Because we're not dealing in the world of licenses anymore.

Jordan: No, absolutely. There are no licensing costs on an open source project like this, so you can get that availability with significantly reduced costs. You might need to duplicate or triplicate your system or your cluster to, say, three data centers because you do want that availability, but it's going to be triple the cost; it's not going to be times nine. And that all comes down to the hardware cost of running that cluster.

Tim: Look, Jordan, mate, I really appreciate you coming along to join me on this. I think it's been super insightful, and I look forward to getting you back to talk about the next technology on our roadmap.

Jordan: No worries. Thanks, Tim.

Tim: Thanks, Jordan. And remember, everyone: always be clustering.

[Music]
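
Jordan's two modeling points (design tables around queries, and treat ALLOW FILTERING as a red flag) can be illustrated with a small sketch. This is a plain in-memory Python toy, not Cassandra and not a Cassandra driver; the sample rows, the city values, and every function name below are hypothetical and chosen only for illustration. It contrasts scanning every row to answer a query with storing the same data a second time, keyed by the value the query looks up.

```python
from collections import defaultdict

# Hypothetical sample rows, stored the way an object-centric design would keep them.
users = [
    {"user_id": 1, "name": "Tim", "city": "Canberra"},
    {"user_id": 2, "name": "Jordan", "city": "Canberra"},
    {"user_id": 3, "name": "Alex", "city": "Sydney"},
]


def scan_users_by_city(rows, city):
    """Scan-and-filter: examines every row, loosely analogous to a query
    that only works with ALLOW FILTERING."""
    rows_examined = 0
    matches = []
    for row in rows:
        rows_examined += 1
        if row["city"] == city:
            matches.append(row["name"])
    return matches, rows_examined


def build_users_by_city(rows):
    """Query-first table: the same data stored a second time, keyed by the
    value the query looks up (a stand-in for a partition key)."""
    table = defaultdict(list)
    for row in rows:
        table[row["city"]].append(row["name"])
    return table


def lookup_users_by_city(table, city):
    """Keyed lookup: goes straight to one key without touching other rows."""
    return list(table.get(city, []))


users_by_city = build_users_by_city(users)

if __name__ == "__main__":
    print(scan_users_by_city(users, "Canberra"))
    print(lookup_users_by_city(users_by_city, "Canberra"))
```

The scan does work proportional to the total number of rows, while the keyed lookup touches only the entries for one key. That is the trade the transcript describes: pay for a second copy of the data so that the read path stays cheap as the data grows.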
Info
Channel: Instaclustr
Views: 213
Rating: 5 out of 5
Keywords: apache cassandra, cassandra, distributed database, open source, open source database, dbms, open source dbms, datastax, dse, datastax cassandra, what is apache cassandra, what is cassandra, nosql, nosql databases, cassandra nosql database, nosql database, why cassandra, why cassandra is faster, cassandra tech, databases, DBaas, cloud database
Id: ONkahMpiQvE
Length: 11min 0sec (660 seconds)
Published: Mon Oct 19 2020