Tech Talk: Cassandra Data Modeling

Video Statistics and Information

Captions
So why do data modeling and time series fit together? I'm going to try to explain why. First, let me establish my use case for why I want to talk about this. The Internet of Things is a very interesting new development; it's almost getting to be a cloud-style buzzword now. IoT — is that I-o-T? Yes it is. But regardless of which buzzword it is, whatever gazillion-dollar market some analyst says it's worth, it still comes down to one thing: we're connecting things to the internet really quickly. Do you know what that thing up there is? That's a Nest. When I said that in Europe, they were like, that's a Nest, right? Google just bought Nest for three billion dollars. Now, do you think Google was in a board meeting saying: we've got this cool self-driving car, a great search engine, Google Apps, that's all cool — we need a thermostat? Is that what they said? I have one in my house too. Google knows why they're there: "we need that data too." They bought three billion dollars of "what are you doing, you're creepy." It knows when I'm there, it knows what I need, it knows the temperature, and Google wants that information. Here, I did this, it was really fun: I just changed the temperature of my house from here. It's currently 64 degrees in my house. How cool is that? And that's what this is — it's the Internet of Things. Did I even once say "Internet of Things" while I was talking about it? But it's a cool product, and this is moving fast. CES in January: refrigerators talking to the internet, telling your grocery store you're running low. These are all the things in your future, and your car now talks with Sync. All of these devices are going to be talking: about 15 billion devices today, with an estimated forty billion devices by 2020. That's a lot of mouths to feed, and it all means pretty much one thing: this all has to go into a database somewhere.

So let's talk about what that is, and it's almost always going to be time series data — temporal data: here's something that happened at this time. Cassandra is a great home for that. It wasn't really designed for that initially, but Cassandra works well for time series, and I'm going to use that as a data modeling launching point. As you saw in my last talk, one slide proved it all, and if you're not a Cassandra user you can prove it to yourself: Cassandra scales. That's a good thing for IoT, because there's going to be a lot of data. It's resilient, which means it's going to be online all the time and deal with all the crap I throw at it. It's a great data model — aha, let's talk about that, that's going to be a big part of this. But beyond that, and this is kind of the secret sauce of why Cassandra is so good at time series data, is its efficient storage model. Now, as a programmer I really don't care too much about the storage model; that's not something I concern myself with while I'm writing code. But I want you to understand what's going on, so that when you're writing code and looking at the data model, you understand what's happening under the scenes — because let's face it, that is the power of what Cassandra can do. So let's start with a data model example, because I like examples: the good old weather station. A classic, right? This is classic time series:
temperature over time. I have a weather station here, I'm putting that data into a Cassandra ring, and I'm going to display it in whatever format my UI grabs. Over the whole day I graph all the different data points. Great — that's an easy one. So where do you start? Data modeling is a little different in Cassandra, because instead of taking the data and then building models, we're taking our application and building the model from it. I'm going to start at the top with what my application is going to need as a query. How is the application going to ask for this data? I'm going to graph this data on my website, so what do those queries look like? The first thing is to get my needed queries down: I want to get all the data for one weather station; I want to get the data for a single date and time; and I want to get data for a range of dates and times. All very similar — all keyed around the single weather station ID — but a little different: one wants a single date-time, one wants a range. So how do I model that?

For my data model to support these, it's pretty simple. I'm going to store data per weather station — that's my first requirement. My second requirement: I'm going to store it in time series order, first event to last event. So what does that look like? I'm going to create a table, and my CREATE TABLE is pretty simple. I have a weather station ID, which is just text — I use text because the actual value down here has some letters in it — an event time, which is a timestamp (this is when the reading was actually taken off the weather station), and then a temperature as text. Now, the primary key is the interesting part here; this is what makes the data model rich, and I'm going to show you why it makes a difference in a minute. This primary key is an important piece of your table creation, because what you're specifying here is uniqueness and the partition. A partition is: here is all my data that I'm going to order a certain way and co-locate together. Much like how a relational database has a primary key — these two values combined are unique — it's the same idea here, but what this is also doing is changing the way I store my data. I'll explain that.

Here are the inserts. Look familiar? INSERT INTO, and I have one minute, two minutes, three minutes, four minutes: every minute it's collecting data, it puts a time and a temperature in there — boom, Bob's your uncle, good to go. So whenever I do a SELECT of weather station ID, event time, and temperature FROM temperature WHERE the weather station ID equals whatever it is — 1234ABCD — I get back a table that's exactly what I was hoping for. Now look at the order of that data: I specified in the primary key beforehand that the order I want is from the first thing that happened to the last. Going back to the table, the event time is the second part of the primary key. The first part sets up the partition; the partition is this weather station ID; and every element in that partition will be ordered by event time. That's how I would describe it to you: yes, these two together are unique; the weather station ID sets up the partition; the event time is the order I want within that partition. So whenever I do a SELECT off of this, I get everything in order without having to sort it in memory. Good.
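A minimal CQL sketch of the schema and queries just described — reconstructed from the talk, so the column names and sample values are approximations of the slides rather than verbatim:

```sql
CREATE TABLE temperature (
    weatherstation_id text,   -- text, because IDs like '1234ABCD' contain letters
    event_time timestamp,     -- when the reading was taken off the station
    temperature text,
    -- weatherstation_id is the partition key; event_time is the clustering
    -- column that orders rows within the partition
    PRIMARY KEY (weatherstation_id, event_time)
);

-- one reading per minute
INSERT INTO temperature (weatherstation_id, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03 07:01:00', '72F');

-- all data for one weather station, returned already in event-time order
SELECT weatherstation_id, event_time, temperature
FROM temperature
WHERE weatherstation_id = '1234ABCD';
```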
So what happens in storage? Remember, I teased you about that a little bit: the storage engine is the cool part of the story. This is the part where, as a programmer, you might depart and say, "I'm good with this, I'm moving on, I'm going to write my code and ship it, because it's an agile thing and I've got a demo day tomorrow." But let's look at what really happens, because I think it's the most interesting part. What's really happening in the storage engine is this: it lines up that partition in a single storage row. The partition — which is specified by 1234ABCD — is actually the beginning of a file; it's a file header, and then every record in that row sits sequentially on disk. Those are the elements of the partition. What does this get you? First of all, it's merged, it's sorted, and it's stored sequentially on disk. And what does that get you? When I start adding data, it just appends to the end of that record, and everything is still in order because it's merge-sorted and sequential.

That's great for storing data, but what's the point? Why are we doing this? The point is for whenever I do something like this: this is my query where I ask for a range of time. Keep in mind that was one of my application requirements — I want to be able to grab a range of time. In this case I want to grab from the one-minute mark to the four-minute mark; not a very exciting time period, but that's what I want. This is called a slice query, or a range query. Notice the greater-than-or-equal-to and less-than-or-equal-to: that sets up the boundaries. And what does this mean, based on how I store the data? First of all, I get a single seek on the disk. Unfortunately, Al is right next door telling you why disks suck — if you could be over there and in here at the same time you'd be freaked out — and you'd hear him say: here's the problem, you're going to be running into disk problems. I'm going to tell you: yes, you are going to run into disk problems, trust me, but the way to get around some of those issues is to minimize the number of seeks you do on disk. A seek on a disk hurts. Look at everything in your computer: the CPU works in nanoseconds; memory is in microseconds, or even nanoseconds now; but what does the disk do? A spinning disk works in milliseconds. That's orders of magnitude slower, and we haven't really fixed that problem — sort of, I mean, there are new disk technologies available now, but most storage systems in production right now are spinning rust, and they're measured in milliseconds. If you're using a 7200 RPM SATA disk, you're looking at 10 milliseconds of seek time. Pay attention to that: my query just turned into a 10-millisecond query based on that one thing, and if I'm chasing milliseconds, that's going to hurt. Getting less than one seek — sorry, you can't do that — but doing fewer seeks is a good thing. So that single seek on disk is what I want. I want to put the disk head right at the beginning of my data, because what I'm trying to do here is get this range: I'm going to grab all this data right here, and when I ask Cassandra to get it, it moves the disk head there once, sweeps across, and puts it right back into a CQL table for me. Programmers see this and they like it. This is what makes the data model really efficient: because of how it's stored on disk, when you seek to that data, it's really fast.
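The slice query being described would look something like this — again a reconstruction, using the same assumed names and illustrative timestamps as above:

```sql
-- range ("slice") query: equality on the partition key,
-- >= and <= boundaries on the clustering column
SELECT weatherstation_id, event_time, temperature
FROM temperature
WHERE weatherstation_id = '1234ABCD'
  AND event_time >= '2013-04-03 07:01:00'
  AND event_time <= '2013-04-03 07:04:00';
```

Because the partition is merge-sorted and contiguous on disk, this is one seek plus a sequential sweep.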
You know, when I ran Oracle, the game was all about IOPS. Why? Because everything is random — all the I/O is random. In any relational database you have a DB file, and it randomly reads and writes to that file, so you'd better have a lot of IOPS, because that means you can do a lot of seeks and maneuver around. Cassandra doesn't do that. Cassandra writes out those SSTables one time — once — and never does a random write afterwards.

[Audience: what if you want it ordered by temperature?] Yes, if you want to order by temperature, that's a different game. Right now we're in this order because I specified in the data model that I only want it ordered by date. I can specify an ORDER BY in the query, but that will have to get sorted in memory; my data model said I need to do a date slice. If I wanted to slice on temperature, I could change the data model around a bit (sketched below). [Audience: in Oracle you'd create an index on that column and cover the query, so the temperatures would already be sorted.] Yes, but — if you create an index on a second field in Oracle, it doesn't order the data, it only makes it more accessible. When the query plan optimizer runs, it will find the data faster, but the data is still not sorted on disk, and it still has to do a random read from disk. [Audience: a covering index is sorted.] You can do a covering index, sure, but that's my point: the base data isn't sorted. You read it out of the index, but when you want the actual data, you're going to do a lot of random reads. If you put a data model similar to this one into Oracle, you will hit the limit real fast, mainly because you'll run out of IOPS — sequential operations are going to be faster, and the larger the dataset, the more that matters. I have one other reason this is going to be faster as well, and that's because you're going to distribute those queries. But let's hold off on the questions for a minute and let me move on to the next topic; maybe that'll answer some of them.
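To make that temperature answer concrete, here is a hypothetical variant of the table — my sketch, not something from the slides — that changes the data model so readings cluster by temperature instead of by time:

```sql
-- Hypothetical alternative: order readings by temperature within a station,
-- so temperature slices come back pre-sorted straight off disk.
CREATE TABLE temperature_by_reading (
    weatherstation_id text,
    temperature text,
    event_time timestamp,
    PRIMARY KEY (weatherstation_id, temperature, event_time)
);
```

Same data, different clustering: in Cassandra you typically denormalize into a second table per query shape rather than adding an index.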
It's not just about this one thing; there's a lot more to it — I've got more secrets. So what about some additional help from the storage engine? Let me show you what the read path looks like in Cassandra, in a nutshell. When a client asks for data, it first goes to the key cache. The key cache is like: hey, I have that first part of my primary key — where is that on the disk? If the key cache gets a hit, you go to one data file and do one seek, and you're done — because your total read time is going to be the disk latency times the number of seeks, and we want to minimize that. We can measure all of this in Cassandra: there's a tool called cfhistograms, which I did a talk about — Google my name and you can find that talk — that lets you watch every single read and see how many seeks it takes. This is a very important tuning technique if you're trying to get to low milliseconds on every query, and I've done it many times: 500 microseconds at the 95th percentile is very doable if you do this right.

The next piece is the bloom filter, which I talked about in my last talk. Bloom filters are just a probabilistic function, and all they do is tell you where the data is not. So again, we're trying to find out where the data isn't: if there are a thousand SSTables to look at, I'd really like to narrow that down to a couple if I can, and the bloom filter is yet another way to get there. Those two things in conjunction help get us closer to that minimum number of seeks, but it's not always going to work out, because in cases where that row of weather data spans two or three SSTables, I'm still going to do a lot of seeks.

So here are the keys to speed in Cassandra, based on that read model — and this is important. Point number one: the first part of the primary key, the weather station ID, gives you location. That puts you on a single server — I'll explain how that works in a second — so if you have a thousand-node cluster, that first part puts you on one of those servers in the cluster; it will not be a random seek. Second, the clustering part of the key gets you closer within the SSTable; you want to minimize that seek, you want to get into that file as soon as you can. And the third thing, which I'm putting up here just for brevity — I do a lot of performance-tuning talks, so we won't cover it as much — is how fast you can find the data inside the SSTable: if you have a three-gig file, you can use indexes to find that data immediately, so once you do a seek, you go straight to where the data is instead of having to sweep the disk head across the disk.

There's also a new feature as of 2.0 that will help you get to that data faster, and it's actually there for time series data: in those cases where I might have three files on the file system that could potentially hold my data, I want to minimize how many seeks I do. So, the same query — I'm looking for data between one minute and four minutes (I know there's a bug on this slide, I forgot to put the equality filter in; I've already been called out on that) — and here are three files it could potentially be in. The min/max hint works like this: when we create those files, we now also build an index on top of each file that records the minimum and maximum values of the clustering key you've set up in your data model, and we can use that later. How is it used? The minimum event time in this file's index shows March 27th to 31st; this one is April 1st to 4th; and this one is April 5th to the 9th. So when I look at my range over here and ask the question — hey, I'd like a slice of that data, cut me off a piece of that pie — before 2.0 I had three candidate files, which means three seeks just to make sure I got it right. With this index hint, I know the data fits in this one file. Time series data can really take advantage of this — it works for other data models too, it's not just for time series — but this right here is how you get to the right file fast. This is how I do time series data models in under a millisecond. And in IoT terms: when I'm messing around with my Nest — by the way, Nest uses Cassandra — I want to make sure I'm changing the temperature of my house and annoying my wife really fast. I don't want to wait; I'm going to change that temperature really quick.
Funny story about that. Our co-founder Matt Pfeil has a Nest, and he had a roommate, Mike, who runs our apps engineering team. When Matt was on the road at night, he'd jack up the temperature remotely. Mike would wake up sweating: what's going on? The thermostat's broken! So what did he do? He called support: hey, this thing's broken. Oh man, that's terrible — so they're all trying to diagnose it, running all these tests, trying to figure it out. They got back to him on email: we think we found a bug, apply this patch, do all this stuff. And finally they figured out that Matt was the one turning up the temperature randomly. When the guys at Nest found out, it wasn't pretty. Anyway, keep in mind: you may get busted eventually. He can't do it anymore, and now whenever I visit Nest they're like, say hi to Matt for us — you made us do all that extra work.

Real quick, I want to go through some ingestion models — how you get that data in there — because this is the next question I always get. If you're at the point where you have a database that can do a million writes per second, you have to have an application that can do a million writes per second, and that is not a trivial problem. So let me point out some of the methods for getting the data in, because this is a new game: if you're talking Internet of Things, you have to bring a lot of data together. The one I hear so much about right now is Kafka. It was a project at LinkedIn that graduated and grew up into the big world of Apache — it's a top-level Apache project now, based on Scala. It's a cool queueing system built on basically the same paradigm as Cassandra's horizontal scale: more nodes, more scale. Lots of people are using it now, and it's a really cool project, but I usually see it in conjunction with Storm. Storm is a processing framework: if you're taking in a lot of stuff — say, refrigerators all over the world telling you what temperature they're at — the readings go through Kafka, which makes sure each one gets once-and-only-once delivery, and Storm then rolls the data up and does different things with it. The idea is: you've got a lot of data coming in — what are you going to do with it? You need a database that can scale up with it, and Cassandra seems to be where people land.

Another one is one of my own projects: Flume. I was recently at the Hadoop Summit — which is kind of funny for a Cassandra person — and one of the guys in my talk was Jonathan Hsieh, the original author of Flume. As I'm talking about Flume I see this guy smiling in the back — he used to work on this project — so it was kind of cool seeing him there; back to the old school. Flume is an older project at this point, about four or five years old, originally built for ingesting logs; I used it for bringing in syslog events. It's a very simple program: you create a source with a listener, like syslog or HTTP; there's a channel, which is like a bus; and then a sink, which picks the events up off the channel — and along the way you can do transforms. What I did was build a system that took in HTTP events from our F5 load balancers, parsed out the entire log string, and created 90 different inserts based on that one web hit. I created this massive write amplifier. I was trying to put it into an Oracle database at first — it failed, and then I simply ran out of money: the only thing that really could have scaled up was Exadata, and I didn't like the idea of spending two million dollars on this one little log-processing thing. So I started using Cassandra (a sketch of that fan-out idea follows below).
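To make the "90 inserts from one web hit" fan-out concrete, here is a hypothetical sketch — the table and column names are invented for illustration, not taken from the talk — of one parsed log line feeding multiple query-specific tables:

```sql
-- Each table answers one query shape; every incoming hit is written to all of them.
CREATE TABLE hits_by_url (
    url text,
    event_time timeuuid,
    client_ip text,
    PRIMARY KEY (url, event_time)
);

CREATE TABLE hits_by_ip (
    client_ip text,
    event_time timeuuid,
    url text,
    PRIMARY KEY (client_ip, event_time)
);

-- ...and so on, one insert per query table for each parsed field of the hit
INSERT INTO hits_by_url (url, event_time, client_ip)
VALUES ('/index.html', now(), '10.1.2.3');

INSERT INTO hits_by_ip (client_ip, event_time, url)
VALUES ('10.1.2.3', now(), '/index.html');
```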
That was back when I first got started, on version 0.7, and it just worked. I had noticed this thing was producing tons of data and I couldn't do anything with it; the only other choice was to dump it into something like S3, where it doesn't mean anything, and I needed that data in milliseconds. So: Cassandra.

Now let's talk about getting that data in with speed, because this moves us into a different data modeling topic. If you're going to do a million writes per second, do the math: that's an insert every microsecond. That's fast. I don't know about you, but I don't know many applications that can deal with that on their own — you've got to work as a team. And if you're doing a write every microsecond, you can't put that on one single server; no single server is going to keep up. There are other problems too, like collisions. These are real problems: if you're doing time series data, collisions are an issue. I did a talk at ING — they're pretty public about the fact that they're doing financial transactions on Cassandra, and also very public about the collisions they get and how they manage them; they'll be doing a really cool talk at our summit. So this is a real problem to deal with.

So how do we manage that much data flying in at microsecond speed? The first thing is primary key placement — that is, node placement — and I'm going to teach you how that works, because I think you need to know it. The primary key is how data gets distributed, but it's also how you can do a lot of other things, and getting your data model right really depends on it. There's also partitioning: how your data spreads around; that's part of the story. And then we have a specialty data type called a timeuuid; a timeuuid is a great tool for avoiding collisions with time series data.

Let's talk about what happens with our weather stations — I've got a lot of weather stations to feed — and quickly go through the replication story of Cassandra and how it ties into your data model. That first part of the primary key I told you about is the important piece for figuring out which server the data belongs to. How do we do that? Say we have a really basic data model where the primary key is people's first names — which is a really poor primary key, but for our limited data model it'll work — and we have Jim, Carol, Johnny, and Suzy as our data points. What Cassandra does with that primary key is hash it, with MD5 or Murmur3. That's a consistent hashing operation: put in the same string, and you'll always get the same 128-bit number back. This is how a lot of technologies work — if you look at memcached from way back, that's how it distributed data in a reliable way: it would hash the key with MD5 and use that to place the data. So all of these keys get hashed.

Now, how does that work with Cassandra? This is what I mean by data locality: where is that data in relation to the physical layout of your servers, and how do you find it? We have a four-node cluster here — nodes A, B, C, and D — and each node stores a range of data, a range of tokens. We take that big 128-bit number space — which has a name I can never remember, septillion-something — and we break it into quarters: one quarter of the range on A, one quarter on B, one quarter on C, and one quarter on D.
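You can actually watch this hashing from CQL with the token() function. A quick illustrative sketch against a hypothetical users table keyed by first name, matching the example:

```sql
CREATE TABLE users (
    first_name text PRIMARY KEY,   -- a poor key, but fine for the demo
    last_name text
);

INSERT INTO users (first_name, last_name) VALUES ('Jim', 'Smith');

-- token() shows the hash Cassandra uses to place the row on a node
SELECT first_name, token(first_name) FROM users;
```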
Each one of those nodes says: I'm primarily responsible for one quarter of the range of that 128-bit number. Meaning that when one of these keys is hashed, the node checks whether the hash falls inside its range, and if it does, it owns it — no one else will take it. So we take Jim's number and, hey, it magically fits: 5E falls between 4 and 8, awesome, so now node C is responsible for Jim's data. Carol goes to node D, and Johnny gets placed the same way. Now, because MD5 is somewhat random — I once got grilled by a PhD in mathematics about this, and fine, it's not truly random, but it's pretty random — it's going to randomize the placement of that data around the ring. So as data gets inserted, it's being spread around: you get good distribution, with keys landing on different nodes.

Now, what about replication? Each one of those nodes can also say: because you're my neighbor, I'm going to hold your range too. That's what changing the replication factor means: each node becomes responsible for its own data, plus its neighbor's data — maybe two neighbors' data. That's a lot better than my neighborhood; my neighbor would never let me put anything in his garage, but Cassandra works more like a team. So for Carol, where the primary responsibility is D's, nodes A and B also say: we'll hold that data for you as well. Cassandra replicates that data asynchronously: the client does one write, and it's sent to all three nodes at effectively the same time. Now, the one missing part of this is consistency, which I'm not going to cover in this particular talk, but consistency is how the client knows whether the write got to all three nodes or just one. You pick: as a programmer, you make the determination on consistency.
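The replication factor being described is set per keyspace; a minimal sketch (the keyspace name is assumed):

```sql
-- Each node stores its own slice of the ring plus copies of its
-- neighbors' ranges, giving three replicas of every row.
CREATE KEYSPACE weather
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```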
And we just happen to have Mr. Eventual Consistency here — are you doing the next talk? "Yeah, eventually. You'll eventually be consistent." Christos is here from Netflix, and he's going to do a talk on eventual consistency, which I think will be great. Painful consistency, man — no, this is not painful, this is good stuff.

So that's how data gets replicated. Now, how does it fit in with the rest? Well, if I'm doing that million writes per second and I have a very small cluster — four nodes — that means a lot of writes per second per node, because each node holds a quarter of the data. So how does the linear scaling story work? If I double the number of nodes, each node is now only responsible for half of a quarter: an eighth. (I can do fraction math — I have an eleven-year-old, so I'm doing a lot of it these days.) So each one handles an eighth of the writes. If I double it again, they each only handle a sixteenth, and so on; at 500 nodes, each one handles one five-hundredth of the set. As you add more nodes, each is responsible for less and less and less. That's how it scales: with storage, and with writes per second. And you're also spreading it around with that replication factor of three: if I have a thousand-node cluster, a write still only gets replicated to three nodes, not to a hundred or something like that. That's how you keep those writes spread out.

So, the timeuuid — last topic here. A timeuuid is a timestamp down to the microsecond, which is cool — that gets us part of the way there — plus a UUID. The UUID is added at the end in case you get the exact same timestamp twice. If you're putting data into your database at that rate, you want to make sure those values are unique; that's what a timeuuid gives you. What's cool is that these are also known as version 1 UUIDs, which means they're somewhat of a standard, and you can get driver support for them in Python and Java. And they're sortable: if you present a bunch of timeuuids, they'll be in sorted order, and you can reverse the operation — when you hand one back, the driver can say, I know what that is, that's this date. So you can figure out what the timestamp is from the value alone. This really, really super-small link over here is a web page that generates them — which I used on February 12th at 6:18 p.m.
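CQL can generate and unpack these version 1 UUIDs natively; a quick sketch of the two relevant functions from the Cassandra 2.0 era:

```sql
-- now() generates a timeuuid for the current moment
SELECT now() FROM system.local;

-- dateOf() recovers the timestamp embedded in a timeuuid
SELECT dateOf(now()) FROM system.local;
```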
So here's my last example, where we use that timeuuid. Again, our data model starts with how the application queries. My needed queries: I want to get all trades for a symbol and date — ah, Christos, you're going to love this example, by the way; I want to get a trade for a single date and time; and I want to get the last ten trades. The last ten trades — that's kind of a tricky one, right? How do I get the last ten? So for my data model to support that: I'm going to store my data per symbol and date; I'm going to store it in time series in reverse order, last to first; and I'm going to make sure that every single transaction is one hundred percent unique, even if two land in the same microsecond.

So I'm going to start buying some Netflix — bought a lot, actually sold a lot too. Here's the interesting thing; let me point out a couple of details in this data model. This is stock ticker data — typical stock ticker data coming in at a rapid pace, so there could potentially be some collisions. What I've done in my primary key is a couple of things. I've put the symbol and date together as the first part of my primary key: that's the location, that's what gives me my locality. It means that for one day and one symbol, the data will always go to the same group of servers throughout the day; the next day, it moves to a different group of servers. Then the trade is a timeuuid: if two trades land at the exact same microsecond, they'll still be unique, and they'll be ordered. The last thing here is the CLUSTERING ORDER BY, which is a fancy way of saying: I'd like to store it on disk in reverse order by trade — that timeuuid. That's kind of an interesting plan.

So here's the order of what I did: I buy 2,000, buy 300, sell 450, and sell 3,000. That's the order I did them in. When I go to look for that data, notice it's reversed: the last thing that happened is at the top of the list, and the first thing that happened is at the bottom. I'm in reverse order based on the trade time — the timeuuid gets parsed and reversed (though just looking at the raw value I can't tell you what time it is; the driver can). So what does this look like on disk, now that I've told it to reverse-sort? This is where it gets really interesting, because since I told it to reverse the order as it merges, it actually does that on disk too. So whenever I go look for that data, I can say: hey, I want a limit of three — LIMIT 3. What I'm saying with LIMIT 3 is that I only want three things returned, but because I know it's reversed, those three items will be the last three trades that happened. Boy, is this data model used in a lot of places — fraud detection, user sentiment, identity tracking on websites; this is a really cool data model. When I say LIMIT 3, it goes to the beginning of the list — it doesn't have to scan all the way to the other end and come back — it reads from here to here and just brings those rows back. Very efficient. Now I have the last three trades that happened. I think that's a cool way to do things. Like I said, I've personally done a lot of these data models in production, using that reverse clustering order with a limit, to create really interesting interactions with time series — because with all those things flying in as fast as they are, you need to be able to do that.
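A reconstruction of the trades table and the LIMIT query as described — the composite partition key, the timeuuid clustering column, and the reverse clustering order; the names and values are my guesses at the slide content:

```sql
CREATE TABLE stock_ticks (
    symbol text,
    date text,
    trade timeuuid,
    trade_details text,
    -- (symbol, date) together form the partition key: one partition per
    -- symbol per day; trade orders rows within it, newest first
    PRIMARY KEY ((symbol, date), trade)
) WITH CLUSTERING ORDER BY (trade DESC);

INSERT INTO stock_ticks (symbol, date, trade, trade_details)
VALUES ('NFLX', '2014-04-03', now(), 'BUY:2000');

-- the last three trades, read straight off the front of the partition
SELECT trade, dateOf(trade), trade_details
FROM stock_ticks
WHERE symbol = 'NFLX' AND date = '2014-04-03'
LIMIT 3;
```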
[Audience: does storing in descending order slow the writes down?] Nope. Everything is merge-sorted in memory first — it's a memory operation — and the way Cassandra writes that file out is a sequential write. We're getting into the internals of operations here, but later, if there are two rows of the same data in different files, a compaction operation — a normal background process — will merge-sort those again in memory, and that merge-sort will order it properly. So it's not going to slow things down; the order is independent of how fast it ingests data. It's just a feature of how it works. More questions? Because I'm at the end of my slides.

[Audience: can you end up with imbalance?] There shouldn't be a lot of imbalance unless you specifically create it — with MD5 you're not really going to get any. But, say you're tracking stock trades and Netflix is a volatile stock: you could end up hitting the same partition, yes. In those cases, the answer is knowing your data model. If you know that's a potential case — here's a great example: if you're keying on URLs, like "give me this URL for this day," then all day long that row is going to be the hot spot. In cases like that, what I've done is put something like a modulus bucket in there, something extra to make it more unique, with a consistent way of managing it (see the sketch after this answer). That's knowing your data — and knowing what you just pointed out is pretty critical: now you know that design might not be so good, that you'd have a hot spot, writing to the same node — the same replica set — over and over. What we hope for with the stock trade example is what a company like SIRCA sees — they collect stock trades from around the world, two million a second. I doubt you're going to find a hot spot in that. Netflix's stock may be hot at one moment, but there are still a zillion other things going on, so it gets washed out in the noise. It may cause a little bit of a blip, but with that many writes per second going into your cluster, it's not really a big consideration. Now, if I were only looking at a select group of stocks, and I knew one would weigh far more than the others — okay, that's a different game, and something you'd have to consider for sure — but if you look at the big-picture distribution, it won't be a problem.

[Audience: another variation of that concern — you're going to end up with some wide rows, right? Is that a concern?] Not really. The wide row concept: how big can a partition be? Two billion cells. That sounds like a lot, but if I'm collecting weather station data every second, it's not. That's when you can get into trouble — if I just collected all trades for one symbol forever, then yeah, two billion gets eaten up pretty quickly, and that's when you want to think hard about what you're doing in your data model. The reason I added that date — let's go back to this — the reason I put a date in the partition key is to eliminate that problem: now I know every single day will be a new partition, and that bounds the size. Do I think I'm going to get two billion trades in a day? I'm going to say no. Without the date, this data model would eventually have a partition-size problem.
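Here is a hypothetical sketch of the bucketing trick mentioned above for hot partitions — everything here (table name, bucket count, columns) is invented for illustration:

```sql
-- Spread one hot logical key (a URL) across 10 physical partitions.
-- Writers pick bucket = some consistent hash of the event, modulo 10;
-- readers query all 10 buckets and merge.
CREATE TABLE url_hits (
    url text,
    bucket int,
    event_time timeuuid,
    client_ip text,
    PRIMARY KEY ((url, bucket), event_time)
);
```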
[Audience: but what about when the day changes, and you want the last trades with LIMIT?] What's great about the stock market is that it closes — you've got all that time before the new day's partition; but you're right, in general that's a different consideration. That data model I showed you with the web hits actually did have to be sustained around the clock, and what we did was put a little bit of logic in our system to grab from both partitions at those boundary moments. We did everything in UTC, and the odds of everything hitting right at midnight UTC were low enough that there wasn't much exposure for us — but absolutely, we just had enough code in there to deal with that situation, and it was pretty efficient; we didn't really have any problems with it. It's something to consider, though: you're going to be transitioning into a new partition, and if you're designing a data model like this, you still have to think through all the edge cases up front.

A lot of times, to get to know data modeling — I put Planet Cassandra up here (oh, there's my animation again; I'm so proud of it). Planet Cassandra has a lot of cool interviews, those five-minute use-case interviews, and also white papers and things like that, so I hope you won't have to reinvent a perfectly round wheel. People like me have done thousands of data models, and now I'm kind of like the horse whisperer: I just walk up, because I've internalized all the reasons we do things — and you'll get to that point too. I've worked with teams for a long time now; I work with Christos's team quite a bit, and nobody there really digs too deep anymore, because they've got all the points down and they just build the data models. It's like relational: when I first started doing SQL, I thought, what is this? This is crazy talk. But after a while I was doing it in my sleep. It's the same way here. What I don't have to do now, though, is think about how I'm going to scale this thing once it's in production, how I'm going to keep it online, or how big my Oracle bill is going to be. All right — we're out of time, it's 11 o'clock. Thank you very much.
Info
Channel: Data Council
Views: 60,661
Rating: 4.765625 out of 5
Keywords: Cassandra, C*, Data Modeling, Data Science, Data Engineering, Big Data, Software Engineering, Tech Talk, Software Development, apache cassandra
Id: tg6eIht-00M
Length: 42min 41sec (2561 seconds)
Published: Mon Apr 28 2014