RailsConf 2019 - Database Design for Beginners by David Copeland

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
(upbeat music) - Alright hello, thank you all for coming. I wanna start off this talk doing a Thought Experiment. So it's a Thought Experiment where you think through in your head to try to learn something about something and you usually play a hypothetical situation and it reveals information. So I want you to think about an application you work on that has a database and think of one that's very important to whoever it is that's paying you to work on it. So for example, for me, I've worked on a Warehouse Management System at Stitch Fix that managed our inventory. How much did we pay for it, who's got it, where is it, who do we sell to? Super, super important. So now think about what would happen if the source code of that application disappeared from the face of existence. Gone forever, never to be returned. No backups, no get repos hanging around anywhere, it's gone. How could you recover from that? So I think you could. I mean it would be painful, it would take a long time, it would be expensive but you kinda know how the app works, you work on it and you've got users, they kinda know how the app works. You could figure it out. The business people paying you, they know how they want it to work and you've got the database still there, you could piece it together. It would suck but you could do it. So now take that other way. Application is fine, the data and the database goes away for ever. No backups, no dumps sitting on S3 is just gone forever and ever and ever to be returned. How could you recover from that? So I think that is an Extinction Level Event, right, you would never be able to recover from that. Because the whole reason you have a database is to remember stuff that you can't remember, or to share stuff with other people who need to know it. And that's why it's there. So there is no other way to get that, you can't remember what the data looks like in repopulated, you would be dead. So what this Thought Experiment tells me is that the data our applications manage is actually more important than the source code that manages them. Which is interesting, because I don't know about you all, but I if I had to guess, we've all spent a lot more time learning about how to write code, then we have learning about modeling databases or managing data. I mean, I have two degrees in Computer Science and I didn't take a single Database Course there, they did not make me do any of that. So that's what I'm gonna try to start closing that gap on today. So my name is Dave Copeland, or @DaveTron5000 on Twitter. How I know anything about this, despite my lack of courses in college about it, was I was lucky enough to work with someone who understood Database Design, who understood the Theoretical Computer Science that like makes it a real thing. And he was very persnickety about having our databases model properly. And so he explained what that was, why that was important. 'Cause we worked on a database, it was very terribly modeled. And he gave me the language to say like, why it was so bad. And then I've since read books and papers about this to understand the Theoretical Computer Science that underpins what we're gonna talk about. And if I do a good job here, you won't have to worry about any of that, 'cause it is very hard to understand. So hopefully, we can sort of start to feel more confident about how to do that. Also, as a side note, while I have your attention, I'm the co-author of "Agile Web Development with Rails 6" which just came out in beta today, if you are looking to, if you're new to Rails, and you want a book to learn Rails, this is a very good one. This is a discount code for getting some money off. I will Tweet it later, so don't worry about that. Cool okay, so first, what is a Database? So a lot of things in life are databases, right? Redis is a database, your File Systems are database, Excel is a database. But these are not the kinds of databases we're talking about. We're talking about the kind of database that your application probably uses. And that's called an OLTP. That stands for Online Transaction Processing. And I had a whole bunch of slides to explain what that meant. But I'm just gonna cut to the end, because it's not super important. What it really means is the OLTP is the source of truth. It is the one place where the facts about the world that you care about go. So that database, I talked about, how much did we sell this item to someone for? That source of truth that is in this database of this Warehouse Management System. It gets copied everywhere, it gets cached for performance, it gets aggregated for Business Intelligence all over the place. But the source of truth is that database. So therefore, it stands to reason the Database Design is how we make sure that our database can be trusted to be the source of truth. So think about Excel, if we just put everything in Excel, well, you can put whatever you want in there. So how could you possibly trust that to be the source of truth? Because literally anything can go into it. So the design of the database in Excel does not really make it easy for us to trust it as a source of truth. So this then leads to the first thing that we wanna start thinking about is, what are the facts about the world that we want to record? Often were given a generic business problem to solve, or a wireframe to implement. And that's totally fine. But somewhere in there is an implication that we need to capture information and make it available to someone else to answer some questions. And so that's what where we wanna put in our database. So that's the facts that we wanna record, and we need to understand what those are. 'Cause as we'll see, the active database design is really about making sure you understand the requirements very clearly. And so the first step is to write down what those facts are. So we're gonna go through an example database that is gonna start out terrible and become better and better and more, we'll understand why it's getting better, 'cause it's very hard to understand the stuff without a running example. So I've chosen a domain that I think everyone is very intimately familiar with, which is, of course, professional wrestling. So we're gonna make an application about professional wrestling and store data about professional wrestling. So here are the facts that we wanna know about the world. So a wrestler might have a finishing move, right, that they used to complete a match, that they wanna win a match or they're finishing move, we want a database of that who's using what? A wrestler might wrestle on a particular show, Right, so The Rock might wrestle on Smackdown, Stone Cold Steve Austin wrestles on Raw, we need to know who's wrestling on what show? 'Cause not every wrestler wrestles on the same show. Now wrestler can also be cleared to wrestle on a particular match of a show. So we wanna have The Rock wrestling the main events, because that comes on last, we're gonna get a big ratings boost from having him there. So we wanna keep track of that in our database. But we might have Zack Ryder who's less popular wrestling the opening match, right? So that needs to be in our database as well. And lastly, on every show, in professional wrestling, they're fighting over a title usually. So on Raw, they fight over the Universal Championship. So we wanna keep that in our database as well. Now, you don't have to know anything about professional wrestling, to start thinking about how you might solve this problem, how you might model this data, build this application, right? You're good Rails and Ruby developers, you're gonna start looking at all these little nouns that show up, right? Nouns are the key to know what to do. We're going to have our wrestler table finishing move attribute a show table. Well, you can start to see it form in your mind, see an object model kinda form. So you know, you might do something like this. Okay, we've got our wrestlers table, our shows table will know what shows have what matches by joining them together. And then if we want to know what wrestler wrestles and what show what match, we'll have another table, this brings it all together. Now, if you haven't done a lot of data modeling at all, that probably seemed like incredibly fast and confusing. So from here on out, we're gonna go very very slow. For those of you that have done this before, you might feel like yeah, by gut, you know, we can just kind of do it. And I've done that many times, and you have a gut feel about how things should relate based on how you understand things. And this data model is not bad. It seems to meet our needs, but it's got some issues. So suppose that Zack Ryder wrestles on the opening match of the show Superstars. so we could store that in this model. Now, suppose that Superstars doesn't have an opening match? Well, that can also be stored in this model. So which is right, we have a data model that allows us to store two pieces of information that don't agree. So that's not good, right? We don't want to have that situation and that's the case, we can't rely on this to be the source of truth. So what that type of thing is, is called an Anomaly. So Anomalies is when the data model prevents the storage of certain facts, or allows ambiguity to exist, like we just saw, or requires deleting one fact in order to delete another fact. We'll see some more specific examples of these later. But the takeaway here is that a data model that has Anomalies can't necessarily be trusted to be the source of truth, because it allows you to store wrong information or not store right information. And so Normalization is the process by which we remove Anomalies from our designs. It sounds very formal and formality is kind of two sometimes. But it actually is gonna be very helpful, because what we're gonna see is a little bit of formality allows us to make very strong statements about the changes we're gonna make to our data model. And that therefore gives us confidence that we're improving the quality of our data model. So like I said, we're not gonna get into a ton of Math, but I am gonna use a few of the terms from this area of research mostly. So if you actually wanna know more about it, you know what the words are to start kinda googling and start your search towards the great depth of knowledge here. So Normalization is a process we're gonna learn about here. Normal Forms or Mathematical truths about the data model, that tell us what sort of Anomalies have been removed. So Normal Forms have an order lower to higher. So a lower Normal Form is open to more Anomalies, i.e it is a worst data model, i.e it cannot be trusted to be the source of truth. A higher more Normalized Data Model has fewer Anomalies, and therefore can more be trusted to be the source of truth. So Normalization is doing that. Promised is not gonna be a ton of Math. So let's get back to our domain here. So instead of trying to make some object graph and attributes and stuff like that, let's just throw everything into Excel and see where that gets us. So here's our data model, we got a column for everything that we wanna store. And we just put some rows in there to kind of keep track of this, right? The Rocks finishing move is The Rock Bottom, he wrestles on Raw and Smackdown. Raw is where the Universal Championship is defended and The Rock wrestles in the main event. Alright, so we can store some facts about the world in here. Now, probably nobody thinks this is a good database design. Like, but how do we know it's not a good database design? Yeah, we can say it looks gross, so it's ugly, or there's duplication, but we can make much more strong statements about it. And so we're gonna go through this process, right? So right now we have kind of whatever, we could just put anything we want in there, no rules to hold anything. That's why it's not a good design. So the first step is going to be to modify our table into what is called a Relation. I'll talk about that in a second. But a Relation is like the bare minimum of any kind of design of a database. And from there, you can start to make real improvements. And the way you make those improvements is actually pretty straightforward. And those improvements will lead us to what is called a Normalized Schema. And again, it will be in what's called Boyce-Codd Normal Form. Again, that's for googling, do not worry too much about that. The point is, it will have provably fewer Anomalies, and therefore provably better, and therefore more likely to be relied upon as a source of truth. Okay, so I talked about a Relation, what is that? It's gonna seem kinda, like not much. So Relation is a table that doesn't have any duplicate rows, or every field has a single value in it. And the fields of each column are all of the same type. This sounds kind of reductive, and not that interesting. But this is just the baseline, this is the bare minimum. And if you remember what our Excel spreadsheet looked like, you can already see it doesn't exactly meet these needs here. So before we see what our database looks like, as a Relation, I want to talk about Types 'cause I did mention that, right? They all have to be in the same Type. Now Type isn't like category theory, high school level specificity, right? This is a design. So you just need to kinda think about it a little bit. Think about all the names of wrestlers in the world. Well, think about all the strings in the world, there's more strings than there are wrestler names. So you can see the kind of just there. So something just really basic. I just want to point out the NULL is not a value. So therefore NULL cannot be in allowed value. So therefore no field in our database will allow NULL. Now we'll talk about what that means later, when we actually write code to deal with this data model. But until then, NULL is not allowed at all. And we won't need it, it's not gonna be a thing that we have to worry about, we won't need it. So we know what facts we want to record, we wanna get them in some form of some tables that we can call a Relation. So here's our Excel spreadsheet, we can see some problems, right? Universal Title, that's not the right type, it should be Universal Championship, so that's a problem. Two values there, right? We want single values not two, same deal there. So how can we resolve this? Well, we can fix the Universal Title, change that to Universal Championship. And then we can just duplicate every single row in here so that we don't have multiple rows, or multiple values, right? So same data, right? So we see The Rock is in there twice, because he's on two shows. So that's why there's two rows there. But this is a Relation. So the is no duplicate rows, every field has a single value. And everything has the same Type. You don't see The Rock in the finishing moves column, you don't see Smackdown in the Title column, it all kind of makes sense. And it's not good and It's not great. It's still open to Anomalies but it's at least a base that we can start to say definitive things about. And so this is what's called First Normal Form. Now, it is open to Anomalies, as we said. Kenny Omega is a wrestler, he doesn't work for WWE. But he does have a finishing move, we'd like to store that in our databases, our database should do that. But it can't because he's not on a show that defends a title or a match. If we were to remove The Rock and Kevin Owens, we would also lose the knowledge that Raw is where the Universal Championship is defended. So that's not good. Stone Cold Steve Austin changes its finisher to the million dollar dream, we have to remember to do it in two places, otherwise, it's confusing as to what his finisher is. So these are all the kind of Anomalies that our data model allows that we're going to now eliminate. But because we've actually gone through the process of taking whatever to a Relation, we can now actually use a well defined process to do it. That process requires that we capture the business rules in a certain way and it's very very lightly formalized, like very lightly formalized. We need to capture it in terms of two things that I'll talk about, Functional Dependencies and Keys. And so we can write the business rules down in terms of these two things, we can literally execute an algorithm to improve the design of our database. So let's talk about these two things separate we'll put them together, hopefully, it'll make sense. So Functional Dependency is when the value of one column unambiguously implies the value of another column based on the business rules. So what does implies mean? So looking at example, right? A wrestler has a finishing move, if I wanna know the finishing move of a wrestler, all I need to know is the wrestler. Therefore, wrestler implies finishing move. Same with Show and Title. So things that are not Functional Dependencies. A show can have more than one match, So therefore there's not a Functional Dependency there. Because if I have the name of a show, I can't necessarily get a single match, I might get zero, I might get one, I might get two, I might get five. So there's not a Functional Dependency there. Also know that Functional Dependencies don't necessarily go the other way. We saw in our example, data that a wrestler, multiple wrestlers can use the same finishing moves. So therefore, finishing move doesn't imply wrestler. If I have stunner, well, I don't know what wrestler uses that 'cause there's multiple wrestlers that could use that. And again, this depends on the business rules. What if we decided in our world of wrestling, no two wrestlers could use the same finishing move? Well, if that were true, then there would be a Functional Dependency. So you can see how this is about capturing the business rules. And so the way that you do this is, you write them down in this format and then you talk to your business partner, like is this is this correct? Is it true that if I know the wrestler, I can know the finishing move? And you can talk through and it's a way to capture what the actual requirements are. So I also wanna point out, there's a difference between the data itself satisfying the Functional Dependencies and the data model doing it. So this database, this data model is an Excel spreadsheet, more or less. And the data that we have right now does satisfy the Functional Dependencies. But that's not that important. The data model is what needs to sort of adhere to this, because this data model allows us to do this. Now the data does not satisfy the Functional Dependencies. So as a statement of our design, right, a design in which we can execute a query based on a wrestler and get an ambiguous result back about what they're finishing move is, is open to anomalies, because by the business rules, we should not get an ambiguous result if we query for a finishing move based on just the wrestlers name. So hold on to that for a second. Now, let's talk about Keys. And then we'll bring it all together and I hope it will all make sense. So you've probably heard of Keys before, Primary Keys, IDs, things like that, Foreign keys, right? So what we're talking about now, a Key is simply a set of columns that uniquely identifies a row. Said in another way, no more than one row and our database can have the same set of values for those Keys. There can be many Keys in a Relation based on whatever the business rules happens to be, the business rules to say, what forms of duplication are and are not allowed. Pointing out that all columns of a table do form a Key because remember, Relation is single, no duplicate rows. So therefore, all of the columns of that table are Key. But the business rules revealed the other Keys and that's what you have to do. So according to our business rules, a wrestler cannot wrestle the same match on the same show more than once. Even in wrestling, that would not make sense. So therefore, wrestler show and match is a Key. There should be no more than one row on our table where wrestlers show and match is duplicated. If I have a wrestler and a show and a match, I can get an ambiguous answer from our database. That's what a Key means. And so you have to find out what those Keys are. And you have to do that by asking questions of the people that want you to build this application to verify. Yes, that is true. No wrestler can wrestle the same match on the same show more than once. They will tell you that, and then you will then determine that that is a Key. Okay, so the data satisfies this Key. In terms of the data model, satisfying the Key, we simply just need to indicate what the Key is. So you can see the little Key icon there, we'll see how to make sure that gets enforced in the database a little bit later. But for the purposes of modeling, you just sort of need to state this is the Key. So now you got Functional Dependencies and you got Keys, now we can bring them together. And remember, we did this by understanding the business rules, looking at the data can help you see what these might be, but this is a way for you to verify that you understand the domain that you're actually building. You understand what you're trying to accomplish, what problems are you trying to solve? So what are the implications of these two things? So to get unambiguous information, we need a query that gives us zero or one rows. Therefore, we need a Key to do that. That's the definition of a Key. Now our Functional Dependency between wrestler finishing move tells us that we should be able to get unambiguous information about our wrestler's finishing move simply from just having the wrestler. But the wrestler is not a Key, the wrestler is part of the Key but it is not itself a Key. So that means that if any query that is just based on wrestler cannot be guaranteed to give us unambiguous results. It could give us ambiguous results and we saw that right? I have the wrestler and what is The Rock's finishing move will depends on which row we're looking at. So this is how we know our database is allowing us to store data that violates the business rules. The design of our database, because it is allowing us to write queries that return ambiguous results, when it shouldn't, does not satisfy the business rules. So how do we solve that? So this was all the hard part for you doing design, understanding and capturing and looking at the requirements. To fix this right, what do we need? Well, if we had a table and wrestler was the Key of that table, and that table also had the finishing move of that wrestler in it, then therefore, we could query that table to get an unambiguous result. Therefore, the Functional Dependency would be satisfied if that were the case. So let's make that table. Let's just create a table that has that and see what happens. So make a table that has the left side of our Functional Dependency, which is wrestler, that's going to be the Key of this table. We're gonna populate the other fields of this table with the right side of the Functional Dependency which is finishing move. Now because we have this, we don't actually need the finishing move column in the original table, so we'll just remove it. If we want to connect everything up together we can. So if we put our data back into these two tables now, it's starting to look a little bit better, and if we wanna connect everything right, the wrestlers names match on both tables. That's how we can join them together. And we could even join them together to create the original table as a view if we wanted to where we can recreate exactly what we had now. And if we repeat this process with show and title, which works exactly the same way in this case, that starts to look more like a data model that we might have come up with by our gut instinct. We've got, there's no duplication, it's looking pretty good. So let's kind of talk a little formally though, we can see where the keys are. Now if we put the Functional Dependencies on top, we can see that all of the Functional Dependencies are dependent on a Key. Meaning we can get unambiguous information out of our database, which we couldn't before, by definition, we can't have The Rock in the finishing moves table more than once and so therefore, if we wanna know that his finishing move, we query that table and will definitely tell us. That's cool. What we just did is underpinned by a ton of Theoretical Computer Science and Math. So this isn't just me showing you a thing you could do this is like a actual algorithm proven by Math to do what we just did. So we just turned our original Relation into what is called Boyce-Codd Normal Form. And what that means, is that our data model has no Anomalies based on our Functional Dependencies. Meaning based on our understanding of the business rules, our model does not allow us to put data in, that violates those business rules. And that is, not like a gut feeling, that is provable, like that is, you can't like say that's not true, that is 100% true. The Math proves that we did that. And now that we have the structure, all of that data that we wanted to insert or delete or change that we couldn't before, now we can. We can see Kenny Omega is in our finishing moves table, even though he doesn't work for WWE. We can store that fact now, we couldn't before. We can change The Rocks finisher to the People's Elbow and we won't worry about changing multiple places, like we can do that in a way that is safe and reliable. We can add Ricochet who has no finishing move but he wrestles on Raw. We couldn't store that before, now we can. So now our data model based on our understanding of the business rules can store everything we needed to store. So what we just did was we executed Heath's Theorem, again, is the thing you can Google if you wanna know more but the takeaway is that Heath's Theorem proves that what we just did, actually does remove Anomalies. It actually does make the data model better provably objectively, and that we didn't lose any data. So that was a very safe thing that we did, and there's no debating about it, we don't have to debate the elegance of our model or have gut feelings about it, it's proved that we made it better by doing that. So we wrote down the facts we wanted to record partly as understanding the business requirements, we made sure that they were in a Relation of some kind. In our case, we just put everything in one table but however many tables you have, they should be a least a Relation. And then we wrote down our understanding of the business requirements in terms of Functional Dependencies, and Keys. That's the hard part, that's all the thinking. Once you've done that, executing Heath's Theorem, you could write a Ruby program to do this, I can show you one that I wrote. It just looks at the data model, looks for places where the Functional Dependencies or not dependent on the key of the table there in, and redesigns a database to do that, it's mechanical, which is not my intuition when I was really first learning about this, it seemed like this fourth part was the thinking, what are my tables going to look like? But that's actually not it. It's more, what are the business rules? Okay, so this is kinda all theoretical, right? We didn't see a field called ID. I said the words Primary Key, but we didn't really talk about what that meant, and if you've done anything with Rails, you'd probably but, this is probably the only thing you've heard of, right? There's definitely ideas and that thing is called the Primary Key. Where is all that? And what is that doing? So let's talk about that. So a Primary Key is usually just the Key that we all agree to use most of the time. There's nothing particularly special about it. And in our case, wrestler show and match, that is the Primary Key, that's the main Key of that First Relation that we were looking at. That's how we get information out of that table. And you'll notice that that's based on the data itself. So that's called a Business Key or a Natural Key, 'cause it naturally arises from the data and that's why it's so important to identify because it captures the fact that you understand the business domain by understanding what data can and cannot be duplicated. People also understand that right? The users understand this stuff, 'cause it's capturing business rules. But you can imagine a problem, what if we wanted to change The Rocks name? Well, we're using his name as a way to cross reference all the tables together, so we have to change it in many different places, and even if they tell us, they're not gonna let wrestlers ever change their names, well, that's not gonna stick, they're eventually going to change their minds and their database is going to be really screwed up, it could be impossible to fix things if we're using the actual data itself, to cross reference in other tables. So to get around this, and this is what Rails does, we create what's called a Synthetic Key or a Surrogate Key. Surrogate it stands in for the Business Key, it stands in for the Natural Key. And the idea here is that the Synthetic or Surrogate Key is just a made up value and Rails is 123456, has no meaning at all. So we can use that to connect all of our tables and cross reference them, and then no matter what happens to the actual data, while our linkages between tables are safe, because those values have no meaning at all. So that's why Rails does this and it is a very good practice, you should always do this, there's no real reason not to, because these things happen, the data, the rules will change. We want our data model and how its implemented in the database to be as resilient to that change as possible. So let's add some Synthetic Keys to solve this problem. So we're gonna take wrestler out of finishing moves and replace that with the wrestler ID, will take show out of titles and replace that with show ID. We'll put the show ID and wrestler ID and our wrestler shows table, every wrestler has an ID, every show has an ID, so that should sort things out. And here's a good example of a table that has multiple Keys. So let's insert some data into our database. So this data satisfies all the Keys, satisfies all the Non-Functional Dependencies, but clearly something is wrong. I mean, The Rock shouldn't have multiple IDs. An ID should uniquely identify a wrestler. You could say an ID implies wrestler. So that's a Functional Dependency that we failed to identify. So wrestler ID implies wrestler, show ID implies show. So now that we understand these Functional Dependencies, we can see we have Functional Dependencies, they're dependent on part of a key, not the entire Key. So therefore, our database is flawed. This is why our database allowed that weird data to be inserted into it after we added these fields. Okay, the hard part is done. Now we know how to fix this, fixing this doesn't require nearly as much thought as it did to get to this part where we can fix it. We make two new tables that have as their Key, wrestler ID and show ID. We move wrestler and show into their respective tables, we remove them from the wrestler shows table and now we're back where we started. Free of all the Anomalies based on the Functional Dependencies that we understand. Provably, there's no way you can say otherwise, unless we missed a Functional Dependency. Now if we throw our other tables back in here to get a whole view of the data model as it stands now, this looks pretty reasonable. This looks like what we might have come up with it looks kind of squint similar to this thing. This is what we had first. But this we can make definitive statements about, no one can tell us that it's open to Anomalies, based on our understanding of the business rules. No one can tell us this is bad. We can't say exactly what this doesn't doesn't do. We can't say anything about this. And so that's kinda the power and taking a little time to write out what the business rules are, and use those to drive your database design. Okay, so we're still kinda in theoretical land, though. So how do you take this and actually use this in your Rails App, right? If you're used to doing bin/rails g migration, or Rails DB Migrate, what you're doing is manipulating what's called the Physical Model, the actual tables in the actual database. What we've been talking about is the Logical Model. And so everything we've talked about, your business partners, whoever you're working with, they should better to understand that model. They don't necessarily need to understand the Physical Model, but they should be able to understand the Logical one. So that's what we've omitted a bunch of specifics about the database. But now, let's talk about those 'cause it is important. So the same way, as you look at a wireframe as the design and the actual Rails views as the implementation, well, what we just did is the design of our database and so now, when you talk about the implementation, that's what's the with the Physical Database is. And so there's three concerns, you wanna make sure that your Physical Database enforces all of the Keys, not just the Primary Key that Rails creates for you, you wanted to enforce all the associations between the tables, and you wanna do some association or are some enforcement of the types of data that go into your database. So the first two are relatively straightforward. To enforce Keys, you make a unique index like this, so this basically says, there can be no two rows in the wrestlers table where the name is the same, this tells the database to refuse to allow that data in there. So this enforces that Key, you can enforce a Key across multiple fields like this , so this says that no row should have the same value for wrestler at show ID and match, and the database, again, will enforce this for you. Associations can be enforced using foreign key constraints. So for whatever reason, Rails doesn't do this by default and there's really no reason not to do this. So in context here, what this Foreign Key concept says is that if I insert into the show titles table of value or the show doesn't actually exist in the show's table, then the database will refuse to insert it, because that would be a dangling reference, we don't know what it means. And it also means that if I deleted a show that show titles I was referring to, it won't let me delete that show, because that will make our database inconsistent. So you always want this, you wanna rely on the database to get out things like this. And these are, these things are easy to do 'cause you've understood them from your model, you just need to kinda type them in in your Migration Files. Types are a little harder, and maybe impossible. Think about the allowed values for wrestler name. Well, there's kind of a lot, there could be an infinite number. But there's certainly fewer wrestler names than there are strings. So you know, something has to happen here. Now, we talked about NULL, and we'll talk about it again. But we definitely know that NULL is not the name of a wrestler. Maybe a wrestler could be named to the string NULL, but not NULL. So we can tell the database, don't let NULL be inserted. That's pretty easy. You should probably do that on most of your fields. But it gets a little bit more subtle, right? So these three strings, these are different strings, but they're all the same name, they all say The Rock essentially. So how do we wanna deal with that? What about this? Certainly no wrestlers name is like a bunch of spaces. Maybe it'll be a tag team called a bunch of spaces, but it wouldn't be a bunch of ASCII space characters. So this is certainly not the name of a wrestler. So you have to think through the trade offs here. How likely is it that the bad data is gonna get into the system in the first place? And if it does, how bad is it? If it's very likely, in very bad well, you need to do a lot to make sure that doesn't happen. But if it's not likely, or maybe not a big deal, maybe you don't need to over complicate your life. So as an example of things you can do, so this tells Postgres to make a case insensitive, unique index, so we inserted The Rock, and then we later inserted The Rock all capitals, this will prevent that from happening. We could also use a thing called Check Constraints that's very, very powerful. So this tells Postgres, before you insert a wrestlers name, trim the white space from either end and if the result is the empty string, then do not allow that to be inserted. So this would handle are a wrestler whose name is like eight spaces in a row. This you can go deep on this if you need to. Rails validation is, of course, you can do a similar thing, this is a little easier to understand, it's way more powerful, but of course, Rails has to enforce this. You have to tell Rails to enforce this, you can circumvent this easily so this won't actually prevent it and you know, uniqueness doesn't exactly work, but again, it's a trade off, how likely is it gonna happen? Now, what about NULL? So NULL it's still not a value, right? That's the way we did by definition. But when we write Ruby or Rails code, Nil is value, Nil a thing that shows up like we kind of have to deal with it. There's no easy way to to avoid it, so let's see kind of how that plays. So if we were to continue making our Physical Database, match exactly our Logical Model, we make this finishing moves table that has the name of the move, NULL false, and it references a wrestler. Let's find that work. Now remember Ricochet, who has no finishing move? So he just wouldn't have an entry in this table at all but what would it look like in our code? We can't do this because if wrestler is Ricochet then finishing move is NULL, we gonna know pointer exception. So we can use this lonely operator thing, you're gonna litter your code with all, that kinda sucks. So maybe you wrap it up and some methods to kinda save yourself from that. I mean, there's really no way to avoid the Nil here, it just there's no way to real way around it. And I think that matter. That matters to how you're actually creating the Physical Database. So if we don't need to know anything else about a finishing move other than his name, maybe we'll just put it on the wrestlers table and not make a separate table, and will allow it to be NULL. And will be nice and explain what NULL means even though it might seem obvious. That's a trade off that you might wanna make. This will make the code a lot easier, it's really not making a huge difference to have a database as implemented, and this is the difference between Logical and Physical. So, general guidance, this kind of summarizes what we just talked about. Create a unique index for all the Business keys, you've gone through the trouble of identifying them to figure out your data model, you should have the database, enforce them. Always use Foreign Key Constraints. I do not know why Rails doesn't do this by default, you should totally do this by default, there's really no reason not to and every reason to. Default to NOT-NULL, NULL is not a value, so you should default to it. Understanding that you might need to make a trade off or you might allow it, like we just saw, and if you do that, do it knowingly, do it with a good reason and write that reason out in the comments. And lastly, when you're thinking about Types, right, you have to do that trade off. And if you're using a powerful database like Postgres, you have a ton of tools available to really enforce at the data level, what's the allowed values can be. So if that's super important, then do that. If it's not important that's fine, don't do that. Don't overcomplicate your life. So in conclusion, I hope all this made sense, I really did. This took a lot of reading to figure out how to explain, so I really hope it's sunk in even a little bit, and not to make it even worse, but of course, Boyce-Codd Normal Form that we talked about, it's not perfect, your data model could still be open to other Anomalies that are very complicated to explain, but nevertheless exists. But Boyce-Codd Normal Form, what we just did, if you do what we just did always and your data models are always like that, you're in really, really good shape. Like that's excellent. So that's a good thing to aim for. When you're editing the model, when you have to make changes to the database, change the model and see what that change implies to the implementation. You saw what happened when we added those ID fields, it wasn't obvious at first, that we like actually created a real problem in our database that we were able to get rid of, by sticking around in the model before we got into it. And again, I've said the word business rules, the phrase business rules about 1000 times, but that's what this is about. This is about you understanding the business rules, you understanding what problem you're trying to solve and how you're solving it. And only when you have that understanding, start modeling your data. Because remember the Thought experiment, right? The data is more important your app in almost every case, so treat it that way. So that is what I got for you, thank you. (members clapping) Should do Q&A? We got a couple minutes for Q&A if anyone has any? - [Man] (mumbles). - Are those situations where you don't wanna do Boyce-Codd Normal Form. So making a data model to be the source of truth, I would say probably not a way you might not wanna do that as if you have a separate database that you've sort of, what they called the Denormalized for quick access. So you might copy the data into a form that doesn't require so much joining, but you always have that data somewhere that is kind of fully normalized. So I have a Blog post and many opinions on this, and I'll try to sum it up. My opinion is that Rails validation are for helping users not screw up, and Database Constraints are for keeping the data clean. So sometimes you use one and not the other, sometimes you use both. But I definitely don't make them mirror each other 'cause that doesn't seem like it makes sense again to me. But I also know there are some gems, you can get that kind of try to keep parody there. Some people do actually like Rails to mirror the Database Constraints. So it's a little bit of personal style. But for me, I don't know why I am making a validation? If I have the Constraints, and I'm not gonna Rail fact I'll Rail but, anyway, that's a very quick summary of my strong opinions on that. What resources to dig a little deeper? So the Wikipedia articles on this stuff are not bad. I read a book, I'm gonna have to dig up the name of it now. But it basically explains this in one level more but I think the Wikipedia, the Wikipedia articles are actually pretty good. And they walk you through this stuff with a maybe a better example than professional wrestling or worst example, depending on your thoughts. But yeah, that's a good place to go. And those will have links to some of the literature and other books on it from there. Okay, I don't see any hands. So, all right. Thank you all. (members clapping) (upbeat music)
Info
Channel: Confreaks
Views: 10,563
Rating: 4.9078946 out of 5
Keywords:
Id: 1VsSXRPEBo0
Channel Id: undefined
Length: 39min 25sec (2365 seconds)
Published: Sun May 26 2019
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.