Data Modeling - Why Data Engineers Need To Understand It - An Introduction To Data Engineering

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
there's a lot being said everywhere about the importance of data modeling data modeling is a crucial skill both for software and data Engineers the problem is most of the time we kind of can be a little forced to move faster than we should and often we build bad data models or maybe never even take the time to learn why we have various patterns and methods in place to try to develop things that capture what the business is doing so in this video we're going to start talking about data modeling so you can start understanding what all these terms mean and also how you can be better at it honestly this whole uh series was inspired by two very scary data models that I've seen both on the oltp and olap side recently for two of my clients um and yeah it scared me so much that I feel like I have to make a video about it now if you're new to this channel hey there guys my name is Ben Rogan uh also known as aka the seatt data guy I have been working in data for nearly 10 years uh as a data engineer data analyst now as a data consultant helping companies set up their data infrastructure which often involves at some point a data model and again recently in the past few weeks uh on the olp side I some I saw something that looked a little like this where course uncore ID 1 id2 ID3 until like 35 and another one where they on the analytical side uh where someone had uh one essentially customer field for every customer great this is a customer table right no this is everything this is one big table done to the extreme uh where all the transactions are literally stored in a single column and you're scaring me people so what I want to do is go through a video of what data modeling is some of the basics and hopefully go over some design patterns and why it's important to actually understand what we're doing here I think many of us if you took a database course um did a little bit of some level of data modeling for a relational database but it feels like much of that can sometimes be given up because we're trying to develop products faster we're being agile we're putting out MVPs and we're rarely stopping to think about what this is actually supposed to represent um and as Joe re recently said um during uh the conference we had once you've normalized your data you can now denormalize it right an example is a Kimble star schema in order to denormalize uh you need to have normalized first so let's dive into it all right so in order to understand data modeling I think it's important to understand what that even means now if you ask Tech Target what they're going to say data modeling is is that it is the process of creating a simplified diagram of a software system and and its data elements uh that it contains using text symbols uh to represent the data and how it flows essentially acting as the blueprint uh for a new database or re-engineering of a legacy system or if you ask J ree uh he basically put it as um organizing and standardizing data to facilitate believable and useful information uh and knowled know for humans and machines and he added the end ad machines part recently because of things like chat GPT so basically the goal and the way I look at is data modeling is meant to take all of the data that you're trying to collect uh about your business and serve it in a usable way into systems and then also from there eventually into you know analytics right like it needs to represent what's happening in the business it is a derivative of what is occurring in the business that is what you're trying to model in these data models the data itself is actually the stuff that flows into them but that shell that outer thing you're trying to model is all of that stuff relationships actions things that occur in your business need to be tracked and that's what you're trying to do if you have a business that does transactions you're trying to capture those transactions understand who created those transactions who was involved in those transactions what product was purchased from that transaction was it a certain location you know there are layer upon layers of what you're trying to understand and the key is you're trying to conceptually grasp all that and it's very AR arly difficult to do so that's why generally when you look at the standard approach to data modeling there is three layers to it which is conceptual data modeling logical data modeling and physical data modeling but most of us just jump right into the physical we start writing queries we're like we already know what we're doing you know we we read a book on Kimble we know what we're doing but it goes so much deeper than that data modeling does not start data Engineers would like to think at the data warehousing uh step right data modeling happens at the business and then goes into the system the operational system and eventually some at some point ends up in the data warehouse so what you're trying to do again is break it down into those three uh layers of data modeling conceptual logical physical starting from the operational system trying to understand the concepts that exist in your data model so let's go over that so conceptual data modeling that's what we'll talk about first this is where you're really trying to focus on abstractly understand understanding what is going on in the business you know you have customers those customers purchase a product those products exist in shelves that are at stores those stores are often you know fed by Distributors you've got all of these big subject areas and big entities that exist and that's generally why you'll often see people use some sort of ER diagram or n relationship diagram to be part of this conceptual phase and this is where you know you can kind of just keep filling it out you're just constantly adding more more color to this uh model in the first part you're really just trying to make it something that if you hand it to a business person they understand what they're looking at right you're trying to make it so that it looks like something that they understand um there are arguments in terms of like how much information should be covered in your ER or uml type design but I think the key point is that the business needs to be involved and they need to understand what's going on so if you've got something that no one understands what is going on you know maybe you need to adjust your conceptual model it's just too you know too specific it's got too much you know it's got foreign keys in there primary keys in there and it's like let's get rid of attributes and let's just show the entities and what's happening in between them right like okay again customer buys a product um that product Etc sits on a shelf you can even write it to that degree and we'll give a few examples here of different conceptual diagrams now you get to the logical step and this is what you might call like the previous was what this is the how right this is how the system should be uh built not necessarily taking consideration your you know database management system that you're going to be using you should be able to just have this high level understanding of what it's going to look like it'll have the different relationships outlined you'll have more flavor kind of started to add in there like primary Keys foreign keys are there some self referencing uh moments where uh an entity needs to reference itself but in general again at this phase you're not adding in things like ins and floats and really trying to add in and the the code side of things just yet you don't know the physical side you don't know where it's going to live you don't know what it's going to look like in terms of data types you haven't written code yet so you're just kind of defining this next step and then the physical side this is where we all like to dive in you're starting to Define what different data types are you're starting to Define how it's going to exist in reality what's it going to be built in and and that's kind of this final step before you really start putting it into practice now as you're doing that you're going to realize you're going to take certain steps part of the reason that you go through this process let's give you some whys here is because you're going to one figure out how far you want to abstract things right when you create a table do you have a table for employee and another table for customer or do you actually abstract it further to person and that's you know it's just person and that's what it is right like what is the way you're going to abstract things how far how far is too far right are you just going to have a things table is there just going to be a table for things um or people or or some very vast generalization that's going to be very difficult eventually to work with right like what is the level of essentially generalization that is too far an example of often generalization that sometimes is required is the custom field uh generalization that you'll often see which is you know maybe you have an entity um that exists and on top of that you want customers people add in custom fields to that entity now instead of making them have to impact the database and add you know constant um new Fields uh to a table somehow in your process generally what a lot of people will do and a lot of the designs I've seen is you will have another table and in that table you will have essentially the custom field so you'll have entity ID uh that exists uh the custom field essentially ID especially if you're building that into a system where it's like that custom field is going to exist for all of those similar entities and then likely what that is right like and if you want really want to add in more into it you're going to have a center table in there that acts is kind of the relationship table where you've got uh entity ID custom field ID and then that essentially will act as um the gap between that and then the actual table and I'll put up some diagrams here um and some notes and so this is an example of like okay how when do you need to generalize and when do you need not to you're trying to capture some of information that is very custom and if you're using a relational database you're likely going to do this obviously if you're using something like mongodb or a different database it's going to act different but again different uh solution for different uh databases now that we've gone over those basic concepts um let's talk about another important concept normalization and denormalization these aren't just terms that you get asked about during an interview like what's the difference why would you do one or the other these are very important things to understand why they exist and what you know purpose they serve yes normalization exists to reduce duplication of data and increase the data Integrity in relational database models that's the goal we often use it because in transaction systems they tend to provide more performant results as well as minimize unexpected results in anomalies because you don't have data that is duplicated across multip tables which then requires you to update multiple locations anytime anything is done so when you hear the term normalization you'll often also hear things like first second and third normal form as long as well as with a few other normal forms but uh generally at the very least you'll hear to third normal form and basically each of these are steps that are used to reduce often duplication or risk of something strange happening generally let's first break down these various normal forms and kind of what you'll see characteristic wise so when you hear first normal form that basically refers to the fact that a single cell shouldn't have more than one value in it uh I've had this happen I remember I worked at a company where they had like project ID and they had multiple project IDs that existed in one cell and it was it was a pain to work with two there essentially there must be a primary key that will act as identification uh and you try to remove the need for duplicate row or columns now when you hear a second normal form um the first assumption is that things are in first normal form you've already normalized to that level the second goal is that there's essentially either no nonprime or no dependency of the values in those tables that really shouldn't exist I mean that's the simplest way to put it if you read this in a lot of places what they'll say is that there'll be no partial essentially a dependency on primary keys that is to say that if there's values in that table they shouldn't just partially be like part of this thing an example of this is somewhere where maybe you're trying to capture essentially uh an employee that works in multiple locations right like maybe you've got some sort of business that has multiple locations so you've got employee ID you've got location ID and then maybe the employees age in that table now you've got a strange uh dependency where yes sort of this EMP employee worked at that location and that is their age but age should really be pulled out into a separate entity probably just around for employee and then another table should be created for essentially the employee and the location that they're working at at on different days and for now we're going to kind of stop at third normal form you know we're going to talk about it right now uh but for third normal form basically the goal is you basically want to uh one again be in second normal form and then two the goal is that there shouldn't be essentially some sort of trans postive dependency for attributes that are not prime or essentially ones that aren't really related to key thing you're trying to focus on in this table the example you'll see everywhere I think again it's just the easiest to understand um is location data so maybe you might have information such as you know an employee uh ID some information about who they are which all depends on that employee ID great but now you might have things like zip code State and other information that's also added on there well the all that information information right can maybe be summed up by one of those components right maybe you know everything you need to know just based off the ZIP code right like maybe that's efficient enough to then know the state you're in to know the city you're in um and so that's can be put into a different table where it's more like addresses and why this is important is now you're removing the need for redundancy of the same information right like okay we're going to have the same we know that you know whatever that zip code is exists somewhere and we're very aware of which state it's in and we don't need to have it repeated over and over again you know one of the reasons that normalization exists is that it helps save space and so although this might seem like a silly thing like it's like well we could just put it in there anyways it doesn't matter you know how often do we update this information uh in this case you know when you're talking about location data what you might actually be trying to do is just not so much save in terms of like updates and weird anomalies but maybe what you're trying to do uh is avoid taking up extra space with that information right cuz that's going to take up a decent amount of space uh if every uh employee or customer has the same same data over and over again uh for their location of where they live so that's just an example of why people would do it now when it comes to speed right what is faster to join across two tables uh to pull in that information or just to get that ID and now you have all the information with you and that's why you'll often hear on the other side of this the concept of denormalization and that's why in particular you'll often hear this uh in data warehousing because we want to have some level of D normalization because those queries tend to run slightly faster and obviously again the trade-offs in these cases is we're going to increase redundancy which can lead to weird anomalies and increase your overall storage costs but it can make a lot of sense when your data systems are very read heavy whereas in general you know when you have operational systems they are probably heavy on both both read and write and updates and everything more than likely uh for your data warehouse the main thing is to write once and then read many times right like that's generally the thought or you're going to write you know maybe once every so often unless you've got real- time data um it just isn't as focused on writing and inserting new data really a lot of this is done by and those performance trade-offs that you gain are done by reducing the need for joints so what often happens in a normalized model is you'll have this relationship that in order to get to a field again like address it's going to be several Fields away right for some reason you've normalize it to the point where it's like okay we've got employee ID so we know with who the employee is okay now we've got like maybe their zip code and then can we go from there to the next level and then maybe for some reason they've even gone one level deeper uh right like maybe they created a state table so in order to even get like the state that this exists in you've gone three or four tables deep just to get to that information and that doesn't make sense as much when you want to read a lot and don't want to do all these joints so that's one of the main reasons you'll see denormalization occur is that you're trying to make reads easier it is the reversing of normalization it is something that can only be done if you have once at some point normalized your data so that's why when you see a data warehouse or even things like one big table these are heavily or at least slightly denormalized systems uh one big table to the nth degree right is is a denormalized table or denormalized way of showing your data um rather than having again a customer table that's separate all of your customer information is just in one one giant row as well as the transaction so every time you have a transaction you have all of the customer data as well which is kind kind of nice to a degree because you don't have to join it to anything but in other degrees right like if that customer moves or something needs to be updated uh in the past you need to Now update 30 different rows because there's 30 different transactions whereas in a normalized system you know that update occurs once in one location likely and so that's why you have this tradeoff and now this was a really long video I really just wanted to hammer the point of conceptual logical physical modeling and I also wanted to really discussed kind of normalization and we've covered a lot of those Basics now we can start looking at actual data models and how to do it right I didn't want to run into let's just go into physical modeling because we have no idea what we're doing let's really talk about these high level Concepts that then we can use to build out the next few layers and we'll cover those in the next few videos so first off thanks so much for watching this video I hope it was helpful if you're out there wondering how to data model um I'll also talk about some books that you can use but for now I'm going to sign off and say thanks so much for watching guys and see yall and [Music] goodbye
Info
Channel: Seattle Data Guy
Views: 31,464
Rating: undefined out of 5
Keywords: data engineering, data engineering skills, denormalization vs normalization, how to data model, data modeling basics, why data modeling is important, conceptual vs logical model, Data engineering must learn skills, data engineering skills 2024, Best skills for data engineers to know, seattle data guy, ben rogojan, analytics, data analytics skills, data engineering basics, should i become a data engineer
Id: nof9jYIhv54
Channel Id: undefined
Length: 17min 38sec (1058 seconds)
Published: Wed Nov 29 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.