Data Modeling Challenges - The Issues Data Engineers & Architects Face When Implementing Data Models

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey there guys welcome back to another video with me Ben rotron aka the Seattle data guy if you've been watching this channel for a while you'll notice that well my background has changed we'll make it uh nice and spiffy later but I recently just bought a house so I've been dealing with that and we've obviously had to move Studios that being said today I wanted to talk about data modeling and specifically around you know the actual way we end up storing data right like how do we actually put it in tables and why do we do it really what I wanted to focus on this video is talk about some of the challenges we face when data modeling so generally when people talk about deed modeling the traditional way we often start is to focus on something like the snowflake or Star schema if you haven't seen it here's like a few examples and you'll notice uh they often kind of follow the exact same shape where it's like one table in the middle maybe four or five tables on the outside and it kind of looks like a star generally with like sales or orders in the center now the problem with this data model is it's similar to the way that when you're doing data science projects you often work on like the Titanic or Iris data sets these data sets are very clean and always work really well right like these are just data sets that always work and if you've worked any amount of time in the data science world you know very much so that that is just not the way the world Works your data is not clean your data is messy even if it's been modeled nicely but most of the time that's not the case so before going into like the nitty-gritty of like an example what I really want to do is talk about why it's challenging with real life examples of what makes data modeling challenging at various companies so let's dive into it first the probably maybe most obvious is that not everyone is having to data model the same way for a very long time I think a lot of people were forced to essentially data model in a very similar fashion where you often took a star schema snowflake schema yes there was some Kimball and imble Kimble and Nimble yes there are some Kimball and Inman arguments that went back and forth in terms of how you should date a model yeah it's interesting it's a bit of a conflated topic really um I I think I'll tell you what it's not right so I think all too often what we're trying to do is um you know say that data modeling is like Campbell or Inman or these kinds of things right I think that that's that's certainly a good start um but I can't say that that's right now I think we're at an interesting point in our industry we're conflating tactics with practices right and so um and sort of missing the bigger picture now with the cloud various ways that people end up storing data whether that's you know completely separating storage and compute using systems that only allow you to basically insert and not really update all of these things impact how we date a model for example when I worked at Facebook most of us might have heard the term slowly changing interventions and I've used a few pictures you can use some of these the goal of these slowly changing Dimensions is to capture information about change that way you can answer certain business questions which we'll get to in a second of why business questions are important but that way you can answer certain business questions you know if you only track the current state of let's say employees you can't tell how often people are promoted and at Facebook because of the system we were using it was similar to hdfs we really couldn't run updates and it was just made more sense for us to insert new data so this led to the data model where most of the time we would just have a brand new set of data every day you'd have a different essentially date partition for every day so you kind of have this image where again each kind of thing you can almost think of it as a file but it's all one table right all separated by the date now this means when you actually write a query I'm going to put it up here you have to both join on the ID as well as the date or another very common approach to how we would query this data is we would use a macro that would essentially select the law largest date for that partition and that's essentially what we would pick so instead of writing like select Max from this table you just put in the state partition and it did it for you but yeah we use the term dim in fact and some people use dnf whichever one you prefer all the time but they really weren't the dims and facts that you would be accustomed to and in fact in terms of data modeling I'd say there was wasn't that much in terms of other companies that I've worked at involved for many many reasons one of the reasons is this next problem that you will face at every company is integration now Facebook integration was easy because most of the systems uh joined very easily together because they were developed to talk together at the application Level so it worked really well in fact we had the opposite problem where we were constantly having to remove IDs so people wouldn't be confused later down the line if there were multiple IDs that could kind of do the same thing so for us we had to remove IDs now a much more common reality is let's say you have this common model whatever you've got the center it's got order you've got Dimensions like customer you know product and so forth the problem is this pretty picture almost makes you assume that oh customer must probably just come from One Source right maybe same thing with products but that's not at all how any of this works I wish it were but generally for a lot of companies you have multiple systems often maybe you've got like uh European uh arm of your company maybe you've got a us one maybe Asia they may or may not talk to each other you might be using different erps for all of them and yet when you want to do reporting you want to report on all of it and maybe you even have multiple instances of the same customer and again you're going to want to conjoin that over and make sure that you're looking at the same customer across the UK Europe Etc and you don't have multiple instances so this causes a major issue with integration and to be clear that's not even including when you have two different systems that you want to talk to each other and you have to deal with that issue the problem is you often again have multiple systems that you now want to join the data to together of into Maybe One customer table in some cases maybe you want to Union it together in other cases maybe you want to actually literally join it together so it's one row so you know customer from system a b and c come together and become one customer but how you know this this becomes the challenge and then also which ID is then the master ID you have a few options here of course one of the options is you just pick a Master System like let's say Salesforce that then populates every other system either automatically or manually and that way it auto populates and you don't have to deal with that whole problem later on as the data engineer this is why a lot of people reference like process as part of the way that you have to prove as part of the way you improve your data strategy you need to have processes in place to make data modeling easy because the other option is to create systems that then centralize and create a single ID but then track essentially this customer throughout the process both have pros and cons um right the creating a central ID can be very complicated if you don't create that Central ID based off of some way of being able to reverse engineer it uh there is some risk that you could lose that ID throughout the process which then could force you to backfill all of your data at some point in the future and the other way which is you know manually or automatically getting people to integrate it at the application Level requires a lot of coordination it requires a lot of time and budget often requires an expensive solution like soligo or some other iPass type solution to integrate all this data across your various Solutions so integration is a major problem and this is a little more at the granular level so I think a lot of people are interested in the granular side of things because again we can cover the high level data model but really that's kind of not easy it's very challenging but once you get into the Nitty Gritty that's where all the problems really start and that gets me to the last point that I really wanted to make here it's that I think some people assume that data modeling is a quick thing that you do right maybe it's because you know in interviews we give you an hour to answer how to develop a data model or 30 minutes like I I remember getting asked to do a data model on how to design a data warehouse for basically a college then you want to track like grades and which classes people went to and uh who which teachers had like the high satisfaction and you know it's pretty easy to do these are toy problems that are very arguably simple to come up with an idea pretty quickly of how you could do this but when you look at the traditional method of data modeling especially Enterprise data warehouses you'll notice that it often took months if not sometimes a year or more to really get to a point where you felt comfortable with your data model the point and the reason we use a lot of these toy examples is just to explain the different components the hard part of doing data modeling is getting all the requirements that from the business and figuring out what questions they want to answer what questions they can want to answer in the future and creating a model that can then adjust easily and quickly to these requests and when we did interviews at Facebook for data modeling we would want you to ask questions we would want you to figure out what the requirements were there in the interview because there was no way you could figure out the data model for what we were asking at the point I think I got asked something about an airplane um system in terms of like airplanes coming in coming out and so on it was either that or possibly was like an Uber app but I don't remember which one the point is we created it in such a way where you have to ask questions the hard thing I think with creating maybe even a video and I'm going to get a few hopefully other experts onto this whole concept here but the problem with making a video on the subject is that it's not just about how to data model in sense of like oh here's a box and here are the other boxes it's also what is the process of actually going from you know an idea and getting all the source systems together Gathering requirements and going through that whole flow and actually creating a model that then supports that and can adjust easily that is the actual hard part and it is the part that you will learn the most at this is why I think with data modeling the best way to learn it is to go work for a company where you have the opportunity to do it or or where you can see someone else's data model honestly at a lot of larger big tech companies sometimes the data model you're putting into place won't be anywhere close to some of these traditional data models and nowadays people are often pushing for these other Concepts you know I've heard activity schema one big table and there's so many ideas floating around that I think it's really hard to understand why people make decisions and which ones they're picking and so many people are so adamant that they have the right idea and how they should model then I think it can be a little bit hard so I think the biggest thing that anyone can do and I've always told people this right and I've always supported this is like yes read something like Kimball and that's a great start take some time to practice modeling but the biggest thing that I think is important is you asking yourself why am I making these decisions in terms of modeling my data the way that I am but through time that's been lost I think we're so focused on the implementation details and the physical and we'll get into this in a bit too but you know there's conceptual logical physical modeling and so forth and we wouldn't really I think starting and ending with the physical the modeling and ignoring the the higher level stuff which I think is actually the more important stuff to talk about and and do but yeah so yeah oftentimes when I am forced to design some sort of data model those are the times that you really start to figure out okay why am I actually you know turning this into Dimension do we actually need a slowly changing Dimension here or am I just putting it there because that's something that I read in a book once and all of that just comes through practice and working with people that are arguably more experienced than you and yes obviously I hope I can make a video where we talk about this more without actually I think going through the stress and all of the um deep thinking that is required to model data well I feel like it's always hard to convey everything that goes into a good data model that being said I am trying to convince Zach Wilson and maybe for Lucky Joe Reese to jump onto a live where we can talk about data modeling because I really want to get a ton of people's expertise on the matter because there's just so many different ways people have model data out there again depending on whether you work for a big Enterprise versus working for big Tech you're going to see all sorts of ways that people set up their data with it guys I just want to say for watching this video I really appreciate it hopefully I will set up my studio here in the next few weeks but if not you'll keep seeing this very white background thanks so much for everyone joining this video and I will see you in the next one thanks and goodbye foreign [Music]
Info
Channel: Seattle Data Guy
Views: 24,664
Rating: undefined out of 5
Keywords: data engineering, data engineering skills, how to become a data engineer, data modeling, star schema, how to model data, data engineering roadmap, data science vs data engineer, should you become a data engineer, what is a star schema
Id: G3m94JeuAtc
Channel Id: undefined
Length: 12min 51sec (771 seconds)
Published: Tue Jul 18 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.