Building an Enterprise Data Warehouse

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
how's it going everybody now this because gonna be about building a data warehouse and we'll cover the technical details and the theory but I'm gonna do something a little different you see I'm gonna tell you the story of an actual implementation but not just any implementation this was the most challenging project of my career though this is a project where I served as both project manager and lead architect so by the end of this video you'll have a good idea of what it takes to build a data warehouse so let's go [Music] [Music] now let's start off by giving you a little bit of background about the company I work for now the main goal of the organization is to achieve growth through acquisition which basically just means they buy other companies or brands and look to run them and so I don't work for these individual brands directly but rather through an operating company so each time a new brain is acquired the operating companies job would be to integrate the systems and run the day-to-day business now let's take a closer look now as you might expect my job is to oversee the BI data warehouse environment for all the brands and when a new brand is acquired it will be my job to analyze and consolidate the new brands environment into our existing environment and the set of work is rewarding but also challenging as you have to consider data applications and not to mention the users that depend on these systems to do their jobs and it's particularly challenging when it comes to BI because be I typically touches every department in an organization and a lot of times you'll have many different data sources to consider and more times than not an organization will have more than one BI or reporting solution and it may not necessarily be something you're familiar with so with this type of work you're constantly getting exposed to different technology new and old now it's also important to note that certain acquisitions can come with hard deadlines and failure to meet these deadlines can cost the company big money so you want to make sure your project is wrapped up on time and that was the case with this project C we just acquired a new brand however this brand was owned by another company and typically in this case there'll be an agreement in place that states that you have a certain amount of time to transition the company out of their environment and onto your environment I guess you could say this was the first challenge time now this was a large company and chances are if you live in the US you've heard of them but C they didn't have a true enterprise data warehouse or bi solution so now let's take a look at where the story really begins I guess you can say this story really begins when I receive an email from the newly acquired organization CIO now attach this email was a document and this document provided a high-level overview of their current bi environment now upon reviewing this document a couple of things stood out one of which was this environment wasn't supported by IT rather was a legacy environment that was built by business users and it started out as a small access database that over the course of 20 years had grown to over a hundred access databases the other thing that stood out was how they get the data into these databases see they didn't have any ETL software to handle data transfers so all the data was downloaded from the host systems manually and loaded into the set of a hundred Access databases manually now this process took them over two hours to complete every single day and that's if they didn't encounter any issues now where am I going with all this I thought we're building a data warehouse right well just because you acquire a new company and they don't necessarily have a true enterprise data warehouse that doesn't necessarily mean that you go ahead and build one from the ground up as part of the initial transition especially with a deadline looming and so my first order of business was to evaluate their existing environment and see if we can support that going forward so we don't interrupt the day-to-day operations of the business until we eventually get around to building them a true enterprise data warehouse so I flew to where the team that support of this environment was located now this is where I start to get a better idea of what it would take to support this environment going forward and this is where I learned that these databases had dependencies with each other and also spreadsheets located on various file servers now this meant that transplanting these databases from one environment to another was not going to be easy now the other thing is that they encountered issues with this manual process they were using to load these databases two out of the five days I was there now this required extensive research and so the two hours that this normally would took to do took more like four or five hours now this was if you knew what you were doing now the other thing with these types of situations is there's some staff that you inherit and some staff you don't and this particular team was not going to be coming forward with us which meant this proprietary knowledge that they had this environment was going to be lost and it was going to be up to my team to support this going forward now my other major concern was accuracy see the entire organization depended on this data to run the business and make decisions and with such an unstable environment I feared that we ran the risk of potentially giving the business incorrect data and I worried that this could mean that my team would be endlessly scrambling trying to support this environment and resolve data issues now this could also mean reduced productivity for the entire department and since we had other brands to support this could also start affecting them as well now these were all very real risks in my mind and so at this point I was faced with a difficult decision do I accept these risks or do I attempt the impossible an attempt to build a data warehouse from the ground up in record time now when you add up frequent failures loss of proprietary knowledge accuracy issues lost productivity and the fact that we were eventually planning on building them a data warehouse anyway I decided to build a data warehouse but that begs the question what is a data warehouse now we can't talk about building a data warehouse without first defining what it is and I think the best way to do that is to compare it to your regular application database now the job of an application database is to store application data and this could be anything from a financial application to a warehouse management system that runs a warehouse for example now these are the databases that most people are familiar with and interact with on a day to day basis and the goal of this type of database is typically to store and process some type of transactional data and so the structure or design of the database would be centered around the application itself and we based on doing whatever the application was built to do now typically these applications will come with some level of reporting but the databases are really designed for the application not necessarily for analytics and so this can be querying the data challenging and time-consuming if you're trying to perform complex analysis but that brings us to the data warehouse where you see the goal of the data warehouse is to give the business the ability to better analyze their application data now data warehouse does this by consolidating all of the organization's application data into a single database that's geared for analytics but what does that really mean say the reason the data warehouse provides such analytical advantages is because it doesn't just take the source application data as is instead the data is modified or transformed into a new structure or schema in the data warehouse let me show you what I mean now suppose we're a company that has store locations and an econ business well an application might store transactions related to these different channels and separate tables now this might be because they're handled completely differently within the application now let's find from the applications point of view but what if we wanted to analyze these transactions see from a business point of view maybe e-comm is just considered in another store and so analytically there really is no difference between these two channels and so in the data warehouse the structure of this data might be completely different or both store and EECOM data live within a single table so now you can see that adding data to a data warehouse requires a lot of work upfront as you design the new structure and build all the processes that will ultimately transform the source data into the new structure now this is just a high-level example and we'll get into more detail later on but the truth of the matter is that building a data warehouse requires a lot of analysis and design work upfront it's not just as simple as loading raw data into a database now at this point you should have a good high level understanding about what data warehousing is all about and there really is a design process that has to take place upfront so now we're going to take a closer look at that design process so now next we're gonna take a look at the star and snowflake schema as these are the two most common types of data warehousing schemas you'll find now when it comes to data warehousing you need to change your thinking a little bit and you need to start looking at things in terms of dimensions which are really nothing more than categories of data so let's think about a typical retail company where over time they sell their products at stores and so the core dimensions would be product as they might want to determine which products are profitable or not and they also might want to determine which stores are profitable now typically stores can be grouped into districts or regions based on geography which would be another dimension geography now last but definitely not least would be time as your company might want to analyze their company performance year over year week over week and so on so these would typically be your three core dimensions in a typical retail company now this process of designing a data warehouse is called dimensional modeling before you can begin this process of dimensional modeling you're first gonna want to decide on a schema type and as I said the two most common schema types are the star and snowflake schema but first let's take a closer look at the star schema now in a star schema you'll have a single table per dimension so our three core dimensions of time geography and product might look like this where we have a single table per dimension and our time dimensional table might have date month quarter in a year and geography might have store district and region and our product dimensional table might have SKU subcategory and category these will be your typical dimensional tables now the other type of table are fact tables and these contain facts or the data that the business wants to analyze now some of the benefits of a star schema is that it has a simple design which can make querying easier as there's only a single table per dimension now this means that it's also denormalized which will produce less joints less tables hence less joints and less joints often means better performance now one of the downsides to a star schema is that's not very flexible and we'll see why when we take a look at the snowflake schema which is what we're gonna take a look at next now in a snowflake schema you would have a lookup table for each attribute so in the case of our time dimension our date month quarter and year would each have their own lookup table now as you can see this type of schema will typically produce more joints for example let's suppose you wanted to aggregate the data from our fax sales table to the year well we would first have to join our fact table to our date table on date to get the month then we'd have to join our date table to the month table to get the quarter and then the month table to the quarter table to get the year and then finally we can aggregate on year now the other thing to consider where the snowflake schema is the degree of normalization now this type of example is a fully normalized schema because each lookup table only contains the minimum amount of information necessary for example our date table only contains the month ID so it only contains the ID of the immediate parent and if we want to join to each level in the hierarchy we have to join through each of the look-up tables in order to get there now let's take a look at what a hierarchy might look like using a moderately normalized schema so here we have our geographical hierarchy and we have our three look-up tables store district and region but you'll notice that each lookup table contains not only the ID of the immediate parent but also the parent IDs of all the levels above for example our store lookup table not only contains the district ID but it also contains the region ID as well so as you can see this moderately normalized schema will produce fewer joins for example if we wanted to aggregate our fact table to the region level well then we would simply join our fact table to our store table to get the region and then we can join directly to our region table from our store table skipping district this is because we already have the region ID in our store lookup table now the last type of schema is your fully denormalized schema and the only difference here is that each table would also contain the descriptions now some of the drawbacks to the snowflake scheme is that it's more complex and there are more tables to work with however this also makes it more flexible as having an individual table per attribute gives you the ability to use things like aggregate tables so that's why this is my preferred schema now after you've settled on a schema type you might think that you should start looking at tables and deciding which tables you want to bring to the data warehouse but the truth is you're gonna want to look at this in terms of business process and really you want to decide which business process the data warehouse is going to cover remember this is about analytics and giving the business the ability to analyze their data so you want to look at this from the business point of view and then you want to decide which tables you need to facilitate that analysis for example maybe you want to cover sales or warehouse processing so in that case you would want to identify the data sources and maybe it's point-of-sale data or data from your warehouse management system now after you've settled on a business process and isolated the data sources you're gonna want to clearly define the granularity let me show you what I mean now let's suppose we've decided that we want our data warehouse to cover the company's sales so for that we might use point-of-sale data and the source data may have a table just like this now this may be wrong action data so you'll notice the two columns transaction number and line number that means this table is pretty much at the lowest level you can find which means each record in this table represents a particular item that a customer bought in a specific transaction this is pretty granular and you may not want to bring in all this data into the data warehouse because this might be too much data and it could impact performance and you may not necessarily need to analyze data down to the transaction level that often and so if your main goal is to report on store performance or product performance or company performance you may want to summarize this information in the data warehouse and this is important because you don't want your queries to be bogged down by this low-level transactional data every time they run especially if they're not utilizing this level of information and so maybe upon loading this data into the data warehouse we aggregate the quantity to the store day SKU level and so our new table in the data warehouse might look something like you see here and you also want to make sure you identify the facts that you're gonna bring into the data warehouse here we have a single fact quantity but in most cases that you would have many others so now with our new data warehouse table stored a item sales we have improved query performance while still being able to support our analysis this is also why I prefer the snowflake schema because it gives us the flexibility to aggregate FAC data to any level we choose hopefully now you have some idea as to how to define the granularity of your data warehouse and why it's an important step in the design process now look I'd love to tell you that I want about this by the book I spent months agonizing over every little decision but the truth is I just didn't have that kind of time now before I get into how I did it let's first review what needs to be done first I need to identify and analyze the source data next I need to design a new enterprise data warehouse now this is one of the biggest challenges see the access databases actually did a lot of data manipulation and formulas so they were deriving a lot of new fields that didn't technically exist in the raw source data so I have to make sure my new data warehouse design can cover this now this often meant reverse engineering complex access queries now the next step will be to create the ETL logic after all once you have your new data warehouse design you then have to transform the source data into the new design now we did have a team that handled ETL but I was short on time so I personally took this step on as well now once we had our new data warehouse we then had to rebuild the existing reports then after all the development is complete we have to test and validate everything then last but not least we need to train the users on the new system so here's how I did it see I broke everything down into chunks based on business process first then I would tackle everything a chunk at a time so let's say first I decided to tackle sales well then I would identify all the source systems and the people that had expertise in that area and I would have meetings centered around knowledge transfer and learning about the source data then I would design my new data warehouse for this chunk and write all the ETL logic to transform the source data into the new structure now in order to speed up this step I initially built the entire data warehouse using nothing but database views see it was common for our each held jobs to stage the raw source data anyway which means I had all the raw tables in a staging area in the data warehouse this man I could easily build views on top of that to serve as the new data warehouse design initially now this meant if I had any issues found during testing I could quickly respond by just changing the query in the view now you never want to do this in a production data warehouse because views are slow and you will run into performance issues but I knew that this work would not be wasted because once everything was validated the ETL team could step in and use the queries in my views to build the physical data warehouse tables this made their job easy and much faster now this approach saved everybody time now once I had the views in place I would notify my team that they can start working on converting the reports that use this information meanwhile while they were working on converting the reports I would move on to the next chunk see taking this approach was important because it allowed tasks to go on in parallel and it was also important because it allowed us to get the users involved in the process early on which is critical because we were able to run our new environment in parallel with their existing environment making validation easier while giving users more time to adjust thus easing the transition ultimately I had complete confidence with this approach because I knew my team and I had a superstar developer that could take the lead on converting the reports and we had also made a recent addition to the team who was great at working with the business so I knew she would be able to take the lead on transitioning the users now this project started out feeling like my back was against the wall but at the end of the day it was a complete success wouldn't need this project successful was involving the business users early on they even feel like they were a part of the project and not getting the short end of the stick see you can go the best system in the world but if the users don't buy in it's not gonna matter and you run the risk of them having a bad experience and telling their boss who tells their boss and before you know it the project's been painted in a bad light and no matter what you do it's not gonna be successful the other thing was knowing where I can bend the rules to save time the last thing that made this project successful is having a versatile team and being able to leverage their strengths where each person played a role and that's how you build a data warehouse thanks so much for watching I really hope you enjoyed that and please don't forget to Like subscribe and hit the bell
Info
Channel: James Oliver
Views: 19,467
Rating: undefined out of 5
Keywords: enterprise data warehouse, snowflake data warehouse, data warehouse, star schema, snowflake schema
Id: wWbxnwaHhQ8
Channel Id: undefined
Length: 21min 20sec (1280 seconds)
Published: Thu Jul 09 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.