Modern Data Modeling with data lakes and Power BI datasets

Captions
Hi, I'm Ike Ellis, and we're going to look at data modeling trends for analytics. Specifically, we'll look at how we can create a modern data lake that will support Power BI in a cost-effective, high-performing manner that's easy to maintain and change.

A little bit about me: I am the general manager for the Data and AI practice at Solliance. I'm also a Microsoft MVP, I speak at conferences, I write books, and I am readily available on Twitter if you need to talk to me. I like making new friends, and hopefully we'll get to know each other a little bit better.

We're going to talk about traditional enterprise data architectures, we'll look at their problems and weaknesses, and then we'll look at how a data lake can address those things and how we can properly organize a data lake for use with Power BI.

We build enterprise data architectures for our analytical systems for all kinds of reasons. We might do it for fraud detection, for decision support, or because we need tactical reporting. We might be creating an advanced analytics platform for machine learning or deep learning. We might create historical data stores where we can archive data cheaply and affordably and get it off of our transactional systems.

Traditionally, what we've done for the last two or three decades is pull data from multiple data sources, like a financial system, an accounting system, a warehouse, or inventory management, and bring it into staging. Staging is only there so that if we have a problem moving data in reporting, we don't have to add additional load to the source systems again; we use staging to decouple our data pipeline process from the source. From staging we go to the ODS, or operational data store. An operational data store is a place where we gather the data for permanent storage from all the different source systems so that we can join them together. So if you've got 100 tables in an inventory management system and 50 tables in your financial system, you're going to have 150 tables in the ODS, but they'll be available to join together for analytics. From there we move the data into a star schema in a data warehouse or a data mart. If you had 150 tables in the ODS, maybe you only have about 12 tables in the data mart: it's flattened, it's highly filtered, it's clean, it's organized and prepared; there's heavy data modification when you get into a data warehouse. Once it's there, you might build Analysis Services cubes from that, or you might build a Power BI data model from it.

Star schemas, which are what we use in a data warehouse or a data mart, are just tables that are organized very simply. Typically each table is either a fact table or a dimension table. Fact tables hold numbers: a sale amount, the number of employees you have, the number of inventory items on hand, the number of RVs you sold, whatever you're measuring. Fact tables also have foreign keys, and those foreign keys relate to primary keys in dimension tables. Dimension tables are the "by" values. For instance, if you were looking at sales by region, the sales would be in the fact table and the region would be in a dimension table. If you were looking at sales by employee name, the sales would be in the fact table and the employee dimension would have the employee name in it. The reason they're called star schemas is that you have one central fact table with dimension tables around it, branching out like a star.
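To make that concrete, here is a minimal sketch of a star schema in T-SQL. The table and column names (FactSales, DimDate, DimRegion, and so on) are hypothetical, not taken from the video.

    CREATE TABLE dbo.DimDate (
        DateKey      INT           NOT NULL PRIMARY KEY,   -- e.g. 20200924
        CalendarDate DATE          NOT NULL,
        CalendarYear SMALLINT      NOT NULL,
        MonthNumber  TINYINT       NOT NULL
    );

    CREATE TABLE dbo.DimRegion (
        RegionKey    INT           NOT NULL PRIMARY KEY,
        RegionName   VARCHAR(100)  NOT NULL
    );

    -- The fact table holds numbers plus foreign keys to the dimensions;
    -- this sketch keeps one row per region per day.
    CREATE TABLE dbo.FactSales (
        DateKey      INT           NOT NULL REFERENCES dbo.DimDate (DateKey),
        RegionKey    INT           NOT NULL REFERENCES dbo.DimRegion (RegionKey),
        SalesAmount  DECIMAL(18,2) NOT NULL,
        SalesCount   INT           NOT NULL
    );

A "sales by region" question then becomes a single join from the fact table to its dimensions, aggregated in one pass.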
When you build fact tables, one of the most important decisions you can make is the grain: the level of detail you're putting in the rows. The reason that's important is that if your decision makers want to see an individual sale, say you sell homes and you sell maybe 5,000 homes a year, then you would want each individual home sale in the fact table. But if you're a Walmart and you're selling millions of things every minute, you wouldn't want the individual transaction in your data warehouse; you would summarize the grain in the fact table, maybe total amount sold by hour, because the hour is statistically interesting to the decision makers and the data scientists and they track things by hour. You would sum it all up by hour, and that would be your lowest-level grain. Choosing the grain appropriately will determine the size of the data warehouse, and it will determine how low you can drill into the data in your reporting. So it's important to identify the measures you're going to be analyzing, identify the grain, and then create the fact table in the data warehouse appropriately.

The reasons we make star schemas are that they're easy to understand, we get one version of the truth, it's easy to create aggregations in a single pass over the data, we get faster queries, and it's a good place to feed cubes or to feed Power BI. The reason I'm being deliberate in telling you the importance of a star schema is that I've heard a lot of industry noise from other vendors saying things like: you don't need a star schema because we have columnstore indexing, or super-fast disk technology, or we do all sorts of analysis, so you can keep transactional tables like in the ODS and just build analytics off of those. I'm telling you, if you do that you'll regret it. What I always say is: you can either start out by making a star schema, or you can one day wish you did. Those are the two choices. Star schemas are still important, even in a data lake; they're still valuable.

Now, with this common enterprise data architecture, let's say you built it with SQL: your source data might be SQL, staging is SQL, the ODS is SQL, the data warehouse is SQL, and you built it with ETL tools (ETL stands for extract, transform, and load), moving the data from left to right all the way down the EDA. You probably ran into some weaknesses. One weakness: let's say that in a Power BI report you want to add one column that your EDA doesn't have yet. You identify that column in the source system, and then you have to change nine different places in order to get that column into the Power BI report. That took a long time, it was very fragile, you likely introduced some errors into your system, and you had a very hard time making changes, so you probably avoided making changes at all. In fact, a lot of people in Power BI just started going to the source directly and extracting the data from the source into Power BI, avoiding the enterprise data architecture altogether because it was too cumbersome to maintain and too difficult to change. What that meant is that for all these different Power BI reports, you're just extracting data from the source over and over again.
And you had a lot of different problems with this. These Power BI reports wouldn't be consistent with each other, so one report would say something vastly different from another. If you needed to change something about the data model, you might have had to go into five different reports to make the change, and maybe you missed one, so four got changed and the fifth one didn't, and now they're inconsistent. There was a lot of overhead in maintaining all these different reports with all the different datasets and all the different extractions. And then you loaded the source system: the source system was suffering in performance because you're repeatedly getting the same data and cleaning it over and over again. The data models people build this way tend to look very bad; they have tons of tables in them, lots of logic, lots of DAX, and they're very slow to refresh and very difficult to maintain. So that's one weakness.

Another weakness is that SQL Server being used for staging and ODS is not that great, for a couple of reasons. The first is that SQL Server was not designed to be a staging and ODS location. It has been used for that by many people, but it was not designed for it. SQL Server was designed to be a highly consistent storage system for transactional data. If you were dell.com and you were tracking your orders, SQL Server is great for that, because it can handle the massive consumption of data by thousands of people in a highly performant manner. But for ODS and staging it's overkill: it writes data multiple times, it does locking and blocking, it has indexing and all this stuff that isn't really necessary for those locations. All we really need is to load the data and leave it there for analytics. SQL is also expensive; it costs a lot. So when you use it for things like staging, you're spending all this money and not using all the horsepower you paid for. There has to be a better alternative than SQL for these things.

Another weakness of these systems is that big data warehouses and data marts have been very difficult to maintain. ODSs can get very big, terabytes and terabytes of data, and they can be really difficult to maintain. It's hard to back them up, it's hard to restore them, it's hard to create a dev environment, a test environment, and a staging environment, it's hard to use source control with them, and it's hard to clean the PII, the personally identifiable information, out of them. Creating tests in these organizations is difficult because the databases are so big that the tests take a long time just to figure out whether things are working, so people don't like to test these big things regularly.

Another weakness: if you have a huge data warehouse and you're trying to move it into Azure, it's very expensive and cumbersome, and here's why. When you use Azure or AWS, you're fundamentally getting charged for four things: CPU, memory, disk, and network. It doesn't matter how they bundle the pricing, whether they call the metric DTUs or RUs or whatever; underneath that metric you're being charged for CPU, memory, disk, and network. And of those, CPU is by far the most expensive; CPU is very, very pricey. So when you have a great big data warehouse, you need a lot of memory, and in the cloud memory is anchored to CPU, and CPU is really expensive.
So let's say you have a data warehouse with about 10 years of data in it. You and I both know that your decision makers are not using 10 years of data; they're probably only using the last 14 months or so. But in this great big data warehouse you have 10 years of data, it's massive, it's gigantic, and when you move that thing to the cloud and your decision makers say, hey, that thing's not fast enough, make it faster, you have this great big knob to turn: that data warehouse used to cost me five thousand dollars a month and now it's costing me ten thousand dollars a month, and people notice bills that high. So rather than these great big ODS and staging databases, we'd rather have much smaller, cheaper things. Small, cheap things are much easier to scale because the dials are smaller. Instead of one great big ODS, if we have a lot of smaller files and we go to make them faster, it's a small knob to turn, not a great big expensive one. When you have a small data mart, instead of costing 500 bucks it costs a thousand bucks to make it twice as fast, and that's more affordable for the organization; people might not be offended by that the way they would be if you changed something from 10 grand a month to 30 grand a month.

There's another problem: there's no separation of concerns in the architecture. Remember all the reasons we create an EDA; they tried to make one EDA do all of those things. For instance, this is a very bad alerting system for fraud detection, and the reason is that there's a data latency problem. As you move data from the source all the way to the cube, it just takes a long time, and when it takes a long time, people get upset. Partly it's our fault: we were using words like "real-time data analytics," and there's nothing real-time about something like this. To move data from the source all the way to the cube is going to take probably an hour, maybe four hours, maybe all day; my point is it's not going to take a minute. Just the act of ETL-ing all that data from one thing to the next thing to the next thing takes time.

So we have built data lakes that look like this, and we've been doing it for a few years now. Every data lake has a capture, a store, and a deliver. A data lake is a file-based, not a SQL-based, analytics system, and this is important; let's talk about it for a second. If you have a relational store like an Oracle or SQL Server data store and you move data into a file-based area, you're building a data lake, and that's all it is. It's as simple as that; this is not complicated. What you're doing is removing a relational database like SQL Server and replacing it with files. That's it. So we have files in staging, files in the ODS, and then we might still use SQL Server for our data marts, but our data marts will be much, much smaller than a great big giant data warehouse. So don't be afraid of files. We are not mostly talking about CSVs or JSON files in a data lake; we can store those in a data lake, but we don't have to. Instead, we're using a format called Parquet. Parquet is compressed, Parquet has reduced I/O, and Parquet is splittable and scalable. So what is a Parquet file? It's a column store. If you've been doing SQL Server for any amount of time, you've been hearing a lot about columnstore since 2012.
Parquet is a columnar data storage method that's file-based and gives you strong data typing, meaning we have data types in there, so it's not like CSV where everything's a string or there's all sorts of bad data. Parquet preserves the data structure, it is compressed, and it is fast, very, very fast. So we can switch from SQL Server to Parquet files and get a lot of benefits. The main benefit: Azure SQL Database might be hundreds or thousands of dollars, but files are pennies to store gigs and terabytes of data. We store those files in Azure Blob Storage or in Azure Data Lake Storage Gen2, and if you look at the pricing online, you can see that it is very cheap to store those things.

So, like we said, we're using Parquet files for staging and Parquet files for the ODS. We don't have to; there are other file types available, like ORC files, but Parquet is the popular one, the one most people are using. You could put CSVs in here too if you want, but you wouldn't want to do a lot of analytics over a CSV if you can avoid it, because it can be pretty expensive.

That means your data architecture might look like this: you use Azure Data Lake Store for long-term storage, and you have pipelines feeding into a SQL data mart, which might be where your star schema is. Those pipelines might be written in Azure Data Factory, Synapse, or Azure Databricks, so you might write them using SQL or Python, and you might be filling that star schema in an Azure SQL data warehouse or an Azure SQL database. You might split off alerting, so alerting won't use the data architecture I just showed you; instead you might be feeding interesting events into an event hub, and from there you use Stream Analytics to load them directly into Power BI. By keeping a simple alerting system that only has two or three mechanisms, you pull directly from the source into Power BI, so that if something like fraud is happening, or low inventory, or something you really care about, Power BI is getting updated in real time, people can see it immediately, and you can fire all sorts of email alerts to tell people, hey, something bad is happening. But we're not overloading the data lake with alerting; the data lake is for long-term analytics, machine learning, and things like that.

In Power BI you can create real-time streaming dashboards like this, and this is a good example of what you would do for alerting, because as events happen in your organization we can immediately see a live update in the tiles and line charts in Power BI. If I were monitoring something really bad, say a fraudulent call that got detected, I would want to see whether this number went to triple or quadruple digits, because then I have a big problem I need to react to quickly. I have a customer that monitors several APIs on the internet to see if a disaster is happening, and there's always a disaster going on in the world: earthquakes, storms, tornadoes, mudslides, all sorts of things. They monitor these APIs, and if it looks like there's a certain number of disasters happening in the same area, they report on it. They built really big LED screens on the walls of their organization where dashboards just like this are displayed, to keep everybody on the same page so that everybody's monitoring the same thing. And you can build that with Power BI.
Now, the way it starts, let me log into my Azure portal. It starts with an IoT hub or something similar; it doesn't have to start with an IoT hub, but it can start with something like it. What I mean by that, hang on one second, let me just log in real quick, what I mean is, if we look at this event hub that I have, let's pretend that whoever's building the fraud-call application is reporting data to the event hub. Inside this event hub, which I called "ikeh," creatively, I secure it: I basically say, okay, this person has the right to manage, send, and listen as long as they've got the right key. When you click on this, you can see the typical connection strings and keys you use to secure Azure resources. Once that is created, the application starts reporting the events we care about to the hub, and then we have to subscribe to it.

Now, in an IoT situation there will be a flood of messages everywhere, and we're not going to want to subscribe to that flood directly, so we need something to temper the message load, to soften it, to lower it to a cadence that is digestible. You'll notice that this display gets updated every second, and we can read it; but imagine we had a call center with millions of calls a second and the fraud numbers were pretty high. It would update too frequently and we couldn't even read the numbers. So we need something to lower the speed at which these events are being directed, like a governor, if you will. What we use in Azure to do that is called Stream Analytics. With Stream Analytics you define an input, and here the input is that event hub; then you specify an output, and in my case the output is Power BI. If I go and click on outputs, it gives me several different options, and one of those options is a Power BI dataset, which I called "stream data set," and then you specify a table, and it immediately starts loading data into a Power BI dataset. What data does it load? That's based on the query. If you look at the query, what does that look like to you? It's just a regular SELECT statement. You use the SELECT statement to say, hey, give me all the fraudulent calls within the last five minutes. That's what's lowering the cadence of the message load to Power BI.
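As a rough illustration, a Stream Analytics query for this kind of scenario might look something like the following sketch; the input and output aliases and the column names here are hypothetical, not the ones from the demo.

    -- Count fraudulent calls over five-minute tumbling windows
    -- and push the result to the Power BI output.
    SELECT
        System.Timestamp() AS WindowEnd,
        COUNT(*)           AS FraudulentCalls
    INTO
        [powerbi-stream-dataset]        -- output alias defined on the job
    FROM
        [fraud-call-eventhub]           -- input alias (the event hub)
        TIMESTAMP BY CallTime
    WHERE
        CallType = 'fraud'
    GROUP BY
        TumblingWindow(minute, 5)

The tumbling window is what acts as the "governor": instead of every raw event, Power BI only receives one summarized row per window.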
Now, what do you do inside Power BI? Let's say we create a standard dashboard, just a regular dashboard. If I click Edit and Add a tile, it asks what kind of tile I'd like to add; in my case it's custom streaming data. I click there and it asks whether I'd like to use the stream dataset. Where was that stream dataset defined? Remember, it was in my output; there's the Power BI output, and if I click on it you can see I called it "stream data set." So I come back here, pick the stream dataset, click Next, and it says, okay, pick a visual, a bar chart or a gauge. How about a gauge? What value do you want on the gauge? I want the fraudulent-calls value. What's the minimum value, if there's a minimum in the dataset, and the maximum, if there's one of those, or maybe I can specify a target that might exist in the dataset; but for now we'll just do fraudulent calls. I click Next, then we name it, we'll just say "fraud calls gauge," and we click Apply. Now you can see that gauge is available to me, I can resize it if I'd like, I can put it up here, and now I've added it to my streaming dashboard. And that's it.

Notice how we go right from the application source, which is the IoT hub, into the Stream Analytics job, which softens the message load and gives meaning to it, and that's just a SELECT statement; you output it to a Power BI dataset, and then you use it in a Power BI dashboard as a streaming, real-time tile. That's it; that's what you do, and you're not overloading your existing data lake with any of the alerting data if you don't want to. If I did want to put it in the data lake, that's just adding another output: I can output to two things, a data lake file so I can do long-term analysis on it, and Power BI so I can see immediate telemetry dashboards. That's it; you're well on your way to creating Power BI streaming datasets. Thanks.

Okay, so let's take a look at data virtualization as a concept. This is relatively new, within the last decade, and what it does is allow us to use files as tables. You have data in a file, you define a table over it, and you can query the table directly. We've had data virtualization in Spark, PolyBase, Hive, and even Power BI for a long time, but let's take a look at what it looks like in Synapse. I'm going to open up my Synapse workspace here, and you can see that I have some SQL scripts already built. For instance, if I want to query Parquet files, look at this: SELECT TOP 10 FROM OPENROWSET, this is the Parquet file, give me a couple of rows. I can run this right now and you can see data coming back: I pulled the Parquet file, it inferred a schema, and now I can select from it: year, cases, deaths. If I don't want an inferred schema, I can create one. I can say, hey, go into that Parquet file, the format is Parquet, and the columns are date_rep, cases, geo_id and so on, with data types like date and varchar, and if I run this query you see just those four columns, and all that data comes back for me.

I can also do some aggregations here. I can say, open up that Parquet file, give me the year, the passenger count, and a total, and group it, so now I'm starting to do some basic analytics, and it's all file-based; it feels like Transact-SQL executing over those files for us, and we are running just normal queries here. This one's taking a little bit longer, but you get the idea. We can do all kinds of analytics if we want; we've got an awful lot of power here. This is Transact-SQL over files, and it's pretty impressive. Another thing we can do is execute over multiple files: I can do FROM, JOIN, GROUP BY, and ORDER BY, so I have the ability to join data from multiple files, and those files don't necessarily have to be Parquet; in this case one of the files is a CSV. I can run this query, and here's the schema, just like you'd expect. This is schema on read: in a typical SQL data warehouse you would define a schema and then load data into it; here, we load up a bunch of Parquet files, and then when we use those files we define the schema, so we're creating the schema on read.
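For reference, a serverless SQL query over Parquet with an explicit schema looks roughly like this; the storage account, container, folder path, and column list are hypothetical stand-ins for the ones in the demo.

    -- Query Parquet files in the lake with an explicit schema (schema on read)
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/datalake/silver/cases/*.parquet',
        FORMAT = 'PARQUET'
    ) WITH (
        date_rep DATE,
        cases    INT,
        deaths   INT,
        geo_id   VARCHAR(10)
    ) AS rows;

    -- The same source can feed simple aggregations
    SELECT YEAR(date_rep) AS report_year, SUM(cases) AS total_cases
    FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/datalake/silver/cases/*.parquet',
        FORMAT = 'PARQUET'
    ) WITH (date_rep DATE, cases INT) AS rows
    GROUP BY YEAR(date_rep)
    ORDER BY report_year;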
Why is that important? Well, in a traditional SQL Server-based data warehouse, let's say I had a table with a billion records in it and I wanted to add a column. I would have to provide a value for the new column for the billion existing records, and what this has traditionally done is make changes to large data warehouses very difficult. Adding columns can even result in downtime. Maybe you want to make a structural change to the data warehouse: you could be down all weekend making that change, and if there's a mistake and you have to roll it back, you could stay down and your users might not be able to use the analytics system as a result. But in a file-based analytics system, I can add and remove columns from the files easily, and I can define the structure when I'm reading the data using data virtualization. That means I can add and remove columns borderline at will, without any downtime. Data virtualization typically has a graceful-failure notion, which means that if I go looking for a column that isn't actually there in the file, the column just comes back null; it says, oh, I didn't find that column you were looking for, so there's nothing there, instead of failing. That's a lot different from a schema-first relational data store, where if you go looking for a column that isn't there, you error out; that's a hard fail, and it doesn't happen in the same way with what we're using here, which is Synapse with PolyBase. Does that all make sense? I hope so.

So how do we organize a data lake? We organize a data lake using folders. For instance, what we used to call staging in a data warehouse we now call raw, or bronze, or we could still call it staging if we wanted to; there's no rule about the names, you can use any names you want. It's probably going to be Parquet files, it's probably going to be a folder, and it's probably going to live in Azure Data Lake Storage. So this is Azure Data Lake Storage, and I'm in Azure Synapse. This isn't a course on Azure Synapse, but just to give you an idea of what you can do in it: you can collect data from a lot of different places, like this SQL data, or you can link data. If I click here and say, hey, what external data can I link to, I can link to a Cosmos DB or to Azure Data Lake Store. We've been talking mostly about Azure Data Lake Store, but Cosmos can work also. When I go in here, you can see I've got a storage account for ADLS, and you can see the base folder structure, the bronze (raw) folder right here; I just wanted to give you an example of what that folder structure looks like. One of the great things about Azure Data Lake Store is that it works just like HDFS, so anything that works with HDFS, Hadoop, and Spark can work with Azure Data Lake Store, because the front-end API is the same. Here Synapse is connecting to ADLS, it sees a bronze/raw folder, and we have subfolders, just like HDFS.

What we do is organize our subfolders by the source that's coming in. For instance, if I had Great Plains data, SAP data, or CRM data, each source gets its own folder, and then usually underneath that we have the date we brought the data in. Maybe I've only been running this for two years, so I have 2019 and 2020 data, and maybe I only pulled the data once a month, so let's just pretend that for the CRM data I have a monthly load, and that load drops the data as JSON, which isn't great for analytics but is fine for raw; we can turn that JSON into Parquet later if we'd like. You can see I've run this through about July or August, and I have a file for each month. Now, in a normal relational SQL Server data warehouse, you would load that data into one table. The reason we don't need to do that in a file-based system is that data virtualization will work over all the files in a folder. If I wanted to create a metadata table called crm_raw, I would point it at this folder and it would ingest all the files underneath it as one virtual table, so I could say SELECT * FROM crm_raw over all the folders and files. It would infer a schema, or you could create a schema over all those files with the right columns and data types; that would all work.
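A minimal sketch of that idea in Synapse serverless SQL, assuming the CRM files have already been converted to Parquet; the storage account, container, folder layout, and object names are all hypothetical.

    -- One virtual table over every monthly file in the CRM folder tree
    CREATE VIEW dbo.crm_raw AS
    SELECT *
    FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/datalake/bronze/crm/*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS crm_files;

    -- Then query it like any table:
    -- SELECT * FROM dbo.crm_raw;

The wildcard in the path is what makes every file under the folder appear as one table, so new monthly drops show up automatically without reloading anything.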
All right, there you go; I just wanted to give you an example of what a data lake store looks like, and particularly in Synapse, because I think the tool is pretty cool and it consolidates a lot for us in one place.

The ODS area we typically call ODS, or clean, or prepped, but most commonly it's called silver. So we went from bronze, which was the staging area, to silver, which is our ODS. That is where data has been cleaned, just like ODS data was cleaned before, and it is where data can be joined, but the star schema does not exist there. So the question is, where do we put the star schema? We can keep the star schema file-based, in which case we call that folder gold, or we can call it the data warehouse. Or we can put it in a relational database like Microsoft SQL Server. Or we can use a Power BI dataset: we can use Power BI to pull in tables, clean them up using Power Query, simplify them down to five or six tables, and design the star schema in Power BI if we want. That is not my favorite thing to do. What I prefer is to build star schemas in Azure SQL Database; that's my preference, and I hope you would strive to do that too. The reason I like star schemas in an Azure SQL database is that I like to use those star schemas in Excel, Power BI, and other places. SQL is fast when you have columnstore indexing over star schemas, very, very fast, and the refreshes are fast. As engineers we can avoid building the star schema in Power Query, because that slows down data refreshes pretty significantly, and it relies on cleaning things up in the data extraction and in query folding, which can be fast but can also be very difficult to maintain, difficult to troubleshoot, and sometimes just so slow that we lose our ability to make it faster. So we'd prefer to organize the ODS in the silver area using Parquet files, then use our data pipelines to build a data mart in SQL Server, and from there do a straight, simple extraction into Power BI, very simple and clean, so that the star schema arrives in Power BI already built.
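If you go that route, the columnstore piece is just an index on the fact table; here's a minimal sketch, with a hypothetical table name.

    -- Columnstore compresses and column-orients the fact table,
    -- which is what makes scans and aggregations over it so fast.
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
        ON dbo.FactSales;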
This is a good example of a star schema in SQL Server: this is the AdventureWorksDW sample, and you can play around with it. Here you can see we have a bunch of fact tables and some dimension tables, just like we talked about earlier. Now, the silver, the ODS portion of the data lake that I haven't shown you, is typically going to be Parquet files, so you are going to have hundreds of folders and files in the silver area, because you take data that's interesting to you, put it in Parquet so it's compressed, and leave it there pretty much as you found it, without a lot of modifications, so you can tie it back to the way the raw came in as JSON. You put it in Parquet in silver because running analytics on it there is much, much faster than running analytics on the raw JSON. Then from there you put the star schema in SQL, just like this, and it's going to be quite a bit different from what you saw in silver: in silver we might have 300 tables, while here you're only seeing about 20 tables, and that's pretty typical for a star schema in Azure SQL Database.

One final note: I would make the SQL tier you use for that database Hyperscale. We have found that when we load data into SQL using Spark or Synapse, we get a lot of log contention, and it slows everything down. The reason we get that contention is that our data pipelines in a data lake typically run in parallel: we have multiple nodes running pipelines at the same time, and if they all hit the same SQL database they overload it, the log can't keep up, and it gets really slow. But if we use Hyperscale, the log can keep up, and we can load data very rapidly, terabytes in minutes, really fast. So, like I said, use Azure SQL Database Hyperscale, put the star schema there, and I think you'll find a lot of value.

Now, what do we do with the math, the sums, the averages, things like that? We can create aggregation files, Parquet files that might be stored in an aggregation folder in Azure Data Lake Storage; or we can create aggregation tables, so things like sales by year, sales by month, and sales by day can exist in the data mart in SQL Server if you'd like; or you can build DAX in Power BI that does those measures and calculations for you. Of these three options, I prefer doing it in DAX. I think DAX is very fast for creating aggregations. The problem with building aggregation files or tables is that they are always getting out of sync with each other. Let's say you have an aggregation table called sales-by-month and a different one called sales-by-year, and you want to change how you do those calculations. My experience is that when you build aggregation files and tables, the month and year versions are often out of sync, often wrong, often inconsistent, they need to be troubleshot all the time, and it's very difficult to keep them maintained. Whereas if you keep the star schema in the data mart, extract it into Power BI, and build the DAX in Power BI, then the slicing and dicing is done in the DAX: the monthly and yearly aggregations are just done for you automatically using the filter context. You don't need to maintain them, they're always consistent with each other, and it's much easier to have clean and accurate calculations in DAX than it is hand-building them yourself in Python or SQL or something like that. That's my opinion; other data engineers will have other opinions, but that's how I prefer to maintain it.

So in Power BI I bring in the star schema, and notice that these tables are coming in raw, with the dimensions here and a fact here, just like you saw in Azure SQL Database Hyperscale.
If I come in here and edit the queries, what you'll see is that here's the table, and I'm not really massaging anything much in Power Query. In the real world I will do a little bit: I'll combine some columns, remove columns I don't want, and do some light editing, paying attention that I'm always using query folding. Query folding, remember, is our ability to right-click here and say View Native Query, so we can see the query being pushed down to the source. We don't ever want to do so much in Power Query that we lose query folding, because we start really losing performance when we do; I've got another video on my channel that covers that in more detail if you'd like to see it.

Okay, so what I do is I usually create a table called Measures, and the way I do that, just in case you're curious, is to come in here, click New Table, and create a brand new table called Measures. Then inside it I create a new measure, and this measure is really simple; it's just DAX that counts rows. Notice that in this visual I'm counting rows by month, and as the months progress I can see my total count here; but I can also use the exact same DAX and instead do it by year. I can say, here's calendar year, give me total sales, and now you can see my sales by calendar year; let me just add the calendar year there, there we go. So you can see, by year, the sales that are happening. When I talked about aggregate tables not tying out: the totals here and the totals here wouldn't tie out easily with aggregate tables, which always require maintenance. But what I want you to notice is that this visual uses the Total Sales measure, and this visual uses the same Total Sales measure; it's the same DAX, just giving me the total number of rows by year or by month. Power BI gives you that consistent mechanism using the evaluation context. I've got a lot of DAX videos if you want to learn more about this, but for this purpose DAX is providing us a lot of power, and it's very fast. For instance, when I highlight one of these, the visuals get updated on the fly; that DAX is getting re-evaluated on the fly, which is very powerful. So I try to keep my aggregations in DAX to tap into that power and speed. Thanks.
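As a rough sketch of the kind of measure being described, counting rows of a fact table; the table name FactSales is hypothetical, not the one in the demo.

    Total Sales =
    COUNTROWS ( FactSales )

    -- Sliced by month or by year in a visual, the same measure is
    -- re-evaluated under each filter context, so the monthly and
    -- yearly numbers always tie out.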
We have other uses for folders here too, for instance snapshotting. Remember, if we have file-based data stores, it becomes very easy to snapshot data. In our financial system, if somebody says, hey, I want to know what the financial system looked like in August 2019, it's just a copy: we copy that data over to other files, and now you can examine it all you want. If a data scientist says, I want a quick snapshot view of some data, you copy it, give the data scientist access to it in the data lake, and they're done. Why can we do that? Well, in a SQL-based data warehouse, copying data is very expensive. If you have a 50-terabyte ODS and somebody wants a copy of it, then because SQL is expensive to license and maintain, and because the backup and restore of 50 terabytes take so long, you avoid doing it. You don't like snapshotting things like that; it's expensive and time-consuming, and nobody wants to keep the copy around for long. But when you move to a file-based system, these things are cheap to do. You can easily make a copy of terabytes of data, pay a couple hundred dollars a month to maintain it, and that data scientist can go to town on it, completely shred it, rip it apart, analyze it, put data virtualization tables over it, maybe create their own star schema over it, do whatever they want with it, and the minute they're finished it's just a delete. In a data warehouse, deleting a billion records, because of the logging and the MDFs and all those things, could result in downtime; in a data lake, deleting is just a file delete, and nobody cares. So you can really start to see the power of moving some of these analytic systems off of a relational store and into a file-based Parquet store with data virtualization. It is very powerful and affordable to do that.

We can do the same thing for temporal tables and for archiving. That 10-year data warehouse we were concerned about: take nine years of it, put it in a file-based data archive, call it a day, and now you've got a lean and mean 14-month data mart where you're doing really good, up-to-date, current analytics. And the beauty of this is that if you have bronze, silver, gold, archive, and snapshot folders, and you do data virtualization, meaning you define tables over those files, and you've got other data that's highly structured in a data mart, then in a single join, in a single query, you can pull data from all those different places. In Power BI you can pull data from all those places, and in Python and SQL you can pull them all in too.

So if we go back to our demo: let's say I want to do archiving. What I do is create a file that is a Parquet file, specify that the data source of that file is a specific location, and then create a table. Look at this CREATE TABLE structure; you know this, but instead of defining the columns, I say this file that I've built a connection to in my data lake needs data hydrated from this table in the data mart, and it's just a SELECT statement. Using this method, I can take a whole bunch of data from the data mart, load it into a file, and then delete it from my data mart, keeping my data mart lean, mean, and clean, while the file is still there being used, just like this. I can query it using OPENROWSET, or I can define a table over the file; that external table exists in my metadata repository and I can query it directly. What does that mean? It means that as I move data from expensive SQL Server over to cheaper files, I don't lose access to it, and my queries don't necessarily need to change. If I've got a report that's looking at older data, those queries can stay the same, and I could be moving data into files underneath it without the reports ever changing or being modified. There's an awful lot of power in building a modern data lake like that: Synapse gives us the ability to join data from SQL or from files in a seamless manner, Power BI can use it, machine learning and deep learning can use it, data scientists can use it, your auditing requirement to keep data around for 10 years or whatever is being met, and the fundamental code is not really changing that much.
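A minimal sketch of that archiving pattern in Synapse using CREATE EXTERNAL TABLE AS SELECT; the data source, file format, folder path, and table and column names are hypothetical, the setup details (credentials, schemas) vary by environment, and it assumes the data-mart table is reachable from where the statement runs (for example, a dedicated SQL pool).

    -- One-time setup: where the files live and how they're written
    CREATE EXTERNAL DATA SOURCE datalake
        WITH (LOCATION = 'https://mydatalake.dfs.core.windows.net/datalake');

    CREATE EXTERNAL FILE FORMAT ParquetFormat
        WITH (FORMAT_TYPE = PARQUET);

    -- Hydrate an external Parquet table from the data mart; once it's
    -- written, the matching rows can be deleted from the data mart itself.
    CREATE EXTERNAL TABLE dbo.FactSales_Archive_2012
    WITH (
        LOCATION    = '/archive/factsales/2012/',
        DATA_SOURCE = datalake,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT *
    FROM dbo.FactSales
    WHERE OrderDateKey BETWEEN 20120101 AND 20121231;

Because the external table stays in the metadata, queries against the archived rows keep working even though the data now lives in cheap Parquet files instead of the data mart.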
So, in conclusion: yes, we still make star schemas; yes, we still use slowly changing dimensions; yes, we can still use cubes if we want. But don't be afraid of files. Don't be afraid of building Parquet files and putting data-virtualized table structures over them. We can still use SQL, you love SQL, I love SQL, and we can still build SQL data marts; just try to keep your aggregations in Power BI. If you do that, you're going to create a really good analytics solution for your organization. Now, I realize I have a 50-minute session, so I've purposely kept my video short so that we can stop and start while we talk live and use the remaining time for Q&A and discussion. I can show you other things if you'd like, but thanks a lot, and have a great day.
Info
Channel: Ike Ellis
Views: 2,470
Rating: 5 out of 5
Keywords: Power BI, Data Lakes, Azure, Parquet, Streaming, Alerts
Id: BKJ6qnyL6_Q
Length: 48min 29sec (2909 seconds)
Published: Thu Sep 24 2020