How to use the Power BI Aggregations Feature

Video Statistics and Information

Captions
Adam Saxton here with Guy in a Cube, and today I'm joined by Christian Wade from the Power BI product team. We're talking about aggs. Let's do it.

Christian, thank you for joining me. We're on the show floor at Microsoft Ignite, and I hear you're going to demystify something. Hopefully you've seen the "clicky clicky, draggy droppy" demos Christian has done before; it's an awesome demo with massive data. What we want to do today is show you the clicks it takes to get to the clicky clicky, draggy droppy. We're going to demystify the magic, because people need to know. So Christian, what have we got?

OK, so this is the biggest scalability feature in the history of scalability, in the history of everything. This is clicky clicky, draggy droppy over petabyte-scale data sets. It's absolutely insane. And as you said, we've done that demo so many times that everyone wants to know how on earth to set this thing up. Could it be real? It could be real. I'm about to show you right now that it really is real; it's not a figment of my imagination. It's in the September release of Power BI Desktop as a public preview, it will be supported in the service very soon, and it will GA soon after that, once it works with row-level security. So let's get stuck in.

Here's the concept, Adam, as you're well aware: Power BI achieves blazing-fast performance by caching data in memory. We compress the data in the cache, and because it's in memory it's fast and immediately accessible; that's what enables the clicky clicky, draggy droppy interactive analysis. But then there are the really big data sets, and you don't actually need a petabyte-scale data set, believe it or not. It's not like I have lots of petabytes lying around in my backyard, or that I'm on my way into work and I see a petabyte here and a petabyte there and don't even bother picking them up. Although we were joking on the walk over here that we haven't checked Twitter yet, so we might be up to a zettabyte by now; that's how fast this feature grows and that's the scale we're dealing with. So you don't need a petabyte of data, you don't need a trillion rows; any data set that is expensive to cache because of its size, whether that cost is time, money, memory, or the management overhead of duplicating the data into the in-memory cache, can benefit from this feature.

With Power BI we cache data into memory; what we're doing with this feature is caching the data at an aggregated level, which uses a tiny fraction of the memory and unlocks these massive data sets. It's going to transform interactive analysis over big data.

So how do we define an aggregation? Let's reveal how we actually define it; this is the meat and potatoes, Adam. We define the aggregation simply as another table in the Power BI data set. That's it. It's not like aggregations in Multidimensional, with all of those lattices and intersections; it's just a table, and it's that simple. And the fact that it's just a table matters for a couple of reasons.
Number one, it can be an import or a DirectQuery table. Even the aggregation table can be DirectQuery, so you can have a DirectQuery aggregation over a DirectQuery detail table. Why would you do that? Well, you could be managing and refreshing that aggregation in your data warehouse, you could be using incremental data-load processes to refresh it, and you could optimize it with a columnstore index. Without a single import table, you suddenly make DirectQuery over large data sets usable. So you have the flexibility to work with externally defined aggregations; that's one reason it's just another table. Another reason is that you can manage features like incremental refresh on it: if you have a big aggregation table that's going to cover a big chunk of the user requirements, you can set up incremental refresh on it.

So here I have this aggregation table, and currently everything in the model is DirectQuery. You can see this is built on the composite models feature, which lets me set the storage mode for each table; that's a game changer in terms of not being constrained to an all-DirectQuery or all-import model. This is the Sales Agg table, which is an aggregation of the Sales table. Let's pretend Sales has 10 billion rows and is expensive to cache. I've created this aggregation table, and again, it's just a table. It should be hidden; in fact it very much has to be hidden. Once the feature works with row-level security, the agg table won't even be addressable unless you're an admin or owner of the data set, or you're running in Desktop. So always keep the aggregation table hidden. In this case I could be performing the aggregation in the source data warehouse, and once I set the table to import I'd be able to set up incremental refresh on it.

Real quick: to build that aggregation table, we just do the normal Get Data, pull it in, and set up the relationships just like any other table? Yes, the normal Get Data and the normal relationships. And can we add measures to it as well? Typically the measures wouldn't live on the agg table itself, because the aggregation table is abstracted away from the users; it's not part of the logical, business-oriented model that they see, so they never see the agg table at all. But you can definitely have measures that refer to the detail table, and those measures will then leverage the aggregation table (see the sketch below). And in terms of building out the agg table, can we use Power Query to pull it in? Yes. Could I add a calculated column to it as well? If it's import, yes you can. Good to know.
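To make that concrete, here's a minimal sketch of such a measure, assuming a detail table named Sales with a SalesAmount column (the names are assumptions, not taken from the video):

    // Defined against the detail Sales table, not the hidden Sales Agg table.
    // When a query's granularity can be satisfied by the agg, the engine
    // redirects this SUM to the aggregation table automatically.
    Total Sales Amount = SUM ( Sales[SalesAmount] )

Report users only ever see measures like this; the redirection to the agg table is invisible to them.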
Here Desktop is telling me that I can't set up incremental refresh because the table isn't import, which makes sense; it has to be import before you can configure incremental refresh. But just to reiterate: you can set up incremental refresh so that a big aggregation table doesn't need to reload all of its data on every refresh, which is much more efficient; it's another memory- and refresh-efficiency feature for large models.

So we've got this table, and I could be performing the aggregation in the M expression or in the source. I now want to set up the aggregation mappings: the metadata that tells the Analysis Services engine that Power BI runs on that queries referring to the Sales table should get redirected to the Sales Agg table and be super fast. Does that make sense? Yes. It's a redirection. And if you think about it, this table is really just the sum of Sales grouped by customer key, date key, and product subcategory key; that's the non-empty cross join of those keys, which could be tiny. Sales could be a 10-billion-row table and this thing could have just a couple of million records, because it's aggregated.

OK, so time to reveal how I actually set up the mapping. Are you ready? This is the part that tripped me up. Here's the real deal: lo and behold, the Manage aggregations context menu. Real quick, which table do we do that on? We do it on the agg table itself. This is what tripped me up, because my thought was: I'm doing a redirect, so I should do it on the main table to redirect to the agg. But it's really the reverse. We had many design discussions about exactly that question; we went back and forth between different designs, and in the end we anchored it on the aggregation table. And once you're in the dialog, you can switch to other tables using the drop-down.

As you can see, it's listing all of the columns in the aggregation table. Some of them are foreign key columns; for now I'm not going to bother setting the foreign key columns, although you could set them for verbosity and clarity, and we'll come back to them shortly. I'm going to go straight to the actual aggregated values. So I've got SalesAmount Sum: that is, believe it or not, a Sum of Sales[SalesAmount]. And that's it, I've just set up that aggregation mapping. It's as simple as that; there are no aggregation lattices or node graphs, no black art of trying to figure out whether you're getting agg hits. It's simply a sum of SalesAmount.

Let's do the same for unit price, easy peasy lemon squeezy: a Sum of Sales[UnitPrice]. And now I'm going to add another aggregation on the same detail column, but this one's a Count; I'll show you why that's relevant shortly. So these mappings only take effect when I do the corresponding operation from a DAX perspective, right? Yes, the DAX aggregation functions will leverage these aggregation mappings, and as I'm about to show you, it's not a one-to-one relationship between the DAX aggregation functions and these mappings, so you actually get more bang for your buck in terms of coverage. I like that; we're all about bang for the buck here.

This next one is a special one, because it's the count of the rows in the detail table; this is what the COUNTROWS DAX function uses. That was one thing I wondered about: what's the difference between a count and a count of table rows? Right, and this one is almost a best practice to always have, because sometimes Power BI will generate COUNTROWS DAX queries without you explicitly asking for them. So that's how we did the one-trillion-row demo: we showed a COUNTROWS, and that was an explicitly defined COUNTROWS using exactly a mapping like this. But even the filter pane will sometimes show you the member count of all the filter members, and it will generate a COUNTROWS query just to populate the pane, so it's a good idea to have a count-of-rows mapping. So we'll go ahead and say this one is just a count of the rows of Sales, and we are done. It's as easy as that.
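To summarize what we just mapped, here's a rough sketch of what the Sales Agg table conceptually contains, written as a DAX query you could run against the detail model (in the video the agg table is loaded through Get Data, not DAX, and all of the table and column names here are assumptions):

    // The non-empty cross join of the three keys, with one column per mapping:
    // Sum of SalesAmount, Sum and Count of UnitPrice, and a count of Sales rows.
    EVALUATE
    SUMMARIZECOLUMNS (
        Sales[CustomerKey],
        Sales[OrderDateKey],
        'Product'[ProductSubcategoryKey],
        "SalesAmount Sum", SUM ( Sales[SalesAmount] ),
        "UnitPrice Sum",   SUM ( Sales[UnitPrice] ),
        "UnitPrice Count", COUNT ( Sales[UnitPrice] ),
        "Sales CountRows", COUNTROWS ( Sales )
    )

Each of those columns is then mapped in the Manage aggregations dialog to the corresponding summarization of the detail column.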
So we've set up the aggregations. What I'd like to do now is switch the aggregation table to the import storage mode, because right now everything is DirectQuery. You mentioned before that we can use DirectQuery or import, so why switch it over? Because if we want that blazing-fast performance: we've created this aggregation table, it's a lot smaller, so we can fit it into memory. We use the composite models feature to set the aggregation table to import, and we can leave the 10-billion-row detail fact table as DirectQuery. With DirectQuery there's always some overhead, and performance is never going to be nearly as good as import. So the main design point is that import is the shining star; if you can do import, do import, at least for the agg table. The VertiPaq in-memory cache is the fastest thing you can get, the fastest on the planet that I'm aware of.

So this table is currently DirectQuery; let's set it to import. I'd like to take a moment to explain this dialog, and I'm going to skim over the first couple of paragraphs. It's saying that it wants to set these related dimension tables to dual. So what on earth is dual? Composite models actually introduce three storage modes: import, DirectQuery, and dual. Dual means the table can flip back and forth between acting as import and acting as DirectQuery. Why is that useful? Think about it like this: if I submit a query that asks for the sum of sales at a granularity that can hit the agg, it gets redirected to the Sales Agg table, and if we're grouping by year, the Date dimension table acts as an import table for that group-by operation. So far so good; everything comes back from memory and is super fast. But if I submit a query that cannot hit the in-memory cache and still needs to group by date year, then Date flips to DirectQuery on the fly. That makes sense, because it's much more efficient to push that group-by operation down to the source system; we let the engine determine the best route. If Date were purely import and the query needed to get data from Sales in DirectQuery, the engine would have to do the join on the Power BI side, which would be super expensive; it could potentially pull all 10 billion rows over and do the join in Power BI. In reality it would optimize a bit more than pulling all 10 billion rows; it would generate an IN-clause WHERE filter, and from Power Query that's subject to the privacy levels and so on, but it's never going to be as fast as just pushing the group-by down to the source and letting the source do the work. That's what dual gives you.
So the dialog is saying it wants to set these tables to dual, and it has figured out the minimum set of dimension tables that need to be dual, which is super useful if you have a hundred-table dimensional model coming from a data warehouse and you just want to cache a couple of tables; you can let Desktop figure out the minimum set of dimension tables for you. So you wouldn't recommend setting everything to dual? No, only what's needed: only the tables that are on the one side of one-to-many relationships, either directly or indirectly.

So, for example, I'm going to go ahead and set these two, and I can set the storage mode by multi-selecting tables and changing them in one go. Is this the new designer that isn't out yet? Yes, the new modeling view, coming soon. I'll set two of the five tables to dual this way. And because the new designer isn't out yet, is this doable in the current designer? Yes, you do it from the report view, where you can get to the properties pane. Got it.

Having done that, I'll come back and set Sales Agg to import again, and as you can see, instead of listing five tables it has figured out that only these three still need to be set to dual, because the other two already are. In the current build of Desktop it is mandatory that these are set to dual; the only way you could have these relationships otherwise in the current build is by using many-to-many relationships. In a not-too-distant future release of Desktop it will be allowed, but it will still be highly advisable to set them to dual instead of leaving them as import, for the reasons we described earlier: to avoid doing the join on the Power BI side. That's the key thing to remember. So we'll go ahead and say OK to this; the warning is about the security consideration in relation to privacy levels, because of query folding when the IN-clause WHERE filter gets pushed down to the source. We let it refresh, and we're done. Now we're ready to actually start querying.

I'm going to use a phenomenal DAX querying tool called DAX Studio, which you may be aware of; go get it over at SQLBI, it's free. Here we have a query that just asks for the sum of sales amount grouped by calendar year. Naturally this should hit the in-memory cache, so let's find out. If we look at the Server Timings tab, we got a scan; the scan means it got data from the in-memory VertiPaq store, and it shows an xmSQL query. That alone tells me we hit the cache, because the Sales Agg table is import and the Sales table is DirectQuery. Additionally, DAX Studio has been kind enough to subscribe to the aggregate rewrite attempt extended event, which is also available in SQL Profiler, and here we get a nice little check mark saying "match found"; we can expand the details and see the exact slice of the sub-query that was submitted, which is super useful. This is a preview build of DAX Studio; as of the recording of this video it isn't publicly available yet, but it's coming soon. The current version is 2.7.4 and this feature will be in the 2.8 range, so stay tuned; if you're not using 2.8 or later you won't see it. It's super useful for aggregations, and with the XMLA endpoint that's coming soon you'll be able to point DAX Studio at the Power BI service and check whether your aggs are getting hit in the service as well.
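For reference, a minimal sketch of the kind of query we just ran in DAX Studio, assuming column names like 'Date'[CalendarYear] and Sales[SalesAmount] (the actual names in the demo model may differ):

    // Groups by year only, so the granularity is covered by the Sales Agg
    // table and the query is answered from the in-memory cache.
    EVALUATE
    SUMMARIZECOLUMNS (
        'Date'[CalendarYear],
        "Total Sales", SUM ( Sales[SalesAmount] )
    )

In Server Timings this shows up as a VertiPaq scan plus a "match found" rewrite event, rather than a SQL query against the source.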
Let's do another one. Here we have another query which again just asks for the sum of sales amount, and note that the query isn't even aware that the Sales Agg table exists; it's completely oblivious to the existence of the aggregation. All it knows is that queries are magically fast. This one got the data, but it was actually a DirectQuery: instead of a scan we see SQL, and here's the DirectQuery statement that was sent to the source. If we look at the rewrite event, it says the attempt failed. Why didn't this hit the cache? Because it's doing a sum of sales but this time grouping by the product name, and if you look back at the model, product will not hit the cache: the agg is at the product subcategory granularity, and a product subcategory can have many products, so the engine wouldn't know which product to attribute the aggregate to. Makes sense. So it dynamically switched to DirectQuery on the fly, and the Date table was included in that DirectQuery; one minute Date is part of an in-memory cache hit and the next it's part of a DirectQuery, because it's dual.

So far so good; we've got the fundamentals down. I'm going to run another query that just uses the COUNTROWS DAX function. Naturally this hits the agg as well, because we set up that count-of-table-rows mapping; we get a rewrite event with a match, on the fly. Now let's get to some interesting ones, because as I said, there's more DAX aggregation function coverage than you might expect. So, average: do you think AVERAGE is going to hit the cache? We didn't define a mapping for average, and how could you even define one when average is a non-additive aggregate? Well, let's run it. Whoa, we actually got an agg hit. What's going on? It took the average of unit price, and if you look at the aggregation mappings dialog again, remember that UnitPrice has both a Sum and a Count mapping, so the engine can do the math. Internally, for an average, the Analysis Services engine takes the sum divided by the count; it generated two sub-queries, figured out for each of them that it could hit the cache, and said yes, thank you very much. It's a beautiful, beautiful thing.
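A sketch of that average query, again with assumed names:

    // AVERAGE is non-additive, but the engine folds it into SUM / COUNT
    // sub-queries, and both UnitPrice Sum and UnitPrice Count are mapped,
    // so the whole thing is answered from the agg table.
    EVALUATE
    SUMMARIZECOLUMNS (
        'Date'[CalendarYear],
        "Avg Unit Price", AVERAGE ( Sales[UnitPrice] )
    )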
Now let's not get carried away. How on earth could DISTINCTCOUNT hit the agg? Surely not; how could you even aggregate a distinct count? It would have to be at the exact granularity. And sure enough, on the first run it's not in the cache: it submitted a DISTINCTCOUNT query to the source and the rewrite attempt failed, which makes complete sense. However, we can actually make distinct count hit the agg in this particular case, believe it or not. Do tell.

What we're doing here is taking a distinct count of customer key. As I said earlier, we left the foreign keys blank in the mappings because we don't need them: the engine can rely on the relationships. It's smart enough to say, I've got relationships, I've got everything I need, easy peasy, let's just hit the agg. But if I wanted to, for verbosity and clarity's sake, I could set up the GroupBy mappings on all of the foreign keys, so let's go ahead and do that: Sales[OrderDateKey] and Sales[CustomerKey] each get a GroupBy mapping. I'm going to show that the GroupBys are actually mandatory in some scenarios; this distinct count happens to be one of them, and another is the Hadoop-based big data models, which I'll show you in a second. You can also fill them in purely for clarity; it depends on your preference. If you feel it's more readable with the foreign keys filled in, you can do it, but as we just saw, it's not mandatory if you have a dimensional model. What I mean by that is that a dimensional model has relationships between the dimension tables and the fact tables, so the engine can rely on those relationships to figure out whether the agg can be used. This last one is actually not in the Sales table; we need to go through the Product table to get to the product subcategory key, so let's set that GroupBy as well.

Now we are telling the engine that customer key, which happens to be the column I'm doing a distinct count on, is flagged as a GroupBy attribute, and therefore that all of the customer keys are present in the agg table; otherwise I wouldn't be able to group by it. So the engine is smart enough to use that for the distinct count. Makes sense. Let's run it again, and lo and behold, we get a scan and a rewrite event with a match found. That is amazing.

A couple of quick things about distinct count. There's a threshold, somewhere between two and five million distinct values, beyond which there is a performance degradation; that still applies. But there are still many useful scenarios for distinct count. Number one: in this case, where my detail table is DirectQuery, it's going to be much faster to get the distinct count by hitting the in-memory cache. Number two: even if the detail table were cached (that's a subsequent release; currently the detail table has to be DirectQuery), a distinct count over a 10-billion-row cached table can be slow, whereas if you have only, say, three million distinct customer key values and you put those in an agg table, you get a performance benefit that way as well. But even in this simple case it's super useful: it just means I don't need to run a DirectQuery distinct count.
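A sketch of the distinct count that now hits the agg, with assumed names:

    // Works only because Sales[CustomerKey] is mapped as a GroupBy column in
    // the agg table, so every customer key is present at the agg granularity.
    EVALUATE
    SUMMARIZECOLUMNS (
        'Date'[CalendarYear],
        "Customers", DISTINCTCOUNT ( Sales[CustomerKey] )
    )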
And this one over here we would expect to work, because we've already seen that a sum of sales and COUNTROWS both hit the agg, and it did. The reason I'm showing it is that it's a somewhat more complex measure. Aggregations aren't just for simple sum measures; you can have complex measures, a sum divided by a count multiplied by a ratio, percent of parent, and so on. Ultimately pretty much everything gets folded down to a sum, min, max, or count; even the AVERAGE function does that, as you saw. Once it folds down to those, the engine can build a logical query plan and determine, for each of those sub-queries, whether it can hit the aggregation cache or not. The physical query plan may deviate from that, but it's the right way to think about it and the starting point.
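As an illustration, here's a hypothetical composite measure (not from the video) that would still be eligible for agg hits, because every piece of it folds down to a mapped sum or row count:

    // DIVIDE of a SUM by a COUNTROWS: the engine evaluates each sub-query
    // separately, and both have aggregation mappings (SalesAmount Sum and
    // the count of Sales rows), so both can be answered from Sales Agg.
    Avg Sales Amount per Row =
    DIVIDE ( SUM ( Sales[SalesAmount] ), COUNTROWS ( Sales ) )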
So that's the dimensional model; that's how you do it. Now there are a couple of things I'd like to quickly reiterate, because not a lot of people understand this dual-mode thing. Just think of it like this: dual mode allows Power BI to avoid doing the join on the Power BI side, which is way more efficient. If I had a couple of tables, Sales and Date, with a one-to-many relationship between them, in the current build of Desktop you can't even set that up with Date as import and Sales as DirectQuery; it has to be many-to-many. In a not-too-distant future release you will be able to set it up, but it won't be recommended, for the reason I described: the join would have to happen on the Power BI side, which is not very efficient. What we do instead is set the Date table to dual, and then when we do a sum of sales grouped by year, it pushes that group-by operation down to the source; and if I have a query that only touches the Date table, or touches another in-memory fact table grouped by a date attribute, it all gets returned from memory. Pretty simple. And these are the rules you have to align with for dual: basically, the table on the one side of a one-to-many relationship can always be dual, irrespective of what's on the many side, which is why it tends to be the dimension tables that you set to dual.

The last piece of the jigsaw we should cover is how to set this up for the Hadoop-based big data models, because they can be a little different. With big data sources like HDInsight Spark and Databricks, unlike a dimensional model where we can rely on the relationships between the dimension tables and the fact tables, what tends to happen is that yes, they can store petabytes of data, but they don't tend to deal with joins between petabyte-scale tables very well; no one's perfect. So you end up with these wide fact tables, with the relevant dimension attributes denormalized into the fact table, and in reality it's just stored as files with those attributes in the file header. Here we have the driver activity table; this is basically the trillion-row demo model, so this table has a trillion rows, about a quarter of a petabyte in Spark, and here we have the aggregation. In this case the aggregation is DirectQuery, but it would normally be import, and we could generate the aggregation data set using Spark if we wanted.

Now the key thing is what we had to do here. The agg table for the trillion-row demo has 1.68 billion rows and actually compresses quite well, to under 20 gigabytes, while the detail table is a trillion rows, a quarter of a petabyte. How on earth is that possible? Because we removed the high-cardinality columns, longitude and latitude, which is good advice for anything. Removing the high-cardinality columns reduces the number of rows dramatically: my highest-cardinality column left is date, with around five hundred values, and everything else is pretty much categorical; age might have fewer than a hundred values, and pretty much everything else is either a boolean or a categorical attribute with four or five values. So when I take the non-empty cross join across these attributes, the row count drops dramatically, and any query that groups by any of these attributes is going to hit the cache and be super fast.

To set up the mapping, I come to the agg table and choose Manage aggregations, and here, as with the distinct count, the GroupBy mappings are absolutely mandatory. Why? Because there are no relationships in this model at all, so I have to replicate all of the dimension attributes in the agg table and set up a GroupBy mapping for every single one of them. Then down here I set my aggregation functions: a Sum, and a count of table rows. The COUNTROWS function returning a trillion in the trillion-row demo is coming off this mapping right here. And that's it, we're done; that's all there is to it. So there's no relationship, and it doesn't even need one, because it can count on these GroupBys instead. That's pretty slick.
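For example, the trillion-row count from the demo is just a plain COUNTROWS query; a sketch, assuming the detail table is named 'Driver Activity':

    // Answered from the agg's count-of-table-rows mapping, so it returns a
    // trillion without ever scanning the trillion-row source table.
    EVALUATE
    ROW ( "Driver Activity Rows", COUNTROWS ( 'Driver Activity' ) )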
That works for either of those modeling styles, and you can have hybrid scenarios as well. For example, if you've got a denormalized date dimension table and an aggregation table that's at the month level, you don't want to have to normalize your date dimension out into two tables, one at the month grain and one at the day grain, just so you can create that relationship. You can simply replicate month, quarter, and year in the aggregation table, set up GroupBy mappings for month, quarter, and year, and rely on relationships for everything else. So it's pretty flexible, and it'll be interesting to see what people come up with.

The last thing I'll quickly mention, Adam, is the precedence field. If I go to this layout, I've actually got another aggregation table, and this one is import. They're both aggregations on the driver activity table, but this one will be considered first because it has a higher precedence than the other; the larger the number, the earlier it's considered. So if I have a query that aggregates by any of the columns in this table, it hits this aggregation; if I then drop in a column that isn't in this table but is in the other one, it hits that aggregation; and if I then drop in longitude and latitude, it goes all the way through to the source. This allows for these kinds of balanced architectures, where this table can be an in-memory cache that serves my executive dashboards, this one can be optimized in my data warehouse using columnstore indexes, and the detail provides drill-through all the way to the IoT data in the big data system. It's a beautiful thing. If you think about it, you can catch a very high percentage of your business intelligence queries this way, because most BI queries are aggregated; catch those with aggregations in memory, and the queries that don't get caught, which get sent through to the data warehouse or the big data system, are the more targeted, less aggregated ones that those systems deal with better. You end up with a much more balanced architecture and you utilize your resources much more efficiently.

All right guys, I know that was a lot of information thrown at you; hopefully you found it useful. Thank you, Christian, for hanging out. Let me know down below: did this make sense? If you have questions, we can try to follow up and maybe even update the documentation to clarify things a little more. If you liked this video, be sure to give it a big thumbs up, smash it if you so desire. If it's your first time here, hit that subscribe button. And as always, from both Patrick and myself, thank you so much for watching, keep being awesome, and we'll see you in the next video.
Info
Channel: Guy in a Cube
Views: 40,814
Rating: 4.9232111 out of 5
Keywords: Power BI Aggregations Feature, power bi aggregations, power bi, power bi advanced, power bi advanced tutorial, power bi aggregate, power bi aggregate table, power bi dax, power bi desktop, power bi features, power bi features in depth, power bi training, power bi tutorial, power bi tutorial for beginners, advanced power bi, dax power bi, how to learn power bi, learn power bi, learning power bi, big data, data analysis
Id: RdHSo43LkQg
Length: 32min 18sec (1938 seconds)
Published: Tue Oct 16 2018