Speeding up Power BI queries with aggregations

Video Statistics and Information

Captions
[Music] In this video you're going to look at how you can speed up queries using aggregations. If you're not familiar with aggregations, there's a blog post called "Keeping Big Data Big with Aggregations" on the PhantomDB site. Aggregations not only speed up performance but can actually help reduce the load on your underlying data source. So let's get to it.

For this demo we're going to use telemetry data from the How-Old website. If you're not familiar with this site, it lets you upload photos, and it uses an API to detect faces and try to determine the age and gender of the faces in the photos. The site lets you upload photos in three ways: you can choose photos from the site itself and it will go ahead and detect them, you can search using Bing, or you can use your own photos.

The data we collect includes the browser the user used to upload the photo; the type of submission (a search, an upload, or picking one of the existing photos); the operating system; the date it happened on; a user ID; the latitude and longitude determined by reverse IP lookup; and the number of faces detected in each photo. Each line in the fact table is one photo, so every time you upload a photo, or you look at one photo, that's an additional line in our fact table. So in order for me to get the number of photos that were actually submitted, all I need to do is count the number of rows in the fact table.

All the tables you currently see are in what we call DirectQuery mode. This means any queries we generate in Power BI reports will execute directly against the source data itself. The data for the How-Old site is in a data warehouse, sitting inside an Azure SQL Data Warehouse. So if I go to the report and I pull in the number of photos that were uploaded and drop it on here, this is actually executing a SQL query, and you can see it takes a few seconds. Remember, the way we get the number of photos is by actually counting the rows in the photo fact table. Everything I put on here runs a query: if I want to break this out by operating system, that's going to run a query, and every visual I put on the page is also going to run a query. Say I look at the number of faces by the different browsers that were used; now there are two different queries running. When I click on one visual, it filters the other visual, and that runs another query to filter the other one down. So every visual I put on this page is going to run a query, and every time I click, it's going to refresh all those visuals. A single user can generate dozens of queries, sometimes even more.

And that's just a single user. This database happens to be pretty fast, but I'm the only one on it. With a lot of visuals on the page, things will start to slow down, and as I add more and more users, it will get slower still. So just because the database can handle a single user, or you can get good speeds out of your database, doesn't mean you can't benefit from adding aggregations. Aggregations mean you do less work at the time the query runs, because you'll have done that work in advance by creating an aggregation table. So let's actually put some aggregations in here. I'm going to delete these visuals for now, so we can come back and recreate them and see how much faster they are.
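As a rough illustration of the kind of SQL one of these visuals generates in DirectQuery mode, a photo count broken out by operating system corresponds to a query along these lines; the table and column names are hypothetical stand-ins for the demo's actual schema:

    -- Hypothetical shape of the DirectQuery SQL behind a "photos by OS" visual.
    -- Every row in the fact table is one photo, so counting rows counts photos.
    SELECT
        os.OSName,
        COUNT(*) AS Photos
    FROM dbo.FactPhoto AS f
    JOIN dbo.DimOperatingSystem AS os
        ON f.OSID = os.OSID
    GROUP BY os.OSName;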
So we go back to our diagram view. Now, the biggest things that make this fact table so large are the really low-level event details. For example, we have a user ID on here, we have a session ID on here, and we have the event date/time, which is down to the second the photos were uploaded. If we just get rid of those and keep all our other dimensionality, we'll condense this data down into something pretty small.

So let's go ahead and start creating a new fact table; this is going to be our aggregation table. You can create the aggregation table many ways. The aggregation table is really just another table, stored at the aggregated level. It can be another DirectQuery table, it can be another in-memory table, and you can create it any number of ways, for example in your ETL process. In this case I'm going to use Power Query to create that table.

So let's do a Get Data and pick SQL Data Warehouse. Now, we have the different connectivity modes here. Everything else is in DirectQuery, and I could make the aggregation table DirectQuery as well, but in this case we want to make it an Import table: we're going to put it in memory, so the queries that run against it will be really fast. I have my photo fact table, the same one we imported before, but this time I'm going to edit it.

The first thing we're going to do is get rid of any columns we don't need, the things that tend to add a lot of grain to the fact table. We don't need the exact latitude and longitude; we want to keep the browser, the OS, the date it happened on, and things like that, and we'll go ahead and remove the rest. The remaining columns (the submission method ID, the browser ID, the OS ID, the date ID) we're going to make our group-by columns. For our measures, we're going to have a Count Rows: remember, the number of photos is the number of rows in the table, and since we're going to aggregate this, we can no longer count the rows at query time; we need to count them in advance. I'll call this Photos, and it's going to count rows. We'll add another aggregation here for face count, which will be a sum of face count; another one for image size, which will be a sum of image size; and one more for processing time, which will be a sum of processing time. So we're not losing any measures here. I'll say OK, and now we have our measures grouped by the different dimensions. I'll call this Photo Agg, and we'll go ahead and load it to the model.
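Whether you build it in Power Query like this or upstream in your ETL process, as mentioned above, the aggregation table amounts to a GROUP BY over the fact table. In SQL terms its contents would look something like the following, again with hypothetical names:

    -- One possible ETL-side equivalent of the Photo Agg table (hypothetical names).
    -- Dropping UserID, SessionID, and the per-second EventDateTime collapses the
    -- fact table to one row per combination of the remaining dimension keys.
    SELECT
        SubmissionMethodID,
        BrowserID,
        OSID,
        DateID,
        COUNT(*)            AS Photos,          -- row count stands in for photo count
        SUM(FaceCount)      AS FaceCount,
        SUM(ImageSize)      AS ImageSize,
        SUM(ProcessingTime) AS ProcessingTime
    FROM dbo.FactPhoto
    GROUP BY SubmissionMethodID, BrowserID, OSID, DateID;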
So what we've done here is we've actually created what's called a composite model, by combining data that's in DirectQuery with data that's in memory. I'm going to create a new diagram, which I'll call our Agg diagram, and you can see we have the new Photo Agg. Let me take this and put it right next to the photo fact so you can see the differences. Let's zoom in a little bit. We have the same measures (photos, face count, image size), but we've dropped out a lot of the details. Now let's bring in all of our related tables: we want browser, we want date, we want operating system, and we want the submission method. These tables are all stored in DirectQuery mode; you can see that by looking at the storage mode, whereas this table is in Import mode.

Now, when we run a query against an aggregation, we want to run that query either entirely within the aggregation, which in this case would be all in memory, or entirely at the source, all in DirectQuery mode. We don't want to do a mix of both, because that could hurt performance. With these dimension tables all being in DirectQuery, we can't answer queries against them from memory unless we put them in memory; but if they were only in memory, then we couldn't run them in DirectQuery mode. So we have a new mode called Dual, which means the table exists both in DirectQuery and in the in-memory cache. Let's select all these tables and change the storage mode to Dual. Now they're imported into memory, and they can be used either as DirectQuery or as cache.

Then we'll go ahead and create the relationships, so the Photo Agg is now using the same dimension tables as the regular fact table. If we go to the old table view, you can see Photo Agg, photo fact, and then our dimensions.

But we have to do one more step. If we want the agg table to be used wherever it can, instead of actually hitting the fact table, we have to register this as an aggregation. We need to set up the mappings between the photo fact and the agg table, and once we do that, any query against the photo fact table that can be answered by the agg table will be automatically rewritten and run against the agg table. Before we do that, we need to make sure the columns in the agg table that are going to be mapped to columns on the photo fact table are of the same data types. Image size here is a whole number, but image size there is a decimal number, so I'll take all the columns we just added (all the measure columns, really) and make them whole numbers. That's what's nice about this new diagram view: you can multi-select and change things all in one shot.

Now we can set up our aggs. I click on the table, say Manage Aggregations, and that brings up the aggregations window. We won't go into precedence much in this demo, but precedence lets you have multiple aggregations in the same model, and when more than one aggregation can be used to answer a query, you can decide which one is considered first by setting the precedence. Since we're using the same dimension tables, all we really need to do to set up these mappings is map the measure columns; the dimensions will work automatically because they're exactly the same tables, in Dual mode. Photos, if you remember, when we created it on the agg table, was the count of the rows in the table. So we're going to say that when someone does a count rows of the photo fact table, this column can be summed to get the answer: anytime you use the count rows function on the photo fact table, the query will be rewritten to sum this column instead. Face count will be a sum: anytime someone selects the face count column on the photo fact table, the face count column on the agg table will be summed instead. We'll do the same with image size, and finally processing time.

So now we have our agg mappings set up. I come back here, and now when I pull photos onto the canvas (remember it took a few seconds the first time), it runs against the agg table, and instead of taking a few seconds it's instant. It didn't actually run the DirectQuery against SQL; it hit the in-memory cache. I can put in the different dates, and you can see everything I do now is instant.
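To make the rewrite concrete, the "photos by OS" question from earlier changes shape roughly as follows once the mapping is in place (hypothetical names again). Instead of counting rows over the full fact table, the engine sums the pre-computed Photos column of the much smaller agg table, and since that table is imported here, the result comes straight from memory rather than from SQL at all:

    -- Before the mapping: COUNT(*) over dbo.FactPhoto in DirectQuery.
    -- After the mapping, a count-rows request is answered by the agg table:
    SELECT
        os.OSName,
        SUM(a.Photos) AS Photos
    FROM dbo.PhotoAgg AS a
    JOIN dbo.DimOperatingSystem AS os
        ON a.OSID = os.OSID
    GROUP BY os.OSName;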
If we fast-forward a little bit to a more finished report, you can see I have multiple visuals on this page, with a slicer and a slider. Every time I do something here, all these visuals update by hitting cache. I click on this one here, it filters down, hits cache; I filter this one down, it hits cache; I filter this one down, it hits cache. Right now the browsers, the photos, the day names, and the days themselves are all in memory.

But what if I want to drill down not just to the days, but to the actual minutes within those days, and the seconds within those days? I can do a drill-down: I've put the event date/time here under the day name. Now, the event date/time is not in cache; it's in DirectQuery. So when I do the drill-down here, filtered on Safari, what it just did was, for this day, for Safari, run a DirectQuery against SQL to bring back all the data. So I'm seamlessly transitioning between DirectQuery and cache, using cache whenever possible. At the same time it filtered down these other visuals to the same day I just drilled down on, but those were all answered by cache. Every time I change these, these are answered by cache, cache, but this one is still DirectQuery. If I go back here and filter, let's say, by face count, these are filtered by cache, cache, and again DirectQuery down here. Now, as I click on something down here, like a particular second, every visual on this page is running in DirectQuery.

We can do some more advanced things here. Let me pick a particular time frame; there were 12 photos here. I right-click this, and the way I set up this report, I can do a drill-through to a different tab with the latitudes and longitudes. We didn't put those in memory either; those are all in DirectQuery. I drill through to Locations, and now it's going to plot the map. The number of faces in that block was 22 faces, 12 photos, and 12 different users. Now, these two cards here refreshed from cache... actually, I take that back: since we had drilled down to a particular second, none of those could be answered from cache, so everything on this page was answered from DirectQuery. But you see the different locations, and you see the different stats about that time frame. Because we narrowed it down to such a specific slice of data, everything can actually run pretty fast here: we're not scanning the entire data set, trying to aggregate it, and bringing it back. So we get much faster response times for some pretty intense queries. We're doing a distinct count of our user IDs, and we're plotting latitudes and longitudes, but since it's become a very targeted query, SQL DW can handle it very well, and without having to put all that data in memory we can get some very fast results (a sketch of that query shape follows below).

So there you have a simple example of creating aggregations over a SQL data source. There are all kinds of other things you can do: more columns, more sources, much more complexity. It is really a powerful feature that can save you a lot of time and, if done properly, can actually reduce the total cost of ownership of your solution. So give it a try, and let me know any feedback in the comments below or any questions you may have. Thanks for joining.
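As a last sketch, the drill-down to a single second described above corresponds to a very narrow DirectQuery along these lines; the names and the timestamp are hypothetical, but the shape shows why even the cache misses stay fast, since SQL DW only has to scan a tiny slice of the fact table:

    -- Hypothetical shape of the drill-through query for one specific second:
    -- a highly selective filter, so SQL DW answers it quickly without any cache.
    SELECT
        f.Latitude,
        f.Longitude,
        COUNT(*)                 AS Photos,
        SUM(f.FaceCount)         AS Faces,
        COUNT(DISTINCT f.UserID) AS Users
    FROM dbo.FactPhoto AS f
    WHERE f.EventDateTime = '2019-01-15 10:23:41'   -- example timestamp, down to the second
    GROUP BY f.Latitude, f.Longitude;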
Info
Channel: PhantomDB
Views: 5,057
Keywords: PowerBI, Power BI, SSAS, Analysis Services
Id: uW7lh78CHQw
Length: 18min 5sec (1085 seconds)
Published: Wed Jan 30 2019