Common mistakes in big data models

Captions
I wasn't even watching the countdown! This is the SQLBI channel, and today we have the pleasure of hosting Adam Saxton and Patrick LeBlanc from Guy in a Cube, because we want to talk about the common issues in large data models in Power BI and Analysis Services. Alberto, introduce yourself.

Hi, I'm Alberto Ferrari. I work with Marco and I do all the stuff about DAX. That's enough about me.

Just for the very few people who don't know Adam and Patrick: do you want to introduce yourselves very quickly? Adam Saxton, just a guy in a cube doing the work. We have a small YouTube channel where we do videos, and in our day job we work with large enterprise customers, so we're going to try to bring that experience to this stream. Patrick LeBlanc: what that guy said, that guy right there. This is the worst question you can ever ask a geek: introduce yourself.

Let me introduce the format. We are live on several channels: YouTube and LinkedIn, where we have the largest audiences, plus Twitch and Facebook, and we can see the questions from everybody. The idea of this session is to discuss the common issues we find in large data models. Alberto and I have questions for Patrick and Adam, they will probably ask us something too, and we will look at the questions coming from the various feeds and try to answer them live rather than in the chat.

Marco, one suggestion: at least on the YouTube side the chat is very lively, as I would normally expect, and I see Alex Powers in there. If you want to borrow a feature from the Guy in a Cube live stream, make Alex a moderator so he can drop links. Okay, just a second... sorry, I got distracted looking for the YouTube tab. All right, that's done; let's start.

We actually made a detailed plan about what to say, and the first question is: Patrick, what is the most common, most important problem you see in large data models, models that have billions of rows? And before that: what do you even consider a large data model?

Ten thousand rows? One million? What gets me is when customers call up and say, "we're trying to work at large scale, we're struggling with the size of the model, we've got a P3 or a really large Analysis Services instance, and we've got 300 million rows," and I'm thinking: all right, we can help you. It's data modeling, Marco. For me it's simply that people don't understand that you need a curated data model. Adam and I work with big customers all the time, and like he said, some people come in with a hundred million rows and say they can't get it into memory.
But then we work with customers that have billions of rows, and they do get it into memory, because their fact tables are what I like to call skinny: really narrow. They have done a good job of designing the data model, and that is probably the number one thing we see.

One more thing: people always think they need all the data. When we talk about pulling in only a certain amount of data, people think about columns: "I don't need this column, I don't need that column." But there is also a horizontal slice to consider, especially from a time perspective. People think they need 30 years' worth of data. Nobody is going to look at 30 years' worth of data, my friend; you only need about five.

As an example, there is a customer I am working with who got a model up to four billion rows imported on a Premium P2, and the model size was about 15 GB. It was a very clean model: keep it clean, keep it simple. Their fact table had only about 12 columns, and they were all numbers.

This raises another question, because it is clear that most of the time the problem is the data model. But why is it the data model? Is it that developers do not know data modeling, never studied Kimball, or want to try something different? Or is it the business not asking the proper questions? Since it is so common, there should be a root cause.

Alberto, I learned something this week that I did not know: other BI tools that are not like Power BI actually encourage these flat tables, where you don't have to worry about cardinality. But what I also learned is that they run into the same performance problems. So I agree that maybe people don't know these skills, but sometimes you have to blame the technologies, because you become a product of what you use. Think about Excel: Excel hell. Everybody uses Excel everywhere — I remember Marco saying Excel is the best data entry tool — so we have all these silos that the people on this panel have to figure out how to consolidate. It's not that people don't know modeling; they simply never had to learn it, they were never posed with these problems. Now Power BI is everywhere, people are downloading it and using it, the data is scaling, and they have to learn this new way of doing things.

The other thing I'll add is that it's not just the BI tools. Some of the data sources you commonly see with big data — Hadoop, Redshift, some NoSQL implementation — lean toward that large single-table structure where nothing is broken out, and then when you put Power BI on top of that with a star schema of some sort, the skill sets typically aren't there to accomplish it. It's a tough one.
What I find strange in this discussion is that you would imagine a large company has a team that creates data marts and data warehouses and knows everything about Kimball dimensional modeling. It always surprises me when I see this happen, because you find a company that probably already has a data warehouse, and yet they try to do analysis ignoring what they already have, or the experience they have accumulated over the years. Is this something you find in companies too?

There are other aspects as well, depending on the ownership of the data. Some teams say: I just own the infrastructure and we let the business own the data. I own Power BI, but I don't know anything about the data; I manage the data source and the infrastructure from the IT side. That makes it challenging when you try to go in and help, because the team you are interacting with is not the business, or they tell you the business doesn't even know how to do that, so they can't make changes to the data. It's a struggle.

Marco, I remember when I joined the CAT team I was so excited, because I thought my job was going to be easy: I would be working with the most strategic customers, these people are so smart, I'd just be talking about Power BI features. On my very first engagement on the CAT team I looked at their data model, and it wasn't a data model — and I'm talking about a big Fortune-level company. You just think: how did this happen? So yes, sometimes they abandon what they have, but a lot of times we assume these big companies have all the skills, and a lot of them don't, or they don't follow the best practices, or the skills are not in the right team to get the job done.

We probably didn't clarify this: Adam and Patrick work at Microsoft, they are Microsoft employees, whereas Alberto and I have worked with Microsoft for many, many years but we are not employees, so our perspective is from outside. From outside you think: oh, a big company, they always do everything right, they think about every detail. Ninety percent of the time that's true, but sometimes they miss something, like anyone; bigger companies can make mistakes the same way smaller companies do.

Coming back to the initial question: when Alberto and I think about big data models, we think about more than one or two billion rows. I think you can put the threshold around one billion rows, because you can make a model slow even with a few million rows, but you really need to do something dirty to achieve that. I saw someone make it really slow with eight million rows, and that was impressive. I agree — I tried with 125 million and I couldn't do it, so kudos to them.

So, do large companies usually have this data volume, or do they get into trouble when they are at just 10 or 100 million rows, in your experience? What was the question again, Alberto? The question is: when large companies call you, there is a problem, right? You don't get the call that says, "man, Power BI is amazing, thanks for everything you do." Those aren't the calls we get.
No, it's always people who are frustrated — I won't say angry, though some of them are — because something doesn't work, and it's usually performance. So, do those companies have problems with what we consider a big data model, with one, two, ten billion rows, or do they get in trouble when they have just 20 million rows?

Absolutely, Marco: millions of rows. They get in trouble with millions of rows, because they are still not following any of the design best practices. You look at their DAX and it's materializing hundreds of millions of rows because they're doing a Cartesian product or something like that. A lot of times it's small, baby data they get in trouble with.

One example I'll give — and when we give customer examples, know that if we said the company name you would absolutely recognize the brand; these are big companies across the world — we got into a performance issue and asked: how many rows are we working with here? Eleven million. What? And it was 100% DirectQuery against SQL. Okay, but still, eleven million rows is nothing — what are you doing? Oh, it's a view with about 40 joins underneath, backed by about five billion rows behind the scenes. Okay, now we're talking. Why don't we just import that? "Well, no, we need to make sure it's up to date." How often do you need to update it? "Twice a day." So why don't we import it?

Before we move to DirectQuery, which is a very interesting topic, I'm looking at the many questions and there is one related to what we were just discussing. We talk about billions of rows, but sometimes the problem is that we have just a few million rows in a dimension. Is this something you also experience? Alberto and I see a lot of medium-sized companies that struggle with exactly this.

Yes. In the four-billion-row example I gave, I think the dimension table was around 30 million rows. In that case the approach they took was not to import that table, because dimensions typically have a lot of strings, so higher cardinality: it doesn't compress as well and it takes a lot more space. They came up with an approach — I don't know if you're going to go there, Patrick — where they built aggregations on the dimension itself, so they had a smaller set of dimensions at different grains to try to hit what they needed.

Yes. They had this dimension table — I don't remember exactly how many rows — so we had our fact table, which was imported, believe it or not, and then we had our dimension table, which was in DirectQuery.
Phil has a term for this: the shadow model — shadow dimensions. So I was able to put these aggregations in. Let's say it's Product, like Adventure Works: you have product, subcategory, category, color. The product grain was around 40 million rows, so we put an aggregation at the subcategory, category and color level, imported, so you can use those dynamically in your slicers and filters; but if users went down to the product grain, they would hit DirectQuery, and we wouldn't have to import that large table.

Now, Marco, Alberto: it is possible to get a table like that into memory. People don't think about the fact that you can use incremental refresh on a dimension table, if it has some kind of date column and some partitioning; and there is the polling (detect data changes) expression on incremental refresh — I wrote a little pattern where it intelligently refreshes those partitions. So there is a way.

Going back to the question still on screen: if you have a dimension with 40 million rows and a fact table with 50 million rows, which one do you call the dimension and which one the fact table? It's basically the same table. Even though this is totally against my latest article: just join them into one fat table and that's it. Okay, but what if we have a hundred string columns on the dimension table — do we want all of that on the fact table? Look at the numbers, though: they're pretty close. I don't know what's duplicated and what's not, so I would say it depends, but it's something to look at.

I think the question is really about how to fit both the dimension and the fact table in memory, and the problem is what Adam said at the beginning: many strings require a lot of space, because unfortunately the dictionary — the list of unique values in a column — is not compressed once it is in memory, whereas the data itself is compressed, and it compresses well only when the same value appears many times in a column. Here we have a number of techniques: import only the data you actually need to browse the model, and keep in DirectQuery the data you only need for drill-through and detail-level analysis.

Now imagine another scenario: two billion rows and 20 million customers — it can happen. At that point the granularity of the customer dimension is very high, and it slows down any query that groups by a customer attribute, even one as coarse as state or country. The technique is to create another dimension with a much smaller number of rows, which groups customers by, for example, country or state. You improve performance because you reduce the size of the dimension in terms of number of rows instead of trying to aggregate the full detail at query time — a very effective optimization for data imported in memory.
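As a concrete picture of the reduced-grain dimension just described, a grouping table can be derived from the large customer dimension. This is only a minimal sketch with hypothetical names (Customer, CountryRegion, StateProvince), shown as a DAX calculated table for brevity; in a real model you would more likely build it upstream in the data warehouse or in Power Query:

```
-- Hypothetical reduced-grain dimension: one row per country/state
-- combination instead of one row per customer.
Customer Geography =
SUMMARIZE (
    Customer,
    Customer[CountryRegion],
    Customer[StateProvince]
)
```

The fact table then needs a corresponding geography key (or another way to be filtered by this table), which is exactly the extra modeling work mentioned next.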
Creating this second dimension introduces its own modeling problems, which we discussed in the book, in some videos and in some articles, but these techniques are, I think, necessary when you have at least two or maybe four million rows in a dimension, depending on the hardware, for imported data.

The other thing I'll add to that question — and I saw comments about it on the YouTube side — is the SKU they were trying to use: an S1, which is a baby SKU. We see this on the Premium side too: "we want to get the most out of it with the least memory footprint." If that's the goal, at some point you are going to bump up against physics: it's just not going to fit. If you want to avoid moving to the next SKU for cost reasons, you have to get creative and figure out how to increase the grain of the data so the summary table fits in memory, and then maybe kick the detail back to DirectQuery, understanding that that path will be slower when you hit it. You can do that with Analysis Services; it's not the actual aggregation feature in Power BI, so you have to customize it a little. Alex has dropped Phil Seamark's aggregation blog series in the chat: he has some awesome approaches — horizontal aggregations, filtered aggregations — things you can play with, doing some switching from a DAX perspective so you know where your boundaries are (there is a small sketch of that switching pattern right after this exchange).

Related to this: when you import data, converting strings to integers can be an idea, but that solves the problem only for the key column; if you have other attributes... Well, I actually just had an example of the conversion paying off: they created a composite key for the aggregate table, and that is where we got the biggest gain, because that one column was a string and the dictionary on that column was larger than the actual data. So yes, converting it to an integer gets rid of the dictionary and massively reduces the size. In that particular case, though, I told them: can you just get rid of that column altogether? Do we even need it?

It can be a solution, but you have to understand where you are spending memory. VertiPaq Analyzer, in DAX Studio or other tools, gives you exactly that analysis. If you realize the cost is spread across many columns, converting one key solves the problem only for one of them — probably the most important one — but you still have to consider the others.

One thing I'm going to do here is advertise your book. If you don't have it: at the back, around page 560-570, there is a whole chapter on how the VertiPaq engine works and how processing works, plus material on how the compression algorithms work — not in full detail, but enough to understand the concepts: hash encoding versus value encoding, how that relates to processing, to model size, and to the data structures you have.
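Here is a minimal sketch of the DAX-side switching mentioned above, where a summary (aggregation) table is read whenever the report does not filter below a certain grain, and the detail table — possibly in DirectQuery — is read otherwise. Table and column names (Sales, 'Sales Agg', 'Product'[Product Name]) are hypothetical, and real implementations such as Phil Seamark's filtered aggregations are considerably more elaborate:

```
-- Use the imported aggregation table unless the user has filtered
-- down to the product grain; only then fall back to the detail table.
Sales Amount :=
IF (
    ISFILTERED ( 'Product'[Product Name] ),
    SUM ( Sales[Sales Amount] ),        -- detail grain (e.g. DirectQuery)
    SUM ( 'Sales Agg'[Sales Amount] )   -- summary grain (imported)
)
```

A production version would typically check every column below the aggregation grain (or use ISCROSSFILTERED) before deciding which table to read.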
I cannot stress how important that is for the work we do at big-data scale. Understanding those key concepts of how the engine actually works helps you with debugging and with knowing where the benefit comes from, how you can push things further. Anyone doing big data needs to have that book; if they don't have that book, they're not doing big data, bottom line. And a star schema — or you're not doing big data.

I found another comment that is interesting. We are still on dimensions: what happens if I have many sentences — long text phrases — in a dimension? I have actually seen this problem: imagine you have the list of book titles on Amazon, or a catalog of movies, and you want to do a free-text search over it. It's a problem for several reasons: first, the dimension in memory is very large, and the DAX functions you use for text search are not that fast. We don't have indexes, we don't have a full-text search that works well, we don't have regular expressions either. The real problem is performance, and I have seen some customers start exploring the idea of keeping the dimension in DirectQuery while the fact table stays in memory. It sounds crazy, but there are a few border cases where this can be an advantage. I don't think it's a good idea in general, but for this specific kind of search it might be.

What would your thoughts be, Marco, on this variation: if we are thinking of leaving the dimension in DirectQuery — and I know it depends on the structure of the dimension — what if there are interesting enough attributes that we could leave the full text string off the main dimension and snowflake it into a separate dimension that holds only the full string, and leave only that one in DirectQuery?

The biggest issue with this approach is: how many items do you get back from your selection? If you get 10 or 100 books, or videos, or movies, it's fine. The problem is when you get back a list of one million rows, because at that point, even though they are just integers, you have to move a few megabytes of data from one storage engine to the other, and then that list is uncompressed and has to be applied as a filter over compressed data. That is slow: it takes seconds and seconds if you're lucky, otherwise more. So I would say it depends: if you know the selection always returns a small number of rows, it could be an option.

The other reason it may be beneficial is: what are you actually reporting on? If the main part of the report is at the summary level, you don't even need that column most of the time; only if you go into the detail area, where maybe you have 50 to 100 results, do you hit it, and yes, that part will be slower. But yes, it depends.
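For context on why this kind of text search is expensive, the only tools available in DAX are plain string functions scanned over every distinct value. A hedged sketch, using a hypothetical 'Product'[Product Name] column, of the query shape a search box typically produces:

```
-- Every candidate string in the column is evaluated; there is no index
-- or full-text support, so cost grows with the number of distinct values.
EVALUATE
FILTER (
    VALUES ( 'Product'[Product Name] ),
    CONTAINSSTRING ( 'Product'[Product Name], "adventure" )
)
```

CONTAINSSTRING (or SEARCH) has to test every distinct string, which is why a dimension with millions of long text values hurts both memory and query time.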
Some insight, too — and I'm sure you and Alberto do this as well: when we get into customer situations, Patrick and I, and Alex, and Kasper, we just sit there and brainstorm. Okay, based on what we know, what are some things we can try? A lot of it is: let's try it, compare the results and see what wins. That didn't pan out? Let's go try something else.

Okay, moving on, otherwise we'll talk only about large dimensions — I could go on for hours. We already touched on DirectQuery, because we started talking about keeping data outside the model. Going back to the main title of today's session: when, and how often, is DirectQuery the problem? How often do you see people struggling with performance issues in DirectQuery, and what is the common solution, besides the data model?

All the time. Well, not literally all the time, but by the time they reach us, if DirectQuery is involved, they are having problems. Every time I talk to someone, regardless of the backing storage engine — it could be SQL Server, Snowflake, BigQuery, Redshift, it doesn't matter — by the time it gets to us, they are having some type of problem with DirectQuery. And the funny thing is they say, "Power BI is slow." It's not Power BI, guys. "Oh, it's absolutely Power BI." Open Performance Analyzer, look at the query, take the SQL, run it against the source. Oh, it took 45 seconds to return. I told you it wasn't Power BI.

I will caveat that with: auto-generated T-SQL is always the best T-SQL. Absolutely — it's the only T-SQL you should use, auto-generated.

A lot of times, though, it also comes down to the data structure. Even though we are talking about modeling in Power BI, the same applies to your data warehouse: if it is structured in some weird way, you are complicating the data structure, which ends up in complicated DAX, which in a DirectQuery situation can result in weird, complicated joins on the source side, and that slows everything down.

So how do you fix it — a composite model? And the natural next question: how often do people use DirectQuery even though they should not? They have no reason at all to use it, they could live with import mode, but they use it anyway.

I have exactly this going on right now with one customer, and I do think their argument is understandable: "I already have a repository of my data. I'm spending a ton of money and a ton of time loading this data warehouse that everybody told me I need. Now you're telling me I have to take that data and move it again into this model, when I have this DirectQuery feature that should let me query it in place?" I get it. Why do I need to keep my data in two places? That's not fair, because I already have it over here. But the other common scenario is: they go DirectQuery, and then I ask, how often do you load your data warehouse? "Once a day." But you need the freshest data? Okay — it only loads once a day, so why are you doing DirectQuery?
Let me bring in a question I have seen: with 10 million records, should I use DirectQuery or import? I'll start with "it depends," but the general idea is that DirectQuery probably has the wrong name. Let's call it "no cache." If you call it "no cache," people say: oh no, stay away, I want the cache! I want cache! Not that kind of cash — something is getting lost in translation here. When I have the data in import mode, yes, it is a copy of the data, but it is also a sort of cache of the data held in memory. So if I call one mode "cache" and the other "no cache," everybody says they want the cache mode. And that's the point: the choice is not really about size.

There are two cases where DirectQuery is an easy choice. One: I have so much data that I cannot import it into memory — we're talking about 100 billion rows; if you have 100 billion rows and you really need that granularity, you have to stay in DirectQuery, no matter what. Two: if you have a thousand rows, you can use DirectQuery as well, because at that size there is no difference, it doesn't matter.

What about 10 million, since that was the question? With 10 million rows I import the data, no-brainer. Everything in the middle, I try to import, and only if I have a problem with import do I start asking whether DirectQuery could be an option. I never start by assuming DirectQuery is the choice; DirectQuery is the last resort.

The thing I'm seeing with a bunch of enterprise customers, though, is a lot of scrutiny on compliance and regulation, especially in certain segments — the financial sector, government — where they say: we are cloud-averse, we have to keep the data on-premises, so we don't want to import it. Or: a lot of customers are spending a lot of money to host this large amount of data — Synapse, Snowflake, SAP HANA, any of these infrastructures are very costly — and they want to keep costs down, and the flip side of that is "I want to use the thing I'm paying for," so they want to leave it in DirectQuery. I've heard that plenty of times as well. So there are different reasons customers come in saying "I need to use DirectQuery," and then it becomes an education exercise on our end: that's fine, but here are the pros and cons, and here is what you need to do to get your data even remotely ready for DirectQuery, because right now it is not.

I always try to find simple examples. DirectQuery versus import mode, at my age, is like glasses: I have one pair of glasses, but that's not enough; I need two pairs, one for near and one for far. You have your data warehouse, and it works fine, but you also need another set of data — the near glasses — that you use for a different purpose. It doesn't work if I use the wrong glasses; I just cannot see.

That's great, but I think that whenever people start talking about DirectQuery, it's a negotiation, especially if the data is updated frequently. We're not talking about data warehouse data that changes once a day.
If I have 10 million rows sitting in some repository, let's go with the cache — there's no discussion, no negotiation. But when we're talking about data that updates really frequently, I don't just give in to the end user and say we're going to DirectQuery. Maybe we can come up with a negotiation: okay, I can't give you real time — and there are very few scenarios that truly require real time — but maybe I can refresh three, four, five times a day, so the data stays cached. I think that as data modelers we give in to our report consumers, or whoever the end users are, too quickly. We say "okay, you can have DirectQuery," and then we're up at night trying to figure out how to make it faster, when we could just grow a spine and say: look, you can't have DirectQuery, we're going to come up with something better.

I brought up this question because it's related: clearly DirectQuery performance depends on the data source, and the better the remote data source performs, the better. For the kind of queries we run — mainly aggregations over large amounts of data — having a columnstore index on SQL Server, or another columnar storage engine for the relational database, can help. But please do not expect it to be faster than importing the data. If you need a reference point for performance: import is the baseline, and anything else is slower. That is the right expectation. It is not "I'll go to whatever engine because it is much faster" — no, it will be slower, because at some point you have to move data between two different servers, and depending on the query, the data coming back is not the final result: you get a partial result that has to be merged by the formula engine in the Tabular engine, and that takes time, so overall performance goes down. Optimizing the source is a good idea, of course, but keep the right expectation. Alberto, anything to add on this? You're right — I have nothing to add, you're right.

I do want to follow up on one thing, Patrick: I think a lot of people use DirectQuery with very large tables because they want real time and they also have multi-billion-row tables. But this goes back to the earlier point: if you have billions of rows, you have maybe ten years of data — why on earth do you need real time? Real time is for comparing yesterday, week-to-date, a week ago; not for comparing what happened at 9:00 a.m. ten years ago with what is happening right now. So either you have a lot of data, and then you go for import because you can pre-aggregate, or you have a very small data set that needs to be updated very frequently. You need to make a choice: either a lot of data, or real time. Trying to get both at the same time is going to be a failure.

Agreed. And some of that comes back to understanding what you are actually reporting on. If there is separation between the folks building the solution and the people using it: go talk to your people, please, and find out what they report on, instead of designing for the worst-case scenario. Don't do that. Talk to your people.
To continue that conversation, here is an interesting question — I already have an idea, but I want to hear your feedback first: if you have a data warehouse that updates every five minutes, would you use DirectQuery or import for that scenario?

It would be a composite model, maybe. This is where I start looking at things like Phil's blog posts on filtered aggregations: if I have a small window — say the last hour, or the last 24 hours — leave that in DirectQuery, and the rest of it I can import and cache, partition it in certain ways, and get creative. Are there ways to accomplish something that satisfies the business requirement but still gives us flexibility in how we work with the data?

But before we go to that level of complexity — and it is complex — he just said it: go talk to your people. Just because I'm loading data every five minutes does not necessarily mean they need to read data every five minutes. I built a data warehouse like this a long time ago, almost 20 years ago now, capturing data every minute from web servers attached to billboards, and when we went and talked to the consumers, they said: oh no, we're just going to look at this every hour or every two hours. Just because you write it every five minutes doesn't mean they need it every five minutes. So yes, you can do these complex filtered aggregations and all of that, but the first thing to do, before the complexity, is to get out of your seat, IT people, and go talk to the people who will consume these reports, because maybe they don't need it to refresh that fast.

I would add one thing: if you have a data warehouse that updates every five minutes, I'm pretty sure there is a lot of code behind it — some database developer is doing something. Spend a little bit of those developers' time and, using Analysis Services or the XMLA endpoint in Power BI, use incremental refresh: if the new data arrives every five minutes, you can simply add the rows to an existing partition. It takes a fraction of the time of a full refresh, you get pretty much the same updates you have in the data warehouse, and you keep the performance of import mode. The only price to pay is a latency of maybe two or three minutes — the time to transfer the data, add the rows and complete the refresh. The problem is that there is no wizard for this: you have to take the books and look at examples of how to write it, maybe in PowerShell, because you can write a PowerShell script that does it and schedule it on an Azure Function. So it is not no-code, but it is a minimal amount of code, and if you understand how the architecture works there is a way to do it that has worked for many, many years; and now, because Power BI has the XMLA endpoint, we can use the same feature we have used with Analysis Services for all those years.
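One way to picture the "small DirectQuery window plus imported history" idea mentioned earlier in this exchange is a measure that simply adds the two tables together. This is a hedged sketch with hypothetical table names ('Sales History' imported, 'Sales Today' in DirectQuery inside a composite model, non-overlapping in time, sharing the same dimensions); real implementations such as Phil Seamark's filtered aggregations handle the switching far more carefully:

```
-- Imported history plus a small DirectQuery table holding only
-- the most recent rows; the two tables must not overlap in time.
Sales Amount :=
SUMX ( 'Sales History', 'Sales History'[Quantity] * 'Sales History'[Net Price] )
    + SUMX ( 'Sales Today', 'Sales Today'[Quantity] * 'Sales Today'[Net Price] )
```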
That brings something to mind that I've been noticing as well. It's almost like there are two different sets of folks: people who have been working with Analysis Services for a long time, and — now that Power BI is big enough — people who are used to working with Power BI Desktop and the GUI. The latter ask: okay, I'm working with this big data, incremental refresh isn't doing what I need, how do I get this done through Power BI Desktop? And the answer is: at this scale, if the GUI dialogs aren't working for you, you are entering a pro-dev model, and you have to use pro-dev tools to get the level of control you need for that scale of data. If you can do it all in Power BI Desktop, more power to you, that's great; but at some level you are going to have to use a tool like Tabular Editor, you are going to have to use tools like DAX Studio, write some scripting, and possibly integrate with ETL processes. I've heard people trying to avoid that specifically, and all I can say is: you've got to roll your sleeves up.

Just to cover some other topics: we talked about the storage modes, DirectQuery and VertiPaq import mode, and we talked about dimensions and data modeling in general. Staying on the data model for a few more minutes before we move to the reporting area: what about problems with relationships? How often do you see that the problem — it is always related to the data — is an improper use of relationships causing a big issue in big data models?

We just went through this with one customer. It's not as common as DirectQuery or poor data modeling, but we do see it: models riddled with bi-directional relationships, and then there is ambiguity and things like that — we see tons of that. I've seen one where the relationship had so many key violations that the keys simply didn't exist on the other side. But it's not as common.

It depends. Again, this comes down to people: when you over-complicate your model, you end up in wacky situations, especially with relationships. You start introducing bi-directional filters, or you take approaches like "I need an inactive relationship to avoid some circular reference," when the way it was built could have been avoided altogether with a direct relationship. To me, relationship issues usually crop up when the model is over-complicated, either because people are working with a data warehouse they don't control and they are compensating, or because they never really spent the time to model their data.

And sometimes they just don't know. I saw one model with a single fact table and eight date tables — one date table for each date column on the fact table. What are you doing? Why do you have that, when you could just have one date table and do role-playing with inactive relationships?
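For anyone unfamiliar with the role-playing pattern referred to here: a single Date table is related to the fact table once with an active relationship (say, on the order date) and again with inactive relationships (say, on the ship date), and a measure activates the inactive one on demand. A minimal sketch with hypothetical column names:

```
-- Reuse one Date table over the inactive ShipDate relationship
-- instead of creating a second date table.
Sales Amount by Ship Date :=
CALCULATE (
    [Sales Amount],
    USERELATIONSHIP ( Sales[ShipDateKey], 'Date'[DateKey] )
)
```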
So yes, a lot of times they simply don't know.

I have an interesting question related to the relationship topic: "I'm importing two tables, a dimension and a fact table, and referential integrity violations exist between them." That means — for those who don't know — there is some value in the fact table that references a value that does not exist in the dimension. This generates what is called a referential integrity violation, which is not a hard error — it is a condition a Tabular database tolerates — but it can make queries more expensive. The question continues: "the full table is being scanned for every query; how can fixing the violations optimize the internal query?"

Fix the violations. Fix the relationships, fix the violations. Why would you have values in your fact table that don't exist in the dimension? I don't know what else to do with that — go fix the violations.

Alberto, to me it seems strange: why should this change anything for an imported table? I can understand it for a mixed or composite model, but for an imported table I don't think it's an issue by itself. It's also not clear what "the dimension is being scanned for every query" means, because you cannot really check what the engine did at that level; probably there is a scan of the table, but still. I would also go back and ask: you say it's being scanned — are you actually seeing a performance problem? Because if not, why do we care? Usually there are side conditions where this could matter, but not in a simple query like the one we are discussing.

Marco, a few seconds ago you said that a referential integrity violation is not a problem, just a regular state for a Tabular model. I would not say that at all. To me a referential integrity violation is a huge problem: if you have one, there is a serious flaw in your data model, because you have data that cannot be distinguished in any way unless you go and look at the individual key values. If you have even a single referential integrity violation, your model is flawed; there is a problem somewhere and you need to fix it. The engine may be nice enough to work with it and still show numbers, but referential integrity should be maintained at all costs, and everybody should treat a violation as a big problem.

I bet it's some type of big data system that doesn't enforce referential integrity and doesn't allow foreign key constraints, and somebody is loading it with no checks to validate the relationships in the data source.

Based on what you just said, Patrick, I would try to plot whether there is a correlation between the words "problem," "Power BI" and "big data" appearing in the same sentence.
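As a side note on finding these violations: in an imported model you can list the orphan keys directly with a DAX query, for example in DAX Studio. A hedged sketch with hypothetical table and column names (Sales[ProductKey] against 'Product'[ProductKey]):

```
-- Returns the fact-table keys that have no matching row in the dimension.
EVALUATE
EXCEPT (
    DISTINCT ( Sales[ProductKey] ),
    DISTINCT ( 'Product'[ProductKey] )
)
```

An empty result means referential integrity holds for that relationship; any rows returned are the values to fix in the source.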
Before moving to the reporting side of big data models, I have another question that looks at the problem from a different perspective. How often these days do you see large imported models whose problems come from columns holding a very large amount of data? In this case we are talking about a column with very high cardinality for a double, a floating-point value, and the question is what the optimization could be. The specific optimization mentioned — splitting the value into two columns — may or may not be a good idea; it depends, as usual, and it is dangerous. But before answering that specifically: what is your experience in general with this kind of problem?

Lots — it's a common thing. Well, it's common in the sense that we do see models with that; the answers to it vary. In terms of what's done about it, the first thing — I don't know if you've noticed the trend yet — is: let's go talk to the people consuming the report and see how many decimal places they really need. Do I really need to keep this as a double? The more decimal places, the higher the cardinality and the larger the data set. So the first thing is: do they need eight decimal places? Maybe they only need four, and we can reduce the cardinality. I try to be as non-destructive as I possibly can. Splitting the value into two columns is a dangerous thing; if I need to go down that path I will, but the path of least resistance is the one I try first, and that is just talking to people: do you need this? Nine times out of ten they come back and say no, not at that level of precision.

I'm right in the middle of this with a customer. The fact table — an aggregation table they were trying to fit into memory, around 60 GB — had maybe 20 or 30 columns, and they had doubles on everything, including quantity columns. Do you need a double? What are the decimals? Can we change it to currency — a fixed decimal — instead of a double? They tried it and said: that actually helped a lot. They talked to the business and found out they only need two decimals, so we're good.

We changed one column from decimal to whole number, and I am not kidding you, the model went from six gigabytes to one gigabyte. It wasn't just one column, it was several columns we did that for, but changing the precision — people just don't think about it, they assume they need that level of precision — can drastically reduce the size. Splitting is the last thing I think about. You two have an article about that.

Yes, and I want to clarify: in the articles where we describe the split as an option, we want to explain the concept, but to be honest it is effective on small data models. If you have one or two million unique values in a table with five or ten million rows, splitting the value into two columns is effective. It should be an optimization that you apply to small models, especially in Power BI Desktop when you want to squeeze memory as much as possible, and only after you have already done the other optimization, which is reducing the precision.
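For readers who have not seen the technique: the idea is to store a high-cardinality decimal as two lower-cardinality columns (for example the integer part and the cent part) and recombine them in a measure. A minimal sketch with hypothetical column names (Sales[AmountInteger], Sales[AmountCents]); as explained right after this, it is only worth considering on small models:

```
-- Recombine the two split columns; each column alone has far fewer
-- distinct values (and a smaller dictionary) than the original amount.
Sales Amount :=
SUM ( Sales[AmountInteger] ) + DIVIDE ( SUM ( Sales[AmountCents] ), 100 )
```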
However, when you have a billion rows, the problem is that if you create two columns instead of one, summing the total of the two columns is usually more expensive, because at that point you lose the saving you made on the dictionary: you increase the size of the compressed data, and the compressed data is what matters when you run the query, not the dictionary size. So on very large data models, splitting the column is not really a good idea. The priority in every model is reducing the precision — the number of unique values. On small models you can additionally split the column to further reduce memory, which can be an advantage especially with Power BI Pro, because there is a limit on the size of the model you can publish to the service.

Okay, changing topic, because we are getting close to the hour: reports. You have this big data model, you have optimized everything — what is the common issue you see in the last mile?

The common thing I see is customers who are 100% DirectQuery, throw 20 visuals on the canvas, and then wonder what the heck is going on. When you are in DirectQuery mode, what people don't understand — and it is usually a revelation — is what concurrency really means. They think: we've got a hundred users, so that is going to be a hundred queries on the back end. That is not what happens. Each visual issues one or more queries to your data source; depending on the complexity of what needs to happen, a visual can issue multiple queries because of how data has to move and what the engine has to do to compute the calculations. So you have just introduced a ton of queries to your back-end data source, and you multiply that by the number of users: that is your query pattern, and that data source needs to respond, usually sub-second. It is painful, and there are techniques to avoid it. But we have also seen imported models with 20 or 30 visuals on the canvas where Performance Analyzer shows the DAX returning sub-second — everything fast — and the glorious category consuming all the time is "Other." What the heck is "Other"? Then you need to start thinking about how many visuals you are using, whether you can be creative about the layout, and start tinkering with what you actually need.

To add to that: what makes Power BI great is that it is ad hoc, you can drag and drop. But a common thing we see is people dropping a table or a matrix and just adding everything to it, and then complaining, "but I can't export it all." Power BI is a great ad hoc analysis and visualization tool, but you have to pick the right tool for the job, especially with big data. Chris Finlan is going to love me for what I'm about to say: sometimes you should not use a Power BI report — you should absolutely use a paginated report, especially if you are trying to look at very large result sets. I have no idea why people want to see millions and millions of rows of data.
But if someone is trying to see tens of thousands of rows of data, don't use a matrix, don't use a table. It can work, but think about the right tool for the job, and sometimes paginated reports are that tool when you are looking at lots and lots of rows.

And Patrick, yesterday you were showing me a VPAX file. We looked at the results and I said: your DAX is fine, it's the rendering — it was something like three seconds to render one visual. What is that visual? I don't know, we're going to go figure it out — I guarantee it's probably a wacky matrix visual, or a custom visual causing the problem.

I understand that for the report itself the number of visuals is the most important thing, but the number of visuals translates into the number of queries hitting the server, and with a large customer this can create a scalability problem: the number of concurrent users becomes an issue. This is related, because if you import the data, you have a finite amount of resources on the server, so a slow DAX query is a problem for one user, but it creates a much larger problem when you have many users. So — the question on screen asks whether you should pre-aggregate the data — yes, if you have already fixed everything else, but believe me, you need to fix everything else before you need pre-aggregated data in import mode. (If you are using DirectQuery, yes: you create pre-aggregated data precisely to avoid going to the DirectQuery side, because an aggregation is imported data.) The idea is that you should optimize your DAX before anything else, because that is how you reduce the CPU consumption on the server, and you gain two advantages: first, a single user runs a single report faster; second, when you have 100 users on the same report at the same time, it scales well, because you have reduced the amount of CPU used by every query running concurrently on the server. Is this scalability issue something you see with large customers too?

Yes — and about the case from yesterday, where we said the DAX looked fine: thinking about it a little more, Adam, I think they gave me cached results in the performance data. But Marco, you are absolutely correct: we see this probably once a week. People just don't think it all the way through; they throw the DAX out there and assume it is going to work, and then you look at the spikes when a lot of people start hitting the model — high concurrency, and everything is bad — and then it calms down when only a few people are on it and they say, "oh, everything is fine now." It is just poorly designed. I see it a lot.
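As a trivial illustration of the "optimize the DAX first" point — the kind of per-query CPU waste that multiplies across concurrent users — consider filtering a whole fact table instead of a dimension column. Hypothetical model names; a simplified sketch of a well-known anti-pattern, not taken from the customer cases discussed here:

```
-- Anti-pattern: the filter argument materializes rows from the
-- large fact table before applying the color condition.
Red Sales (slow) :=
CALCULATE (
    [Sales Amount],
    FILTER ( Sales, RELATED ( 'Product'[Color] ) = "Red" )
)

-- Preferred: filter the small dimension column and let the
-- relationship propagate it to the fact table.
Red Sales :=
CALCULATE (
    [Sales Amount],
    'Product'[Color] = "Red"
)
```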
I would say there was one time, Patrick, where it wasn't that — there was something legitimate on the back end. That's the only time we've ever seen that. Not to say that it couldn't be an actual infrastructure-type issue unrelated to your model, but that is extremely rare. Most people tend to assume that's the problem, because it's not their ugly baby, it's the service — but I've personally seen it exactly once. From a DAX perspective, though, we've seen it several times. Oh yeah — it's just bad DAX, and that is the ugly baby.

I think we covered most of the areas we wanted to talk about. I have seen hundreds of questions in the chat, and for those of you whose questions were not answered during the session: we will take a look at them, and they will become ideas for future articles, webcasts or videos. We are at the hour, so we probably have to close this session. Is there something you want to say to our channel?

I would say that if we were smarter people — and by "we" I mean me and Patrick — we would have followed this up by having you on the live stream tomorrow morning on our channel. That would have been a one-two punch: you've got more questions, come over here and we'll tag-team them. We did talk about doing some type of crossover at a larger scale, but we didn't plan it that well this time.

Actually, we at SQLBI do these live events without a regular schedule, whereas Adam and Patrick have a live Q&A more or less every Saturday morning, US Central time — afternoon for Europe, night for other time zones — and you see people from all around the world. I attend the chat every time I can; it's a lot of fun.

I will actually extend the invitation: if you are available tomorrow morning, we can pull you in. It works for us; it's up to you. We didn't advertise it or anything, but you are always welcome. I would be there. I will be there too — I'll send you an email. Let's do it: tomorrow morning on the Guy in a Cube channel, all four of us again, and we will follow up on the questions we had today.

It has been a pleasure. Thank you very much, thank you guys, thank you everybody, and see you soon — tomorrow morning. Bye-bye.
Info
Channel: SQLBI
Views: 46,569
Id: L3uT-cn_eO8
Length: 73min 22sec (4402 seconds)
Published: Fri Apr 16 2021