Comparing DAX calculated columns with Power Query computed columns

Video Statistics and Information

Captions
- Ciao friends. In this video, I want to answer one of the most frequently asked questions about Power BI and Tabular in general. It goes like this: if I need to create a new column in my model, and I have the option of doing that in DAX or in Power Query, which one is better? You have asked it yourself, haven't you? Now, the thing is, answering this question can be rather complex or extremely simple. It all depends on what you mean by better. Does better mean smaller, so that a smaller model is better? Does better mean faster? Or does better mean simpler to create? There are a lot of possible options, and answering all of them takes time and a lot of reasoning. That's why I divided the video into two sections. In the first one, I just give you my opinion and my answer, and if you trust my judgment, then you just go for that. If you want to know more, then you will need to follow me to the end of the video, where I explain why I provided the first answer. Let's start with the answer. The answer is: it depends, but if your model is not really huge, it doesn't really matter. Go for what you like the most. That's the end of section one. You're still there? I know you, you want to know all the details and you want to understand how I ended up with this conclusion. If that is the case, follow me and let's start analyzing why this is the answer. We start with the usual Contoso data model. We have sales, promotion, a lot of different dimensions, but for this model we are only interested in the sales table. I loaded the entire Contoso model, so it's 12 million rows. The reason is that if we don't have enough rows, it is impossible to measure anything. The first thing we need to do is take some baseline measurements, because we want to measure the space used by columns created with DAX or with Power Query. So we need to know, at the beginning, what the size of the model is. And then we will do the two steps.
We create the column in DAX, we create the column in Power Query, and we do some reasoning in the meantime. Let's start by measuring the size of the database as it is now. To do that, we just need to open DAX Studio, and from there, we take a look at the metrics and we save them just in case. So I just connected DAX Studio and took a look at the metrics, and now I have all the information about my data model. You see that the only relevant table is sales, which contains around 12 million rows. I'm not interested in doing any kind of analysis so far. I just want to take the metrics and save them for later. So as the next step, I export them and give them a meaningful name, like Contoso 12 million base. Okay, let's save it, and we have our numbers. Let's go back to Power BI Desktop. As I said, the first step is creating the column using DAX. I'm going to create a new column that contains a small number of distinct values: the price range. The price range basically splits the sales depending on the net price. So let's call it price range. If the price is less than or equal to 100, we say it's low; if it is less than or equal to 1,000, it's medium; if it is higher than 1,000, it's high. So we use a SWITCH over TRUE and we check the net price... sales net price... we don't have sales net price here. I am probably creating a measure. Yes. So let's start again: I wanted to create a new calculated column. It is price range, which does a SWITCH over TRUE. And if the net price, now I see it, is less than or equal to 100, then we say it's low. Otherwise, if the sales net price is less than or equal to 1,000, then it's medium. Otherwise, it's high. This column only contains three different values, so it should be a small column in the model. Just a second to compute it, and now I have my price range column, and it shows three values: high, low, and medium. Now that I have the column in the model, I can go back to DAX Studio and look at the metrics. So let's go back to DAX Studio. We still have it here.
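For reference, the calculated column described above can be written like this (a sketch following the transcript; the table and column are assumed to be named Sales and Net Price, as in the standard Contoso model):

```dax
Price Range =
SWITCH (
    TRUE (),
    Sales[Net Price] <= 100, "Low",      -- first matching branch wins
    Sales[Net Price] <= 1000, "Medium",
    "High"                               -- default when no condition matched
)
```

SWITCH over TRUE evaluates the conditions in order, so the branches must go from the lowest threshold to the highest; the last argument acts as the default.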
And I can look at the metrics here, but I'd rather create a new window so that I have the option of looking at the metrics in both. So we look at the metrics right now, and as I always do, we save them just in case. This is Contoso 12 million calc col in DAX. Now the sales table still contains 12 million rows, but this time I want to dig into it a bit deeper, because I'm searching for the new column that I created. Here it is: price range. Price range uses around 2.6 megabytes of space. So it's a column with 12 million rows and three distinct values, and it's around 2.6 megabytes in size. Creating it was rather simple and didn't take a lot of time. But the next step is to do the same, this time using Power Query. So first we delete this column, because we want to go back to the original size of the model, and then we do it in Power Query. So we go to home, transform data, go to sales, and we create a new column, a custom column. Okay, now we need to write M code. The name is... no, it's still price range. If the net price is less than or equal to 100, then low; else, if the net price is less than or equal to 1,000, then it's medium; otherwise, it's high. The logic is no different from what I wrote in DAX, and we just click on okay. And finally, we close and apply. Now, here comes the first difference. When I created the column using DAX, it was quite fast, because it created the column and it was immediately available. DAX is super fast when it comes to creating columns. Whereas with Power Query, I now need to wait until the full table is loaded again. I could have cut this section from the video, but I decided to keep it here. The reason is that I want you to experience the time that is needed to load 12 million rows. It's not fast, and it makes the whole process kind of annoying, because you need to wait a long time every time you make a small change. We are around 9 million. We have three million still to go.
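The custom column expression entered in the Power Query dialog would look roughly like this (a sketch assuming the column in the Sales query is named Net Price):

```m
// Custom column "Price Range" in the Add Custom Column dialog
if [Net Price] <= 100 then "Low"
else if [Net Price] <= 1000 then "Medium"
else "High"
```

As in the DAX version, the order of the conditions matters: each `else if` is only evaluated when the previous condition was false.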
And if, for whatever reason, I need to update the column, I need to run through the entire process again. So from the development point of view, it is kind of better to use DAX, because the entire development process is much faster. Now it has finished. Keep in mind that it's not that easy. It's not that DAX is always the best option, because if you really have a huge table, then it is very likely that you have partitioned your table. Now, if a table is partitioned, you probably always load, especially in production, only the last partition, which is rather small compared with the entire table. A calculated column forces a recalculation over the entire huge table, not only the last partition. Whereas a column computed in Power Query or in SQL will be loaded from the server and does not trigger any recalculation. So for a large model, the choice really depends on different aspects, not only on size; for small models, the difference is not really relevant, because it doesn't take a lot of time to load a small table or to recompute a column for a small table. Now let's go back to the size. We created the column; we need to measure its size. The thing is, if I measure the size of this model right now, I will get wrong values, because immediately after processing, the model looks larger than it actually is. There is a good reason for that, and it's rather technical, but you cannot measure the size of a model immediately after a process. What I need to do is save the model and load it again, in order to take a good measurement. So I save it, then I close it, and then I will open it again. Hopefully, it's not going to take too long. So let's open it again. Now the model is being loaded. Memory is being allocated, and now I will be able to take the correct measurements of the size. Just a few seconds, come on Power BI, chop chop. We're just waiting for you. Here we go. Now the model is ready. I already have DAX Studio open.
I just need to go there and create a new connection. I need to connect again, because I closed it earlier. So I now need a new connection: go to Advanced, View Metrics, and expand it. Now we can start to make some comparisons. We have sales. We search for the price range, and here it is, the size of the price range column. It's 470 kilobytes. So it's much smaller than the previous column. Let's compare the two: there we had 2.6 megabytes, and here we have 470 kilobytes. And this is expected. A column created in Power Query is always smaller than the same column created in DAX. Again, the reason is rather technical. When you create a column in DAX, it does not participate in the choice of the optimal sort order used to increase the compression. This column has a very small number of distinct values, so it's likely to come very early in the list of columns used for sorting, and this is the reason why, if you load it from Power Query, it will be much smaller. But how much smaller? That is what is relevant. We are talking about 12 million rows, and the difference in size between this column, at 470 kilobytes, and the other one is around two megabytes. Now, when I was a teenager, two megabytes was likely the size of a good hard disk, but as of today, two megabytes is just a tiny number. There is actually no real difference between half a megabyte and two and a half megabytes from the point of view of query speed, so I do not expect any difference at all in the speed of querying this column. Besides, there is another important factor that you need to take into account. We've looked at the price range column; let's look at the other columns. For example, the order number column. In the new model, the last one that I created, where the column has been created using Power Query, the order number is 82 megabytes. In the model that was created earlier, the one with the calculated column, it is 80 megabytes.
So the two megabytes that we saved on the calculated column, we basically moved onto another column. The reason for this is, again, technical. Whenever you add columns to your table, the sort order of the table being loaded is likely to change. And because this column has a very small number of distinct values, it comes very early in the sorting, so it changes the sort order of all the other columns, this way changing their size. Therefore, you cannot just look at one column and conclude that, because that column is smaller, the model is better. You need to look at all the columns and reason on top of them, in order to make sure that the columns you use the most are the most optimized. So don't fall into the trap of thinking that optimizing one column is always good. The entire model needs to be optimized. Moreover, we are not yet done. Let's look at the total size. The total size of the model without the calculated column is 173 megabytes. With the calculated column, it's 175: two megabytes more for the calculated column, and we knew that. And the model with the column computed in Power Query is, again, two megabytes more. So, all things considered, did we really save space? It's hard to tell whether the column computed in Power Query is better or worse. The overall size of the model is nearly the same. The individual column is smaller, but the space that we saved on that column has been moved into all the other columns. So we made some other columns worse. It's not enough to reason on top of one column. When you need to optimize a model, you need to look at the entire model and check that you are actually achieving your goal. Besides, it's very important that you do not focus on these two megabytes. They are negligible, not important at all. There are no differences. It makes a difference if, instead of having 12 million rows, you have 10 billion rows.
There, the difference is important. But if that is the case, you will not just look at one column; you will try to super-optimize everything. Finally, it is very important to understand that, if you have a very large model, optimizing the place where you compute the column is not the most important thing to do. You need to focus on other aspects: data types, cardinality, the number of distinct values, the size of relationships; a lot of other aspects are much more relevant. That is why we come to the conclusion of all this. Is it better to create the column in Power Query or in DAX? It actually does not really matter. First, it depends: it depends on what you mean by better. The overall size does not change by a lot. Yes, the single column size might change, but overall, the model does not change in size by a lot, at least in this specific scenario. You might find different results looking at your model, but you need to run through the entire process. The reasoning that you do in choosing whether to compute something in DAX or in Power Query is more about partitioning, the time that you need to refresh your model, the option of doing that in Power Query or in SQL or in DAX, and the kind of calculated column. There are calculated columns that are super easy to compute in DAX and extremely complex to compute in Power Query or in SQL. For those columns, you go for DAX, but you need to take into account that there are drawbacks. So the answer is really: it depends, not only on where you compute the column, but on the overall effect that this has on your entire model. And it does not really matter in the end; the net result does not depend only on where you compute that column. It depends on other factors that you need to take into account. Enjoy DAX.
Info
Channel: SQLBI
Views: 19,419
Keywords: DAX, PowerBI, Power BI, CONT, C0099, Alberto Ferrari
Id: -x2dLTtKiR8
Length: 18min 4sec (1084 seconds)
Published: Thu Jan 28 2021