Formula engine and storage engine in DAX

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
- Ciao, friends. This video explains the differences between the formula engine and the storage engine, which are part of the DAX engine included in Power BI and Analysis Services. Every time you have a Power BI report, your DAX queries are executed by these engines and understanding the differences and how they collaborate is important to understand how to read the performance metrics and to optimize your DAX expressions. We will start with a quick analysis of how the engines work together with some practical example and then we will see demos using Power BI and DAX Studio. So first of all, let's see the architecture. When we have a DAX query generated by the report, by the visual in your report, the query is received by the formula engine. The formula engine is the first stage of the execution, which generates the query plan, organizes how to execute the query and retrieve the data from the storage where the data is stored. The data can be stored in memory, VertiPaq, or in DirectQuery, for example, in SQL Server. The storage engine is a separate engine that provides the engine to the storage engine to execute the query. And during the remaining part of the video, we will understand this better but, before starting, let's take a look at a practical example that shows you how the two engines collaborate together. Let's say I want to know the maximum number of CALCULATE words I have in a single page of this book. How can I do that? I can read the book cover to cover. In each page, I look for the presence of the CALCULATE word and I count how many CALCULATE words I have in each page. In order to help myself, I can use here a small sheet where I write down for each page, the number of CALCULATE words. So let's start. So I go to a page here, I open a page and I have to repeat this for every page and I scan the content. One, two, three, four, five, six, seven, eight. So this page, 153, 153, has eight CALCULATE words. And I continue to do the scan in every page of the book. So for example, here, page 155 has one, two, three, and four. So we have four words here. And I continue, I just keep the page on this side because it is not visible in the camera but I repeat this over and over for all the pages. One, two, three, four, five. Here we have five and we are at page 157. Now, once I complete my list, imagine I continue to do this for all the pages of the book, I no longer need the book at this point, I can remove the book from here and I complete the job, finding the maximum value here in this column. So this is an intermediate step in order to complete the calculation that I write here, writing the final result of the calculation, which in my case is the maximum value of the number of CALCULATE words in a single page, which is eight. So let's recap what I did. I started from scanning the book. And for each page of this book, I wrote down an intermediate step of the calculation here. Once I completed this intermediate step, I had to do a second step in my calculation, which was scanning the result of the previous step and finding the maximum number. Now, this is a representation of what the DAX engine does when we have a query. The initial request is, "give me the maximum value, "sorry, the maximum number of pages for, "the maximum number of CALCULATE words "in a single page of the book." And this produced this result. This is what the formula engine does. The yellow sheet here represents the formula engine. The formula engine can do the calculation but has to access data from the physical world, from the place where we have the data. But the formula engine doesn't have a direct access. The formula engine has to request the data to a second layer, which is the storage engine. Only the storage engine has physical access to our book, to our data. And the result computed by the storage engine is consumed by the formula engine. Why do we have this architecture? Because with this architecture, we can change the storage engine with a different one that has access to different type of physical devices. In our case, we could have not something like a physical book but something completely different. In the case of the DAX engine, we have a storage engine, which stores data in memory compressed but we have also DirectQuery that can store the data, keeping the data in SQL Server. It doesn't matter what the storage engine does, from the point of view of the formula engine, it is just an intermediate step of the calculation and the formula engine performs the remaining part of the calculation. The formula engine has the more advanced calculations, whereas the storage engine usually has a limited set of calculations that can implement and compute and the result of the storage engine is consumed by the formula engine as we will see in the next demos. It is now time to see the formula engine and the storage engine in action. We will see how a Power BI report generates a DAX query and how this DAX query is executed by the formula engine and the storage engine. The first example is a Power BI report with a DirectQuery model. We have a single visual in this report that shows two measures, the sales amount and the margin percentage by month. So this visual has to retrieve one data point for each month, and this is obtained generating a single DAX query. We can capture the query using Performance analyzer and we can analyze the query that we we will obtain with DAX Studio. So as soon as this execution completes, I click on copy query. The query is copied in the clipboard and I can paste the query in DAX Studio to see the content of the DAX code that has been executed. Now, this code actually includes also the SQL query. This is a DirectQuery model, we don't want to see the SQL query at the moment, we just want to execute the DAX query here. And I click Run so that I see what is the result received by Power BI to populate the visual. And the result is this table that has three columns and one row for each month for which we have data. You see that there is a column with the end of the month for each month. We have a column with the total of the sales amount and the value for the margin percentage. Now, this is the result that Power BI receives from the DAX engine but what happens during the execution of this query? We can enable the Query Plan and we can also enable the Server Timings pane, which provide me information about the actual execution of the query. Let me run this query again. And here, in the Query Plan, I will see a detailed list of the steps performed by the formula engine to execute, to compute the result that we received. Now, the formula engine, remember the formula engine doesn't have a way to directly access the data from the SQL Server database, in this case, we have a DirectQuery. So looking at the content of this query plan, we can see that there are several points where we see direct query result. Direct query result and we have many of these points here. Now, the query plan describes the steps, the calculations performed by the formula engine. But when the formula engine needs to retrieve the raw data from the storage, it has to send a query to the storage engine. In this case, because we are using DirectQuery, direct query result represents in the query plan, a point where the formula engine has to consume data coming from the storage engine. And what is the request made to the storage engine? In server timings, we can see here a list of requests sent to the storage engine from the formula engine. In this case, we have only one, even though we have multiple nodes in the query plan we have seen before. But let's take a look at this query. I copy the code here, the SQL code in this case, the SQL request sent to SQL Server and this time, I paste this code in SQL Server Management Studio, which is an environment similar to DAX Studio to execute SQL queries. So I paste the code here and I execute the query. It takes a few seconds. Again, it's a complex query, an expensive query for SQL but at the end, I obtain the result that is similar to what I have seen before because actually, if you see the numbers here, we have one row for each month and we have two numeric columns. The question is, "is this exactly the number "we have seen before?" If we carefully look at the result, this is not true because these two numbers are relatively large numbers, whereas the value that I obtain in DAX Studio, if I go back to the results, is a number where we have the sales amount has an actual number here, which represent the number, a number in millions probably but the margin is a decimal number. So if you think for a moment, the model we have shows sales amount, and margin percentage but we didn't spend time looking at how these measures are written in DAX. So if we go back to the model and we look at the definition of these measures, we can better understand whether these calculations can be moved to SQL or not. So if I go back here to Power BI, and I click on margin percentage, which is the measure that I didn't see in the result, you see here I have the margin percentage is the division between two other measures, margin and sales amount. And what is margin? The difference between sales amount and total cost. And what are sales amount and total cost? Well, sales amount is the sum of the multiplication of quantity multiplied by net price and finally, these two columns are two actual columns of the data model we have in SQL. Quantity and net price, like for total cost, total cost has quantity and unit cost. So these columns are the columns where we have the raw data, the data that is in the storage. And only the storage engine can access to this information. However, we didn't copy the entire content of the sales table. We would have got 12 million rows. We only received one row for each month, which means that the storage engine request transferred a request to aggregate data and because we have to sum the result of a multiplication row by row, I think that also the multiplication has been sent to the storage engine. So if we go back to the SQL code we see here, I can tell you that the number we see in a0 here is the value of sales amount, whereas a1 is probably the value of... Oh sorry, it's the opposite. A1 is the value of sales amount, yes, whereas a0 is the value for total cost. So total cost and sales amount are two measures that the formula engine can request to the storage engine, in this case, so the storage engine has the ability to execute the result of our SUMX formula. However, the subtraction between the two measures and the division between the result of the other two measures is something that the formula engine has to do because we see that the storage engine didn't compute the margin percentage. If I go back one second to the whiteboard, just let me recap what happened here. So we have seen a SQL query sent by the formula engine to the storage engine. And this query SQL retrieved a table that has, let me draw here, three columns. We have the month here, we have the total cost and we have the sales amount. Okay, and we have this total. However, the result produced by the formula engine and provided to Power BI is different because the result here has the same granularity, the same number of rows, because we always had one value per month, but the value here is sales amount but here, we have a margin percentage. So this calculation is performed in the formula engine. And if we go back to the query plan, we can see that there are steps that consume the result of this storage engine request and perform the additional calculation. In this case, our DIVIDE between the measure, one measure is the difference between sales and total cost, so it will be sales minus total cost, comma, and at the denominator, we should have, if I remember well, sales. So this calculation is performed by the formula engine because if we see that the result provided by the storage engine doesn't include the result we see in the Power BI report, it must be the formula engine to do this calculation. And the reason why I wanted to show this example with this schema, is because if I go back to DAX Studio, and so let me go to DAX Studio here, and I go back to the query plan, my problem with the query plan is that it's extremely detailed and verbose and it could be complex here to retrieve the single step that perform the division and the subtraction. If you look carefully, you can see this information but if you look at this code, it's many, many, many lines of code in the query plan, sometimes for relatively simple queries. It is much simpler to just look at the server timings, look at the result provided by the storage engine in order to understand what is the workload for the formula engine. You can also see that here, we have the formula engine and the storage engine represented here and in this case, most of the time is spent in the storage engine. When you see that the formula engine consumes more time, it's because the storage engine retrieve the data quickly but then the formula engine has to do more work. Sometimes it has to aggregate the data because maybe the storage engine was not able to perform the aggregation based on the request you made. So this first example's shown how the DAX query produced a query plan that produced one request to the storage engine and the same request made to the storage engine was consumed in different parts of the query plan executed by the formula engine to produce the final result. Now it is time to see the same example, using a different storage engine because at this point, we have seen the storage engine using DirectQuery. Now we see what happens when you import the data in the model and you use another storage engine, which is VertiPaq. Every time you import the data in memory, the engine used to keep the data compressed in memory in the Power BI file is called VertiPaq. This model has the same data that we have seen before. The only difference is that now, we have the model imported in memory. So we can have the ability to see the content of the data of all the tables here. You can see in the data view this. You can see that the data model is the same that we used before. And the measures we have are the same too. Same visual, same model but this time, the data is imported in memory. So what changes when we look at the query plan and the execution? So if I go here and I enable the Performance analyzer, start recording and refresh visuals, you see that this query has been executed in just a few milliseconds. Now I copy the query and I go back to DAX Studio and I open DAX Studio here. And I will execute the query in DAX Studio. So first of all, let's execute the query without any query plan just to see that the result is the same as the result we have seen before with the DirectQuery model. Now I enable the Query Plan and the Server Timings and I execute this query again. You see here the query plan, the description of all the steps performed by the query plan and this query plan is slightly different from the one that we have seen for DirectQuery because for a different storage engine, there could be a different query plan. The reason is that the different storage engine can have different features and can perform aggregations or calculations that don't have to be made by the formula engine. In this case, this difference is not visible but you can see that there is a different naming, a different name of the node that retrieve the data from the storage engine. This node here is called cache. So cache, cache, all these points in the query plan identify requests made to the storage engine. Now, the storage engine used by models where you import data in memory is called VertiPaq. VertiPaq is the storage engine used by Power BI, by Analysis Services when you import the data in memory, whereas when you use DirectQuery, the storage engine is your external database. Now, we can see in Server Timings, the detail of the request made to the storage engine. And in this case, we see that we had only one request made to the storage engine that is described here. Now, let me reduce this and I can copy this code and I paste this code in the editor here at the bottom. Just to show you that the storage engine request made to VertiPaq is actually a query in a language that resembles SQL. Indeed, this is called xmSQL. XM, sorry, this one, xmSQL. So we are looking at this code. Let me just go here, here we go. So this is the name of the language that is described here. It's not an actual language. We cannot execute this query for real. This is a textual representation of a request made to the VertiPaq engine. Internally, this request is made using a binary tree in memory, which is more condensed and compressed but this textual representation help us to understand what is going on. And you see that this syntax contains expressions, these two expressions are derived by the two measures, sales amount and, oh, I removed this, sales amount and total cost. And here you see that the SELECT statement retrieves three columns. One column's used to do a group by and two columns to perform an aggregation of the result of these two expressions. One is multiplying quantity by unit cost. This is the total cost measure. And one multiplies quantity by net price. This is the sales amount measure. So the result of this xmSQL query is not visible unfortunately. We can imagine that like we have seen before, the result here is a table that has three columns, the month, the total cost and the sales amount with one row for each month for which we have results. And the formula engine receives this cache, this data cache, this is the internal name of the result of the VertiPaq query data cache and this data cache is transferred to the formula engine that performs the other calculation that, in this case, are identical to the calculation we described before because we have to compute the margin and we have to compute the margin percentage. So this is DAX code that is executed by the formula engine, so we can say that what the formula engine does, it knows, the formula engine knows DAX, whereas the storage engine knows, in this case, the VertiPaq engine knows xmSQL, which is the language that we have seen before. Let me show you this this way. I was covering the name here. Okay, so this is the area. Now, once the formula engine receive the data, once again, it builds the final result sent to Power BI, which contains, for each month, the sales amount and the margin percentage. As you see, this value has been computed in the formula engine, whereas this value, sales amount, was coming directly from the storage engine. There was no transformation required by the formula engine. So we have seen that the DAX engine has two engines, the formula engine and the storage engine. The formula engine is very sophisticated, has the full knowledge of the DAX language and can perform any calculation. It can perform also further aggregations if required. Whereas the storage engine is more limited. It has direct access to the physical data and this allows flexibility because this way, DAX can be used with DirectQuery, with VertiPaq and potentially with other storage engines in the future. So this is the first step to understand how DAX works, understanding the formula engine and the storage engine. You can start to investigate your measures using DAX Studio to see how they are executed in the query plan in the server timings with all the storage engine queries and of course, there is more to study to optimize your code but by far, this is the most important concept to understand before you move forward with other additional details about the VertiPaq engine and the DAX engine. Enjoy DAX.
Info
Channel: SQLBI
Views: 12,640
Rating: undefined out of 5
Keywords: DAX, PowerBI, Power BI, CONT, C0013, Marco Russo
Id: hc-x4DBv0V4
Channel Id: undefined
Length: 25min 49sec (1549 seconds)
Published: Mon Nov 09 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.