Integrating Power BI and Azure Data Lake with dataflows and CDM Folders

Captions
So, we have one minute before the meeting is scheduled to begin. I cannot imagine a more exciting way to spend a Saturday morning. Let me actually do this: my clock has just ticked over to 6 a.m., so Ted, please give me a verbal thumbs-up or thumbs-down that we are ready to begin on your end. "All right, we are actually ready. They're going to close the door so you can get started. And from a logistical perspective, in the room here, if you have questions or anything, just shout them out and I will relay them." I have a slide for this, so let's actually put that one on hold.

Let me say hello to everyone. It's a good morning here, but good afternoon to everyone in Stockholm. To get things started, I want to emphasize how important and valuable sponsors are for community-organized events like SQL Saturdays. You cannot make an event like this happen without support from sponsors. The organizers are key, but the sponsors are as well, so please visit them and say thank you.

I also want to emphasize how much I really, really wanted to be in Stockholm today. Stockholm is my favorite city in the world, whether it is for the lions or for the dragons, whether it is for the Nordic Museum or for Junibacken, and the food, even after my favorite restaurant in the world closed last year. I don't know if you ever made it to Esperanto while it was still there: amazing food, amazing people. And even though my recent trips to Sweden have taken me more to Gothenburg, that is primarily because they have the better swordplay events. Stockholm will always have a true metal place in my heart, because this is the first place that I ever got a bass string from Joey DeMaio at a Manowar show. But with all of that, I will not be in Stockholm today. I haven't shared the details yet, but essentially I failed to book flights for this event. When I was talking with the organizers two or two and a half months ago, I looked at the plane tickets, the airfare was great, I said yes, I will definitely be there, and I remember booking tickets, but I don't trust that memory, because I don't have any tickets to be there.

So if you do have any questions, what I will ask is that you open your laptop or your phone and drop them into Twitter. My Twitter handle is @SQLAllFather, and if you ask me a question on Twitter during the event, I will answer it immediately after this session is done. If you ask me a question later on, if you've got other things to do, I will do my best to answer it later today, Redmond time.

So this is the introduction; I'm rushing a little bit because I've got so much content that I'm interested in going through today, and I know that an hour is not a lot of time. From an introductory perspective, I am a program manager on the Microsoft Power BI team. I have been working at Microsoft for around ten and a half years at this point, and before joining Microsoft I spent 10 to 15 years primarily as a data warehousing, database, and ETL consultant, so the topics and capabilities around dataflows have been near and dear to my heart for most of my professional career. When I think about dataflows in Power BI, I think about them in three ways. Each one of these introductions, each one of these ways of looking at dataflows, shows an aspect of how they add capabilities to Power BI and ways that they can be used. Since this is not an introductory session, I'm not going to spend as much time on this as I otherwise might, but I am going to go through it because I believe it's important to lay the foundation for the demos and for the Azure integration details to come.
The first way that I think about dataflows in Power BI is that they are part of an evolution of the self-service BI platform. I've been doing data stuff long enough to remember when it was really cool and exciting to have a database to work with. Your client would have an application, their first PC sort of thing, with a file-based database holding the information behind it. And even though each application had its own silo of data, and typically had some built-in reporting capabilities, clients would want new visualizations: they would want reports or dashboards or other ways to get insights that the application developer didn't include. As the use of computers and automated systems grew, we had more of these transactional systems, and the need was to have the reports and dashboards and other visualizations include data from more than one system, which caused the data warehouse pattern to emerge.

When I say data warehouse, most professionals will immediately think of a cloud data warehouse or a relational data warehouse. In this context, think about the capability, not the technology. The technology could be a relational database; it could be an underlying data lake with files and folders. But this pattern solves a common, well-understood set of problems, with capabilities around staging, around having a location for the consolidation, preparation, cleansing, and standardization of data from multiple sources, so that you can have a single point of access for downstream analysis and other processing.

So when we had that data warehouse introduced into our pattern, this introduced a need... and Ted, are you trying to jump in with a question? Okay, I just want to be attentive; if you do have a question, just speak my name loudly, because I will not hear the smaller sounds. "The only thing I would ask, Matthew, is if you could move the mic away from your lips just a little bit." Is that better audio? "I think so." Excellent, thank you for that, and if there's anything else that I'm doing that is similarly annoying, definitely do speak up, because I will do my best to adjust.

So here, as the data warehouse pattern or architectural tier was introduced, with it came the need for data preparation, or ETL. This is the logic to get the data from all of those transactional sources and to reshape it and load it into that central place. And of course, once we had all of this data in a more analytics-friendly location and format, we ended up having more and more users wanting to do more and more things, and at this point the platform, or the industry, evolved to include a higher-performing architectural component with OLAP and in-memory analytics models. This could be a cube; this could be a tabular model. What we're looking at here is a generic way to visualize what BI systems looked like 20 years ago; there's nothing in here that's new.

What has been new in the last 10 to 15 years, maybe a little less than that, is the evolution of self-service capabilities in business intelligence. We have tools like Power BI, like Tableau, QlikView, MicroStrategy, whatever tool you choose. The goal of self-service BI is to reduce the load on the IT team, to let the data professionals do more strategic things, while at the same time enabling the business professionals, the people who need to get insights from data, to be more agile and more responsive and not be blocked by IT.
Most self-service BI tools focus on the visualization side of things: self-service reporting, self-service dashboarding. Power BI has always included self-service modeling as well, the datasets that you create in Power BI Desktop, and it has always included self-service data preparation through Power Query. But the challenge with Power Query, as we have traditionally used it, is that the queries we define in our self-service data preparation layer define the tables in our analytics model. That means any logic defined there is trapped there, in a useful location, but one that is not useful for reuse. In this context, this first way of introducing them, Power BI dataflows are a way to bridge this gap: to use the same self-service data preparation capabilities through Power Query to load data into a location where it can then be shared and reused by multiple users, in multiple analytics models, in downstream contexts, in a self-service BI context, without requiring explicit IT involvement. So it bridges that gap for self-service business intelligence.

The second way that I want to introduce Power BI dataflows is that they are another object type, or artifact type, inside the Power BI service. If you've used Power BI, you're familiar with creating reports and dashboards, and you know that you need to create a dataset, the underlying tabular model, for your reports and dashboards to use. This is still the same: dataflows do not replace datasets. What dataflows do is serve as another possible data source type on which you can build your datasets. Dataflows are fully managed by the Power BI service, so that when a user is connecting to data from Power BI Desktop, they have dataflows as another data source type, and they can mash up dataflow data with SQL or Oracle or whatever other data they need. An interesting analogy here is that while datasets in the Power BI service are simply Analysis Services tabular models that are managed by the service, dataflows are CDM folders, which is a data storage format we'll go into in more depth later. These are folders of data that are physically stored in Azure storage behind the scenes, but they are also fully managed by the Power BI service. Just as your Power BI users never know that they're creating a tabular model, they'll also never know that they're creating files and folders in Azure storage when they create a dataflow. We're applying the same pattern here, building these self-service capabilities on top of the mature underlying data platforms that we've developed.

The third way that I want to introduce Power BI dataflows is to say that dataflows are like Excel. If you think about how Excel works, a given cell in an Excel workbook contains references to other cells or ranges of cells in that workbook or other workbooks, and Excel formulas are a functional language: essentially each cell defines a function that references another, which references another, and at some point references data that is entered directly and is not a function. Each one of these functions understands its lineage and its dependencies, and it has the Excel process as an execution context to track all of this. Now, we take for granted that Excel works like this. We've been using spreadsheets for decades, and we understand it instinctively at this point.
But if you've ever done enterprise ETL, if you've worked on an enterprise data warehouse project, you know really deeply that this is not how enterprise ETL tools work. For enterprise ETL tools you have to develop packages, or pipelines, or whatever your tool of choice calls them, and you need to orchestrate them to ensure that the right things run in the right order and that transactional consistency is maintained, so that if one thing fails another thing doesn't start, or if one thing runs long the other thing doesn't start until it's completed.

Power BI dataflows are like Excel because every entity in a Power BI dataflow is defined by its function. Now, I haven't introduced any terms yet, because this is not an introductory session, but let me just say that if you think about a dataflow as being a container, like a database is a container, that's a great way to think about it. So in this diagram we have a dataflow called "ingest from Dynamics sales" or "sales staging" or whatever we want to call it, with 22 entities in it. This is kind of like a database with 22 tables, but each one of these entities is defined by a Power Query query, and that Power Query query is a function that, when executed, populates the data in the entity. Like an Excel formula, it's a single authoritative function that says: get data from here, do these things with it, put the value in this location.

If we look at this diagram, on the left we have sources that are external to the Power BI service. This layer, the staging layer, is where we're bringing the data into Power BI. These dataflows need to be explicitly refreshed, so someone needs to click refresh now, or set up a refresh schedule, or manage this programmatically. But when these are refreshed and the data in these entities is updated, because Power BI understands the relationships between all of the dataflows in the service, the downstream dataflows that pull data from other dataflows inside the service will automatically refresh in a transactionally consistent manner. So without any explicit user action being taken, simply by building these dataflows that represent our star schema, and having them based on these dataflows that represent our cleansed and enriched data, which in turn are based on these dataflows which represent our staged data, the entire orchestration simply works, because of the relationships, because of that authoritative lineage information that the Power BI service maintains.

Now, enough talking; let's actually look at what this looks like as we're building a dataflow end to end. From a demo perspective, I'm going to move over here and start with a blank workspace. I created the workspace before this session began, and I'm going to build up three different dataflows to show a canonical or typical example of how we see this being used. I have a set of URLs on a screen that I'm not sharing, so if you see my mouse moving away, that is because I'm cheating to get my URLs from a text file.

For this first dataflow, I'm going to pull in data about promotions that my organization is running. I have this data stored in a set of text files in Azure storage, and I'm going to do some basic transformation on it. You'll notice that the experience here is based on Power Query Online, a very familiar experience for Power BI users; it's a new creation experience, but it's the Power Query experience you know and love.
I'm going to do some basic transformations: I've promoted headers, and I'm going to split this particular column because we've got multiple delimited values in it. So I've got this nice end-to-end table, or entity, that I've defined, and I will give it a name; I'll call it Promotions. Notice as well that if I right-click on this and choose Advanced editor, I have the same Power Query, the same underlying language, and the same connectivity stack that we use in other parts of Power BI. This opens up all sorts of capabilities, but this is where the Excel-like nature of Power BI dataflows comes in, because that Power Query query is the formula that defines where the data comes from and how it's transformed. I'm going to save this dataflow and call it Promotions Staging, since I'm just using it to pull the data into the system. When I create it, I'm immediately prompted to either refresh now or set up a schedule, and I will refresh this as well.

While this is refreshing, I am going to create a second dataflow. Again, I'm going to pull in new entities by getting data from outside of the system; for this one, I am going to pull in data from a SQL Server database. I am pulling in data only from cloud sources today. This is not because that's a limitation of dataflows; it's simply because it is easier for me to set up and run a demo without needing local or on-premises resources, but you can use the same gateways that you would use in other parts of Power BI. For this particular dataflow, I'm going to pull in three entities from SQL Server tables. I'm not going to do any transformation, but I'll give this one a more meaningful name and choose Save and close. Again, once we're done with the validation, we'll put in a name and refresh it. Remember that dataflows are like databases, so each one of these is a container that has multiple entities inside it.

The next thing that I want to do is take these two staging dataflows and bring them together: I want to use data that I've pulled in from one and data that I've pulled in from the other to create dataflows that combine data from these various sources. I could combine these directly, but there are various governance situations where a data source has a load window, due to the timing of transactional loads or other considerations, that says these get refreshed at this time and those get refreshed at that time, and having different dataflows with different refresh schedules allows us to easily handle this type of pattern.

For this third dataflow, I'm actually using Power BI as my source, and when I go down this path I can see all of the workspaces in my Power BI tenant that I have access to that contain dataflows I can use as sources. Down here I will pull in data from the Accounts and Calls Staging dataflow that I've just created, and I will pull in data from the Promotions Staging dataflow that I've just created. You'll notice that these are in the same workspace that my current dataflow is in, but this is not a limitation of dataflows; I can also pull in information from other workspaces. So let's say that in addition to the promotions data and the calls data that I've just pulled in here, I also have a staging dataflow in a different workspace that another user has created, where I can pull in information about products and customers and their addresses.
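For reference, the Promotions Staging query described above is just an ordinary Power Query (M) query like the one shown in the Advanced editor. A minimal sketch of what such a query might look like follows; the storage URL, file name, and column names are hypothetical, not the actual values from the demo.

```m
// Illustrative sketch only: URL, file name, and column names are placeholders.
let
    // Connect to the Azure Blob Storage container that holds the text files
    Source = AzureStorage.Blobs("https://contosodemo.blob.core.windows.net/promotions"),
    // Pick the promotions file and parse it as CSV
    PromotionsFile = Source{[Name = "promotions.csv"]}[Content],
    Imported = Csv.Document(PromotionsFile, [Delimiter = ",", Encoding = 65001]),
    // Use the first row as column headers
    PromotedHeaders = Table.PromoteHeaders(Imported, [PromoteAllScalars = true]),
    // Split a column that contains multiple delimited values
    SplitChannels = Table.SplitColumn(
        PromotedHeaders,
        "Channels",
        Splitter.SplitTextByDelimiter(";", QuoteStyle.Csv),
        {"Channel.1", "Channel.2"}
    )
in
    SplitChannels
```

The functions used here (AzureStorage.Blobs, Table.PromoteHeaders, Table.SplitColumn) are standard Power Query library functions; the exact query generated by the demo steps may differ.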
When I choose Next, and let me just click on these to get all of the icons to load, you will notice that each one of these entities that I've created is a linked entity: there's a little chain-link icon displayed next to them. What we have done here is create a pointer to where the data already lives. We're not duplicating it, we're not copying it; we're simply saying let's create a new starting point by referencing this other data source. Once I have done this, I can use it in all of the ways that you would expect. Here I will say I'm going to combine tables by merging queries to create a new one, so I will have Promotions on the left, I will have calls by account by day on the right, I'll choose account ID as the column that I'm going to join on, and choose OK. I will scroll over here to the right, expand out the merged column, make a few little changes here, choose OK, and now I have a dataflow entity that pulls data from multiple other dataflows, does transformations, and presents them in a consistent way. When I choose Save and close here, the validation is going to take a little bit longer because I have a larger number of entities to validate, but once this is done we'll put in our name. For this one, I am not going to refresh when I save; what I instead want to do is demonstrate what we've created so far, before we bring this back into the context of Azure.

As I close out my third dataflow, you'll notice that back in my workspace I have the original Promotions Staging dataflow, I have the Accounts and Calls Staging dataflow, and then I have the dataflow that pulls in information from both of them. Notice as well that one of them has not yet been refreshed. As I move from the list view to the relationship view, we can see both the relationships between the dataflows that I've created today and this other dataflow, the AdventureWorks dataflow, that has the staging data for customers and addresses. When I enable showing data sources, and let me zoom out just a little bit here so we can see the entire thing, we can see how that lineage information that the Power BI service automatically maintains is represented graphically: we have the sources that are outside of the system, we have the sources that are inside the system, in our current workspace or outside of it, and then we have the downstream dataflows that combine the data from all of these. This visual view gives us an easy way to understand relationships, dependencies, and impact.

When I refresh any of these upstream dataflows, so if I click on the action button and choose Refresh now, you'll notice the refresh starts to spin here, and as soon as this refresh is at a point where the downstream refresh can begin, it will begin automatically. Let me zoom back in and hope that we can see this a little better. If I come in and look at the refresh history for this, you'll notice that we have two different on-demand refreshes; you watched me do each of these, one when I created the dataflow and one right now. If I look at the third dataflow and look at its refresh history, there is only one, and instead of saying "on demand" for the type, it says "linked sources", a refresh type letting us as the workspace admin know that this refresh was triggered by the upstream dataflows and the underlying dependencies in the service. So everything that I've done here is defining entities of data that are stored as CDM folders in Azure under the hood.
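Like the staging entity, the merged entity built in this demo is defined by a single Power Query function over the linked entities. A rough sketch is below, assuming hypothetical entity and column names (Promotions, CallsByAccountByDay, AccountID); it is not the exact query from the demo.

```m
// Illustrative sketch only: Promotions and CallsByAccountByDay are assumed to
// exist in this dataflow as linked entities pointing at the staging dataflows.
let
    // Left outer join the two linked entities on the account key
    Merged = Table.NestedJoin(
        Promotions, {"AccountID"},
        CallsByAccountByDay, {"AccountID"},
        "Calls",
        JoinKind.LeftOuter
    ),
    // Expand the columns we need from the nested calls table
    Expanded = Table.ExpandTableColumn(
        Merged,
        "Calls",
        {"CallDate", "CallCount"},
        {"CallDate", "CallCount"}
    )
in
    Expanded
```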
Let's actually take a step back from what we've just done from a self-service BI capability perspective and look at more of the details about how this actually interacts with Azure, and what scenarios that enables. Let me emphasize here that everything I have done from a demo perspective today is something that an analyst can do in Power BI. We're simply using Power Query in the browser to connect to a source, do transformations, and save them. So this is a new experience, but it's a familiar set of tools. What's really interesting and exciting is the underlying integration with Azure Data Lake Storage Gen2 that enables more scenarios for reuse.

If we think about what we just did here, we created a Power BI dataflow. It defines the ingestion logic to store data in a CDM folder, or a set of CDM folders. A CDM folder is just a set of files and folders in a data lake, with a specific format, but it is designed to enable integration between multiple services, both cloud services from Microsoft and from third parties. In the simple scenario, Power BI can write to and read from the CDM folders that it creates, and other Azure services that handle data ingestion, like Azure Data Factory, can also read from and write to Azure Data Lake Storage and can create and consume CDM folders. From a strategic perspective, this CDM standard, this technology, is a big bet for Microsoft's Cloud and Enterprise group, so this includes all of Azure, all of Power BI, all of Dynamics, all of Power Apps and Flow, and so on. This is a significant bet that all cloud data services at Microsoft are investing in to enable simple integration between these services. So even though Power BI and Data Factory are the two primary ingress services, all of the other Azure data services will be reading and writing CDM folders to support these scenarios, so that a CDM folder created by an analyst as a Power BI dataflow can be used, without any integration code needing to be written and without any transformation or munging, as a data source for Azure Machine Learning models, for Databricks, or for other downstream services. It's also worth emphasizing that, whatever other integration paths these other Azure services have, CDM folders are a primary mechanism for data exchange.
Now, if you've used dataflows in Power BI, you have probably noticed that the CDM folders you are creating are not available for other Azure services to consume. The default setup is that the Power BI service uses its own built-in storage. It's still Azure behind the scenes, but it is an Azure storage account that is fully managed by Power BI, which means that the Power BI service is the only writer and the only reader. This is great for enabling self-service collaboration, for one analyst to create things for other analysts to consume in Power BI, or for enabling a central IT team, say, part of your BI center of excellence, to create dataflows that are reusable by analysts inside Power BI. But if you actually want to enable the full integration scenario, there are explicit steps that you need to take to set this up.

There are three basic paths that you need to go through in order to enable this Azure Data Lake integration in Power BI. The first path, shown by the first swim lane up here, is a one-time provisioning step where a Power BI global administrator needs to create and configure an Azure storage account. I'm not going to go over all of the requirements here, because it's basically a bunch of fiddly identity things, and because we're going to make that experience much easier before the Azure integration, which is currently in preview, is generally available. This configuration is step-intensive today, but it's also well documented, and we're going to make it much simpler and more user-friendly by the time we go GA. But roughly: we'll create an Azure storage account, we'll configure it as an ADLS Gen2 account, we'll assign specific permissions to it so that the Power BI service can read and manage the data that's in it, and then we'll attach that storage account to Power BI. This is a one-time configuration at the tenant level in the Power BI admin center. It's also important to emphasize that during preview this is configurable once, so there is a single Azure storage account that is used by the entire Power BI tenant. Making this more granular, so that for example you could have multiple storage accounts configured at the tenant level and then choose which one to use at the workspace level, is something that is on our backlog, but it's not available in preview today. So we do that one-time step to say we've attached this storage account to our Power BI tenant, and then the administrator can say: I will enable workspace administrators to use this storage account. The first part says Power BI knows where the storage is, and then there is an explicit switch to say yes, workspaces can use that account. Once this is done, when workspace administrators create a new workspace, they will be able to assign that workspace to the organizational data lake account. This is a per-workspace option; the default is to use the built-in Power BI storage, but a workspace administrator can choose something other than the default.

"A second question here: could you just spend a moment and discuss the capabilities as they apply to standard capacity versus premium capacity?" That is an awesome question, and thank you very much for asking it. I have a slide all the way at the end that touches on this a little bit, but since we only have an hour, I may or may not speak to that slide, so there are two ways to answer. The first one is that for everything related to Azure integration, all of the things that we're looking at right now, none of these depend on premium capacity.
If you're using shared capacity, you can still integrate with your own organizational account, and you can still attach external CDM folders, which is the other flow that I will show later on, so none of the Azure integration requires Power BI Premium. Another way to answer is that there are two capabilities in Power BI dataflows that require Power BI Premium today. One of them is the composable ETL, the Excel-like automatic refresh when the upstream things refresh; that does require Premium. The second thing that requires Premium is incremental refresh of dataflow entities. But everything else with dataflows will work with shared capacity as well. Does that answer the question, Ted? "Thank you." Excellent, and thank you for the question and for jumping in.

So let's look at that second swim lane. In that second swim lane, once the tenant configuration has been completed, when a workspace administrator creates a new workspace, they can choose to put that workspace on a Premium capacity, and they will also have the option to turn on the switch to use the organizational data lake account. Once this is done, when they have created and refreshed their dataflows (refreshing is key, because that's when the data actually gets produced), there is now a CDM folder, a folder full of files of data, that exists in that data lake storage account for others to consume. When the two swim lanes in Power BI are completed, we can move on to the third swim lane, which is enabling a developer or data scientist or data engineer to use that data in Azure. Here the first thing they need to do is be aware that the CDM folder exists; they need to connect to it, and they need to have authorization to connect to it. But this is essentially a discovery and connection problem; it's no longer a real technical problem. And when this is done, we now have an end-to-end experience where, as Power BI is writing to the CDM folder through the refresh of dataflows, any Azure service consuming it can have access to that data, even though it's being produced with a self-service BI process.

So let's jump back into our demo environment and look at some of the additional things we've done for the demos that we already completed. You'll notice that I already had a workspace created to start with. In order for me to do this demo, I had configured a few things beforehand. Let me do this... my zoom doesn't want to work, so I'm going to assume that you can read what's on the screen. Here in my admin portal, since I'm a tenant administrator for this particular demo environment, under dataflow settings I have configured this preview feature to enable Power BI to write to this specific data lake storage account. So this storage account has been set up, and it has the right permission set so that Power BI can do what it needs to do; this is that first swim lane that we looked at on the slide. The second part is that we have allowed workspace administrators to assign their workspaces to this storage account. So this fulfills the two requirements for that first swim lane.

The second thing (let's go back to our default zoom), the second thing that we've done is in our workspace settings. We're back in the workspace, and I've come up to Settings. Even though I have Premium turned on, so that the composable ETL, all those linked entities, would work, that wasn't required for our Azure integration. What was required was turning on this dataflow storage setting.
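As a rough illustration of the first swim lane, provisioning an ADLS Gen2 account with the Azure CLI might look something like the following. The resource names are placeholders; the permission assignments that let the Power BI service read and manage the data, and the step of attaching the account to the tenant in the Power BI admin portal, follow the dataflows documentation and are not shown here.

```
# Illustrative sketch only: names and locations are placeholders.
# Create a resource group and an ADLS Gen2 account
# (StorageV2 with the hierarchical namespace enabled).
az group create --name rg-powerbi-dataflows --location westeurope

az storage account create \
  --name pbidataflowsdemo \
  --resource-group rg-powerbi-dataflows \
  --location westeurope \
  --sku Standard_LRS \
  --kind StorageV2 \
  --hns true

# Granting the Power BI service permission on this account, and attaching it
# at the tenant level in the Power BI admin portal, are done per the
# dataflows documentation and are omitted here.
```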
You'll notice it now says that dataflow storage can't be changed, because there are already dataflows in the workspace. Because there is no capability to move data between your organizational account and the built-in storage, this setting can only be changed for a new workspace or for a workspace that does not yet have dataflows defined. Another aspect of this relates to the linked and computed entities capability. Remember how I emphasized that we're not copying the data, we're not duplicating it, and so on? A side effect of this is that for these relationships between dataflows in different workspaces to work, for linked and computed entities to work, both workspaces must be in the same storage account.

So for our Azure integration setup, we set our tenant settings, we set our workspace settings, and we created and refreshed our actual dataflows. What happened behind the scenes? Let me pull up Azure Storage Explorer down here, and we'll make this a little bit bigger. Azure Storage Explorer is a free UI tool for working with storage accounts in Azure, and I will use it to look at this subscription. I've got this pay-as-you-go subscription that I use for this demo, and if I expand out the account, you'll notice that this is the same storage account that we saw configured in Power BI. If I expand this out, we have our blob containers node, and inside it we have a blob container called powerbi. This is where all of the workspaces that are configured to use this storage account will put their CDM folders. As I select this and scroll down over here, we see that there is this SQL Saturday Stockholm demo folder, and all of these other folders are folders for the different workspaces that have been created; Power BI automatically manages this folder structure.

If I double-click into my workspace, I can see the three different dataflows that I created, and each one of these is a CDM folder, or Common Data Model folder, which means that it will contain two different things. It's going to contain a model.json file; this model.json file contains all of the metadata for all of the entities in this dataflow and for the dataflow itself. And the CDM folder also contains a set of CSV files for the entities that are in that dataflow. Here there are a couple of things to call out. One is that we have a folder structure that includes "snapshots" in its name, and we have a CSV file that also has a snapshot and a timestamp in its name. You may also notice that even though this final dataflow that we created had seven or eight entities, there's only one entity whose data is written here. This is because the actual data for those other entities still lives in the storage location where it was originally created; we're only referencing it.

The snapshots are important because of the need for transactional consistency. If you think about what would happen when a reader, whether it's someone in Power BI Desktop, or a dataset refresh that's taking place, or another dataflow referencing one of these dataflow entities, is reading from an entity while that entity is being refreshed at the same time, there needs to be a way for both of those operations to complete without transactional consistency being violated. Power BI dataflows actually use the same logic and much of the same code as the snapshotting mechanism inside Analysis Services to handle these similar patterns.
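To visualize what Storage Explorer is showing, the layout inside the storage account looks roughly like this; the workspace, dataflow, and snapshot names are illustrative, and the exact snapshot naming convention may differ.

```
powerbi/                                   <- blob container (file system) used by Power BI
  SQLSaturday Stockholm Demo/              <- one folder per workspace
    Promotions Staging/                    <- one CDM folder per dataflow
      model.json                           <- metadata for the dataflow and its entities
      Promotions.csv.snapshots/            <- snapshot folder for the Promotions entity
        Promotions.csv@snapshot=2019-05-04T06-12-00.csv
```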
Actually, I'm going to come back up to my workspace folder, and I will come down into Promotions to look at some of the additional details here. Notice that for this one, because of the refreshes that I've done on this dataflow, I've done more things with it: in addition to our model.json we have snapshots of it, and if we come down into Promotions we can see the snapshots for the entity as well.

Moving back into Power BI, I actually want to look at some of the underlying JSON that's being created, because this is kind of significant. I will choose the action menu here and choose Export to JSON. I could do this from inside Azure Storage Explorer as well and get the exact same file, but I like to do it from Power BI because the UX flow is a little bit easier. So I end up with the JSON file; let me pretty-format it and let's look at some of the information that's inside it. The reason this is important is that the CDM folder format defines an explicit, required schema for all of the entities that are inside it. Some of these things are going to be Power BI specific, but much of it is going to be general to any CDM folder that is created or consumed by any Azure service. This JSON file defines the Accounts and Calls Staging dataflow, and it allows us to see basic metadata like the name, the description, the version, and the last time it was modified. It also allows us to see the M package, the underlying mashup, that Power BI used to load the data. This metadata is essentially saying we've got these three different entities, they're referencing these queries by ID, and these are their names; this is the same sort of metadata that would be defined in any Power Query environment. And if you look down here, this document attribute contains the full Power Query scripts for all of the entities that exist.

If we move further down, we can see that we have three different entities. We've got a local entity called Accounts, and inside it we have all of these different attributes. The interesting, and perhaps most important, thing is that we have this partitions section that defines the partitioning policy. This one is just a full refresh, because there is no incremental refresh set up: we have the last time it was refreshed and the full URL to the file that's being referenced. If we had set up incremental refresh, we would have a list of partitions here that matches whatever the refresh policy was and the history of this entity.

Coming back into the Power BI UI, I'm going to export the JSON for our final downstream dataflow as well, just so that we can see some of the differences. Again we'll wait for this to load and pretty-format it. When we get down into the entities here, because this CDM folder does not contain all of the data for all of the entities that it defines, what we instead have is a set of reference entities that call out the name and the location of the entities that are physically stored in another location, but which are logically exposed as entities in this dataflow. So essentially, creating the dataflows through the Power BI Power Query Online experience allows us to automatically (the service does all of the heavy lifting) define both the logical definition of the CDM folders in these model.json files and also the data that's being produced through the execution of those queries.
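An abbreviated, illustrative sketch of the kind of content a model.json can contain is shown below. It combines the two cases described above (a local entity with a partition, and a reference entity as seen in the downstream dataflow) in one file for brevity. The field names follow the CDM folder model.json format as described here, but the values are placeholders, and the Power BI-specific mashup section containing the full Power Query scripts is omitted.

```json
{
  "name": "Sample Dataflow",
  "version": "1.0",
  "modifiedTime": "2019-05-04T06:15:00Z",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "Accounts",
      "attributes": [
        { "name": "AccountID", "dataType": "int64" },
        { "name": "AccountName", "dataType": "string" }
      ],
      "partitions": [
        {
          "name": "Accounts",
          "refreshTime": "2019-05-04T06:15:00Z",
          "location": "https://<account>.dfs.core.windows.net/powerbi/<workspace>/<dataflow>/Accounts.csv.snapshots/Accounts.csv@snapshot=..."
        }
      ]
    },
    {
      "$type": "ReferenceEntity",
      "name": "Promotions",
      "source": "Promotions",
      "modelId": "promotions-staging-model"
    }
  ],
  "referenceModels": [
    {
      "id": "promotions-staging-model",
      "location": "https://<account>.dfs.core.windows.net/powerbi/<workspace>/Promotions Staging/model.json"
    }
  ]
}
```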
"I have another question here in the room; I'm going to let him ask it if we can." Hopefully we can do that; I'm just picturing the in-room logistics. So, thank you everyone for understanding my airline failure. "From an Azure Analysis Services perspective, will we be able to load these, or export these, to Analysis Services?" Let me answer that question in two different ways. The first thing is that until an upcoming capability is published in our Business Applications release notes, which is our public roadmap, I am not allowed to discuss it publicly; I can only discuss it under NDA, so I cannot answer that question today. What I can say is that all of the Azure services, even ones that I'm not explicitly mentioning here, are invested in this CDM folder format, to use it as a data source or a data destination depending on their capabilities. And Azure Analysis Services literally shares the same codebase as Power BI and Power BI Premium. Not all of the code is enabled for all deployments, but this is a very logical future direction and it aligns with our priorities. "Okay, thank you." Hopefully my not answering your question gives you the information you need.

There's one additional path for Azure integration that I want to show before I jump back into my slides; we have just over ten minutes left. You may have noticed that I copied the URL to this model.json file: I went into Properties and copied that URL out. What I can do here, let me come back to my list view, so here I have the three dataflows that are defined in Power BI. I want to add a fourth dataflow to this workspace, but what I want to do is choose this final option, which is Attach a CDM folder. This is known as an external CDM folder, or external dataflow, where I simply put in a name and a description and paste in the URL to the model.json file that defines a CDM folder in Azure. When I say Create and attach, there's no definition; Power BI doesn't have any logic behind this, and we go out of our way in the workspace view to say this is an external dataflow: look, there's less information, there are fewer options. What this gives us the ability to do is have some other Azure service writing to the CDM folder and managing and maintaining the data, while still making it available as a fully managed self-service data source for Power BI consumers. If we jump back into the demo, let me actually come back here to show this: this is the path where we have data coming in from another service, writing into CDM folders in Azure, and then being made available for this low-code, no-code experience inside of Power BI.

Now, I've talked about CDM folders and the Common Data Model a lot, but I haven't actually defined anything, and it's worth spending some time at the end of the hour, now that we've seen some of the end-to-end capabilities and details, to talk about some of the concepts that are important to understand as you move forward to use this later on, when you get back to the office and it's not Saturday anymore. The Common Data Model is a metadata standard. The text on the slide I literally copied and pasted from our documentation: the CDM is a metadata system designed to enable structural and semantic consistency across multiple applications and deployments. I like to think of it as a platform- and tool-agnostic version of the system tables, or the system catalog, in SQL Server: this is sys.objects and sys.columns and so on, in a way that any application can use.
The reason I think it's really important to talk about the fact that the Common Data Model is a metadata system is that most people don't think of it this way, and it's Microsoft's fault that they don't. When Microsoft first announced the Common Data Model, the second definition is what we led with. We were telling people: the Common Data Model is a set of pre-built business entities, we've got account and campaign and product and all these different things, and it's awesome because everybody can use the same definition of all of these things. And this is really true; the Common Data Model is also this. But at its root the Common Data Model is a metadata standard, and Microsoft has built a set of standardized, extensible schemas using that standard, and we're working with industry partners and third parties to build out additional sets of entities for things like healthcare and retail and manufacturing. But at its root, the thing that is most generally useful today is the fact that it's a metadata standard.

And what about CDM folders, Common Data Model folders? A CDM folder is simply a folder in a lake that conforms to a specific, well-defined structure and schema, one that conforms with that Common Data Model standard. So if the Common Data Model is a metadata format, CDM folders are a data persistence format that uses CDM metadata. There are two different types of content that you will see in every CDM folder. The first one is a model.json file. It has to be named that, it must be model.json, and it must be at the root of the folder; this makes discovery incredibly simple. And then there's a set of data files. They have to be CSV today; we are going to support Parquet as our next file format, and it's likely that Avro will be the third format we support. As of, whatever month and year it is, oh, it's May the 4th, 2019, as of today CSV is the only format that we support, but because these are standard files that everyone can read and write, they're very easy to use. The way it's set up is that we have some sort of file-system level, typically managed by the applications that are writing it, and we have a set of folders with that model.json and with the data files inside it. Power BI uses a specific set of subfolders; that's optional, but it's nice, it makes it manageable.

The reason I'm asking these questions, the reason I'm saying what about CDM, what about CDM folders and all these things, is that this is really the magic of Power BI dataflows. Those CDM folders are the magic glue that allows Power BI and other Azure services, even if you're not using Power BI, to work together; it enables all of these integration paths. This is really the key: we've got the low-code, no-code experience for business analysts, and we have the pro-code experience over here, the low-code to high-code range, and what makes it possible for them to work together is CDM folders in Azure Data Lake.

We've got five minutes left, so I want to wrap up with some less technical things I want to emphasize. This is not specific to Azure, but these are common questions that I hear; this is kind of an FAQ. When you think about Power BI dataflows, do not think about replacing other tools. Don't think about replacing Data Factory or Integration Services; that's not what we're trying to do. What we are trying to do is enable business users, Power Query users, and analysts to do things that they used to need to go to IT to do. So it's all about self-service data preparation, closing a self-service gap, not being a new enterprise ETL tool.
And the other thing, and we're kind of restating things here, but another thing that is key to understand is that when you don't have dataflows, we invariably see business users either having to wait for IT, or working around IT by dumping things to Excel or CSV files or the like, or we see organizations investing in third-party tools like Alteryx or Datameer or the like, which are often expensive and don't integrate well. Dataflows close these gaps.

And in what I believe is my final slide before I wrap up, this is the type of customer production scenario that we see; we've got customers that have been using dataflows in production even before dataflows went GA. What we generally see is a set of dataflows that has been created by a central team. This may be an IT team, or it may be a business team, but most frequently it is a BI center of excellence inside a large organization, one that uses both IT and business resources. They're using Power BI dataflows to stage data and to do cleansing, and in this case calculations, as I call out in the slide; we get the data into the system and do common things to it, and this is done by a central group. Then, from this point on, a specific line of business, some subset of consumers with a specific and smaller scope, will use this as a source: they will filter it, they will transform it, they will add other data sources to it, and then they'll build their datasets and visualizations on top of it. This part of the diagram scales horizontally as well, so there may be dozens of different lines of business using this dataflow as one of their sources. Having the dataflow as a composable unit of data that multiple internal groups can consume and work with is both what dataflows were designed to do and the pattern that we see customers implementing with dataflows.

I will share these slides; they will be on my blog either today or tomorrow, and I'll tweet the link out. There's a ton of information on dataflows and the Azure integration available. My favorite resource is the end-to-end partner sample that basically walks you through, starting with an Azure trial subscription, how to create and consume CDM folders from multiple Azure services. I also need to do the same plug for my blog: I've been blogging for about six months on dataflows, and there's a ton of information out there. I honestly think it's the best single online resource for Power BI dataflows information; you can choose to agree or disagree on that one. But the best thing to do if you have questions following this session is to ask Ted, because he knows everything, or find me on Twitter. If you ask a question within the next 10 minutes or so, I will answer it right away, and if you ask it later in the weekend, I'll answer it as quickly as possible.

So with no further ado, we've got one minute left. Ted, do we have one question, or do we give everybody the extra one-minute break in between? "From the looks of everybody, they're smiling and laughing about the extra one minute." Well, it is Saturday morning, and I could not hear what the gentleman said. All right, I will say thank you very much to everyone. The next time there's an opportunity to be in Sweden, I will definitely be on it. So thank you very much for both joining me and for understanding the special accommodations, and thank you to Ted for making this work. "Well, thank you, Matthew; it worked out very well. Take care, everyone, and have a great day."
Info
Channel: Matthew Roche
Views: 7,315
Rating: 5 out of 5
Keywords: Power BI, dataflows, Power BI dataflows, ADLSg2, Azure Data Lake, CDM, CDM Folders, BYOSA
Id: 84Bk9hg7t4o
Length: 60min 50sec (3650 seconds)
Published: Mon May 06 2019