Microsoft Fabric: Data Warehouse vs Lakehouse vs KQL Database

Video Statistics and Information

Captions
Hey everyone, and welcome back to the channel. Today we're continuing the Power BI to Microsoft Fabric transition guide series with a very important topic. One of the biggest mistakes people make when moving to Microsoft Fabric is selecting the wrong data store. In Fabric there are three main data stores: the lakehouse, the data warehouse and the KQL database, and each has slightly different characteristics, use cases, features and limitations. If you don't fully understand them, you might not make the right decision for the requirements in your business. In this video we'll go through each one in detail, understand what it is and what it should be used for, look at the features and limitations, and compare them side by side. Then, in the second half of the video, we'll look at how you can combine these data stores into some sample architectures, because typically they aren't used on their own: in Fabric you might be using one or two lakehouses plus a data warehouse, and maybe a KQL database as well if you're working with real-time data. So that's what we've got in store for you today — enjoy the video.

To start off, I'm actually going to begin in the OneLake file explorer. This is a plugin you can download that acts a bit like OneDrive for your local machine, and the reason I'm starting here is to show that, under the hood, the lakehouse, the KQL database and the data warehouse all store their files in the same format — which is particularly visible in the OneLake file explorer. Here I've connected it to my Fabric environment, and we can see different Fabric items; when we click into one, it's just folders. There's a Tables folder, and these are tables in my lakehouse; if we click into one we can see the Parquet files and the _delta_log, which contains the JSON files that make up the Delta log for that particular table. If we go to another item in my workspace — this data warehouse, for example — you'll see it's exactly the same structure: Tables (we do have a schema in the data warehouse), and if we click through we again just see Parquet files and the Delta log. So under the hood everything is stored in exactly the same way. That's the first point to note about all of these data stores.

So what is different, then? When we flick over to the UI — here we're in a lakehouse — I'm going to give you a quick tour of each of the three data stores so you can see what they look like, what you can do, and some of the unique features of each one. What's particularly unique about the lakehouse is that we have access to both tables and files. The lakehouse can store structured data — these tables, where we can see a preview of the data — but it also has a Files area. The definition of a lakehouse is that it's part data lake, i.e. files: semi-structured data such as CSV, JSON and Parquet, plus completely unstructured data such as images and audio. If you're a data scientist that's really useful, because you might be doing image analysis or computer vision work, and you'll be saving that kind of data into the lakehouse Files area.
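To make the Files area a bit more concrete, here is a minimal sketch of how a notebook can work with it. This is an assumption-laden illustration rather than anything shown in the video: the folder and file names are hypothetical, it assumes the notebook has this lakehouse attached as its default lakehouse, and it assumes Pillow is available in the runtime.

```python
# Minimal sketch: browsing the lakehouse Files area from a Fabric notebook.
# Assumptions: this lakehouse is attached as the default lakehouse, a hypothetical
# Files/raw/images folder exists, and Pillow is available. `mssparkutils` (like `spark`)
# is pre-defined in Fabric notebooks.
from PIL import Image

# Relative "Files/..." paths resolve against the default lakehouse in OneLake.
for item in mssparkutils.fs.ls("Files/raw/images"):
    print(item.name, item.size)

# The default lakehouse is also mounted locally, so plain Python libraries can open
# files directly — handy for image/audio work in data science projects.
img = Image.open("/lakehouse/default/Files/raw/images/sample.png")
print(img.size, img.mode)
```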
One of the main workflows in the lakehouse is to get data from an external system — a REST API, or files in ADLS, for example — bring it into the Files area, then parse, transform and maybe clean it, and put it into more structured tables so it can be used later downstream in a pipeline and visualized in Power BI. For that reason the lakehouse is very often used as the bronze layer in the medallion architecture, because it can handle both files and tables.

In terms of interacting with the data in a lakehouse, the tool we use is a notebook. Hand in hand with the data in the lakehouse, we use notebooks to analyze, visualize, transform, validate and test it. You can create a new notebook just by clicking 'Open notebook', link an existing one to this lakehouse, or create a new one — I've got one here that I'll open. The thing to know about notebooks is that we can write code in a variety of languages, and the notebook is built on top of the Spark engine. If you watched my 38-minute video introducing Microsoft Fabric, you'll know that the engine behind the lakehouse, and the data engineering and data science experiences in general, is Spark. Because of that we get four main languages we can write in a notebook: PySpark, Spark Scala, Spark SQL and SparkR. If you have really large datasets you'll want to use the Spark engine, because it can chop through all that data in a parallelized way. I've got a whole series about using Spark and notebooks — it's about a four-hour-long video — and I'll leave a link in the description below if you want to learn more. It's not just Spark we can use in notebooks, either: if you've got slightly smaller data you can write conventional Python and use libraries like pandas and matplotlib, but you won't benefit from the parallelization you get with Spark.

We can also create shortcuts in the lakehouse — shortcuts to anywhere in Fabric. This table here is a data warehouse table, and you can see the small link on the Delta table icon indicating it's a shortcut: the data doesn't actually live in this lakehouse, but it's accessible to us, in our notebook code for example.

As well as the lakehouse view we've been looking at, the lakehouse also has a SQL endpoint, so we can write T-SQL scripts to view the data in our lakehouse. The important thing to note is that it's read-only: you can't alter tables, add columns, or run inserts, updates or deletes — really only SELECT statements. You can also create views on that data, which is the only real write capability, and a view doesn't actually alter the underlying data. Here we get a view that looks a lot like the data warehouse, but it's actually the data in a lakehouse; we can write T-SQL scripts, create DAX measures on this data, build semantic models, and all that good stuff from this view.
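Here is a short sketch of the "files in, structured tables out" workflow described above, as you might write it in a lakehouse notebook. The source folder, column names and table name are hypothetical placeholders, and `spark` is the session pre-defined in Fabric notebooks.

```python
# Sketch of the typical lakehouse (bronze) workflow: land raw files in the Files area,
# parse/clean them in a notebook, then write them out as a Delta table.
from pyspark.sql import functions as F

raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("Files/raw/orders/")          # semi-structured input landed from an API or ADLS
)

cleaned = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_date", F.to_date("order_date"))
       .filter(F.col("order_id").isNotNull())
)

# saveAsTable registers a managed Delta table under the lakehouse Tables area — these are
# the Parquet-plus-_delta_log folders you can see in the OneLake file explorer.
cleaned.write.format("delta").mode("overwrite").saveAsTable("bronze_orders")
```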
Next up, the data warehouse. If you've used any sort of SQL database before, it has a very similar structure: schemas, tables, views, functions and stored procedures. The language we use in the data warehouse is T-SQL — we can write T-SQL queries and save them for colleagues under the 'My queries' and 'Shared queries' options. So the data warehouse will be very familiar to database developers, SQL developers, and maybe even Power BI developers who have used SQL before. A typical workload in the data warehouse sits more towards the end of the data processing pipeline — the gold layer of a medallion architecture, for example. You might be creating dimensions, aggregates and analytical models here, because you may already have SQL code that describes business logic in your organization, and you'll be using SQL to build those things. And again, you can create semantic models from the data in your data warehouse.

Now here we are in the KQL database, with some data from the weather sample, which is a sample dataset you get in Fabric. The KQL database is really built for streaming and real-time datasets. There are a few different data sources we can connect to in order to get data into a KQL database. Some of these are one-time operations — if you've got a big batch of IoT sensor data for the last two years, you can upload it here and store it in the database — but the main use case for KQL is continuous data: we can use eventstreams, or connect to existing Azure Event Hubs, to stream data into the platform. Once we've got some streams set up, we organize things in tables; we can also use shortcuts and materialized views. If you've used Azure Synapse, dbt, or any database that incorporates materialized views, it basically means we can create a view of the data and then write it out so it acts more like a table, which improves read performance when we query it in Power BI, for example. We can also create functions and data streams — in this one I haven't actually got any set up, but that gives you an idea of what the KQL database looks like.

In terms of languages, the main one in the KQL database is KQL, and we can also use SQL to some extent. There's a cheat sheet included in the documentation, and if you use the keyword 'explain' in front of a SQL statement, it outputs the equivalent KQL. Let's try that: I'll create a new KQL queryset, which takes us through to the queryset editor. If we only know SQL, we can copy the SQL query in with 'explain' and it returns a KQL query; we can then copy that KQL, paste it into the editor, run it, and get the result. If you're familiar with SQL, that's a good way of learning KQL as well.

So that's a tour of the three main data stores we have in Fabric. Now let's do a side-by-side comparison to summarize everything we've gone through so far, and I'll talk about a few other considerations when comparing these three data stores. Let's jump over to the iPad and I'll walk you through them.
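For reference, the same 'explain' trick can also be run outside the Fabric UI with the azure-kusto-data Python package. This is only a hedged sketch: the query URI, database name and table/column names are hypothetical, and in practice you would normally just type the statement into a KQL queryset as shown in the video.

```python
# Sketch: asking a KQL database to translate a SQL query into KQL using the `explain` keyword.
# Assumes a hypothetical query URI and database name copied from the KQL database details page.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://<your-kql-database-query-uri>"   # hypothetical placeholder
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

sql_text = "explain SELECT TOP 10 * FROM Weather ORDER BY Timestamp DESC"
response = client.execute("<your_kql_database>", sql_text)

# The result set contains the translated KQL statement, which you can paste back into a queryset.
for row in response.primary_results[0]:
    print(row[0])
```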
Okay, so here we are on the iPad, and what I've got is a side-by-side comparison of the three data stores we have in Fabric: the lakehouse, the data warehouse and the KQL database, as we've just seen. To give it some structure, I'm going to work through the rows down the left-hand side: the data types each store can handle, the languages, Copilot integration, the ecosystems that a particular data store opens up (which I think is very important), the security considerations and configurations, the typical workloads, the unique features, and the personas. Afterwards we'll look at some end-to-end architectures that combine these data stores in different ways depending on the requirements in your organization.

Let's get started with the data types — basically revising what we spoke about earlier to reinforce it. For the lakehouse, as I mentioned, we can handle structured data, by which we mean tables. We can also handle semi-structured data — things like CSV files in the Files area, or JSON from a REST API. And we can handle unstructured data: images, audio, anything really. That's particularly useful for data scientists; if you're doing AI projects based on computer vision, or anything else that requires the analysis of unstructured data, you need to get that data into Fabric, and the place it goes is the Files area of a lakehouse. The data warehouse only handles structured data — really just tables. The KQL database handles structured tables, and also semi-structured data, because we can import files and analyze them in a one-off batch scenario as well.

Next up, the languages. For the lakehouse we obviously have access to the Spark engine, so we can write any of the Spark languages: Spark SQL (which is similar to T-SQL, though there are some differences), PySpark (I did a long series on that), SparkR and Scala. On top of the Spark languages we can also write plain Python — pandas, matplotlib, seaborn, all the traditional Python libraries. And on top of that we can also write T-SQL in the lakehouse via the SQL endpoint, but that's read-only: we can do SELECT statements, but not inserts, updates or deletes. In the data warehouse we really only have T-SQL natively, so we can write queries, stored procedures and so on. One thing to bear in mind is that if you have a table in your data warehouse, there's an option to shortcut it into a lakehouse (that's my shortcut logo — very bad, I know), and then you get access to all of the lakehouse languages as well. So just because your data is stored in the data warehouse doesn't mean the only thing you can do is use T-SQL — you can still shortcut it into a lakehouse if you need to use PySpark or anything like that. Natively, though, you only have T-SQL, and the T-SQL here is importantly read and write: you can do inserts, updates and deletes, which is one of the main distinguishing points of the data warehouse.
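To illustrate that shortcut pattern, here is a minimal sketch of what it looks like from the notebook side once a warehouse table has been shortcut into a lakehouse. The table and column names are hypothetical, and `spark` is the pre-defined session in a Fabric notebook.

```python
# Sketch: once a warehouse table is shortcut into the lakehouse Tables area, Spark can
# read it like any other Delta table, so the full Spark/Python ecosystem applies even
# though the data physically lives in the warehouse.
dim_customer = spark.read.table("dim_customer_shortcut")   # shortcut to a warehouse table
orders       = spark.read.table("bronze_orders")           # native lakehouse table

enriched = orders.join(dim_customer, on="customer_id", how="left")
enriched.groupBy("customer_segment").count().show()
```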
The KQL database, as you'd expect, uses KQL, and SQL as well.

In terms of Copilot integration: if you're using an F64 or P1 capacity, you get access to Copilot in the lakehouse experience — technically in a notebook, but the notebook is connected to a lakehouse — so yes, it's available for the lakehouse. Currently there's no integration with the data warehouse, so you can't use Copilot to write your T-SQL scripts, for example, and it is integrated into KQL. That's something to bear in mind if you want to take advantage of low-code and no-code features: Copilot is accessible for the lakehouse and KQL. I'm sure they'll build the warehouse integration soon, but currently it's not available.

While we're on the topic of low code and no code, there are a few other options for each of these, so I'm just going to add a new row here — 'low code / no code', a bit messy but it is what it is. In the lakehouse, besides Copilot, we also have Data Wrangler, a feature that helps you build Python and pandas code if you're not too familiar with those languages. In the data warehouse we have visual queries, where you can create a query using a UI, so if you're not familiar with T-SQL you can still build queries that way. In the KQL database I don't think there's anything low-code or no-code — I might be wrong, but I believe there are currently no options.

Next I want to talk about ecosystems, because for me this is a very important point to consider when you're choosing a data store. Let's talk about the lakehouse first. Because the lakehouse is built on top of the Spark engine and we can use Python, it opens up the Spark ecosystem and the Python ecosystem, and with those you get a lot. Machine learning is one example — there are huge existing ecosystems in both — so as a data scientist you really want your data in a lakehouse, where you can easily access machine learning libraries and build models on top of that data. Another example is data validation frameworks, particularly in Python, where there are many options; in the next video I'll be talking a lot about data validation and how to implement it across all of these data stores (there's a small sketch of the idea below). In the T-SQL world, with the data warehouse, the good thing is that it opens up SQL projects, so we can manage our T-SQL code and database configuration using SQL projects, and we can also use dbt. dbt is a third-party tool that's very commonly used for analytics engineering: it serves as the transform tool, it's all in SQL, you can connect it to your Fabric data warehouse, and you can use it to build aggregations and analytical models. The good thing is that your dbt code is source-controlled and there's a lot of functionality for validation, so it's a really powerful ecosystem to tap into if you're going to be using the data warehouse. For the KQL database, I'm sure there are ecosystems, but I'm not actually aware of them, so I can't really comment too much — although it's a very mature technology, so I'd be surprised if there weren't existing ecosystems and tools to plug into.
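As a taste of the data-validation point above, here is a deliberately framework-free sketch of the kind of check you can run in a lakehouse notebook (the next video covers proper validation frameworks). Table and column names are hypothetical.

```python
# Lightweight validation sketch: fail the notebook run if basic quality checks don't pass.
from pyspark.sql import functions as F

df = spark.read.table("bronze_orders")   # hypothetical bronze table

checks = {
    "no_null_order_ids":  df.filter(F.col("order_id").isNull()).count() == 0,
    "no_negative_amounts": df.filter(F.col("amount") < 0).count() == 0,
    "has_rows":            df.count() > 0,
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    raise ValueError(f"Data validation failed: {failed}")
print("All checks passed")
```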
Okay, next I want to talk about the security configuration options for each of these. Let's start with the lakehouse. What I mean by security configuration is how granular we can get with sharing objects, rows, columns and that kind of thing. In the lakehouse, if we're using the Spark engine — accessing the lakehouse via notebooks — the answer is effectively none: we can't really do any of that. That's something to bear in mind, and I talked about it a lot in a previous video in this series about access control. If we're using the SQL endpoint, you do get access to row-level security, but one thing I mentioned in my previous video is that this row-level security is only enforced if you give people the viewer role. Anything higher — if you give them contributor, for example — means they can write a notebook and your row-level security is not really enforced. So if you want to lock lakehouse data down at the row level, you should give people a maximum of viewer permissions. In the data warehouse there are many more options for granularity in what we can share: we can share at the object level (a specific table or view, for example), we can do row-level security, and we can do column-level security, so you can hide particular columns from particular users or groups. We can also do dynamic data masking, which means masking sensitive data in particular columns for particular people or groups — if you've got sensitive information like an email address or a credit card number, that data can be masked in that column. So there are much more granular permissions available in the data warehouse if that's what you need. In the KQL database we have access to row-level security, and that's it.

Now, the typical workloads. If we look back at what we've written, we can start to understand what the typical workloads might be. Starting with the lakehouse: the fact that we can have unstructured and structured data in the same place means a typical workload is getting data from outside Fabric, parsing the files, and getting them into a structured format. This file-to-structured transformation is a typical lakehouse workload, especially if the files are complex: if you have to do a lot of parsing, or you've got a really nested JSON structure coming in, you really want to do that with code, because it's a lot simpler than doing it with SQL (there's a short sketch of this after the workload comparison). That's a use case where you'd definitely use the lakehouse and a notebook. On top of that, any sort of machine learning or data science work — you really want your data in a lakehouse for that. For the data warehouse, typical workloads are analytical modelling, normally in preparation for Power BI: creating dimensions, facts and star schemas, building aggregations, all of that. And for KQL, the main workload is obviously incoming streaming data, and it has a lot of built-in functionality for time-series analysis. You can of course use Python for time-series analysis, but what I mean here is time series on real-time data as it comes into the system, not just batch-style time-series analysis.
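Here is a small sketch of the "complex nested JSON" lakehouse workload mentioned above: a few lines of PySpark flatten a nested structure into a table, which would be awkward to express in SQL. The file path and field names are hypothetical.

```python
# Sketch: flatten a nested JSON payload from the Files area into a structured Delta table.
from pyspark.sql import functions as F

raw = spark.read.option("multiLine", "true").json("Files/raw/api_response.json")

flat = (
    raw.select("customer_id", F.explode("orders").alias("order"))   # one row per nested order
       .select(
           "customer_id",
           F.col("order.order_id").alias("order_id"),
           F.col("order.total").alias("order_total"),
       )
)

flat.write.format("delta").mode("append").saveAsTable("bronze_orders_from_api")
```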
In terms of unique features for these three data stores, one thing we haven't quite mentioned yet is Git integration. The lakehouse is an artifact you can back with Git. The data warehouse currently isn't supported, though I'm sure it will be soon, and if you're using SQL projects there's an opportunity to use that as your source control method instead. The KQL database, as far as I know, also isn't supported via Git. Another unique feature I want to mention for the data warehouse is table cloning, which is particular to the warehouse. You can create clones of tables in your data warehouse for development and testing: if you've got a dimension table, you can create a clone of it, alter it, change its structure, and test it out before reintegrating it into your main dimension table. That's a feature they released fairly recently, so if that's something you want to do, just know that it exists (there's a small sketch of it after the personas below).

Finally, the personas. Hopefully it's fairly clear by now who's going to be using each of these. In the lakehouse we'll typically have our data engineers and data scientists, because they're the ones comfortable writing Spark or Python code, they'll want to use machine learning, and they'll want to do data validation — all the things that are easiest in a lakehouse. For the data warehouse you'll have DBAs, SQL developers and database developers, and maybe Power BI developers if they know some SQL and can create dimensions and facts for themselves — that's great. And the KQL database is for data engineers and analysts who are comfortable with real-time data and KQL.
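For reference, here is a hedged sketch of the warehouse table-cloning feature mentioned above, run from Python via pyodbc. The server, database and table names are hypothetical placeholders, and in practice you would normally just run the T-SQL directly in the warehouse query editor.

```python
# Sketch: create a zero-copy clone of a warehouse dimension table for development/testing.
# The connection details are hypothetical; copy the SQL endpoint from the warehouse settings.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-warehouse-sql-endpoint>;"
    "Database=<your_warehouse>;"
    "Authentication=ActiveDirectoryInteractive;"
)
cursor = conn.cursor()

# Clone the table so you can experiment with structure changes without touching the original.
cursor.execute("CREATE TABLE [dbo].[dim_customer_dev] AS CLONE OF [dbo].[dim_customer];")
conn.commit()
```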
Okay, so we've gone through quite a lot of different considerations — things you need to think about when you're choosing a data store. I want to finish by looking at some sample architectures. Taking all of this into account, we can begin to understand that some data stores are better suited to some tasks than others. If we talk in general terms about the data processing pipeline, we have data coming into Fabric at one end, and at the other end a dashboard with your user. Think of a sliding scale: at the start the data is unstructured or semi-structured — files, API data, database extracts — and it's typical that it's not in good shape when it enters Fabric. The goal is to gradually create more structure as we move from left to right through the pipeline: clean, transform and add structure to these semi-structured and unstructured datasets. Normally that means the left-hand side of the pipeline is files and the right-hand side is tables.

So when we think about how to architect a solution, towards the ingestion end we need to start with a data store that can handle practically anything, because the sources will all be in different formats and you don't know what shape they'll arrive in — we definitely need a place to store both files and tables. Typically, then, an end-to-end pipeline starts with a lakehouse, because it can hold files and tables: all our files can land here — data from APIs, even database data that we bring in and write as Parquet, for example. Then, in that lakehouse, we create notebooks that validate, clean and transform all of this raw data, and we end up with some tables in the lakehouse that are essentially raw tables: we haven't done much to them, we've just taken the data out of files and put it into tables to give it a bit of structure, without much cleaning or validation yet.

From there, we typically want a platform where we can do data engineering and, if your organization needs it, data science. What most companies do is create another lakehouse that reads in these raw tables and does things like cleaning, getting the data into a more refined state. That can include conforming, by which we mean merging different datasets into a single dataset: for a customer table, you might bring in data from three different source systems and conform it into one master customer table, for example. This is also where your data scientists might be working — you might have tables that take these refined tables and add another layer of insight, such as predictions or some other data science output, so you're creating new data from your existing data.

Then we get to the part of the analytics workflow that's close to Power BI, and Power BI developers like structure. This layer is typically a data warehouse, partly because you have to think about the skill sets available in your organization and who is going to be doing this kind of work: creating star schemas and aggregated datasets that are fed into Power BI. The people building these things are usually a lot more comfortable with SQL, which is why a data warehouse makes a lot of sense in this layer of the pipeline. You could have lakehouse, lakehouse and lakehouse here instead, but it would require people in your team to know Python, because they'd have to build all of the dimensions and facts in Python — which is possible, but it's not what most companies do.

The way you'd integrate a KQL database into this architecture is if you had streaming data: you'd probably put a KQL database alongside the bronze layer, create an eventstream, and have the data stored, analyzed and processed in the KQL database. You can then route it into the silver layer of this medallion architecture, so once you've got the insight and done some analytics in real time, you save that data into more long-term storage in the silver lakehouse. That's how you'd integrate a KQL database into this kind of architecture — but all of these choices are highly dependent on the skills you have in your organization.
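To make the silver-layer "conforming" step a little more concrete, here is a minimal sketch of merging customer records from several source systems into one master table. The table and column names are hypothetical and simplified — real conforming logic usually involves matching rules and survivorship, not just a union.

```python
# Sketch of a conform step in the silver lakehouse: combine customer data from three
# hypothetical source systems into a single master dimension table.
from pyspark.sql import functions as F

crm   = spark.read.table("raw_crm_customers").select("customer_id", "email", "country")
erp   = spark.read.table("raw_erp_customers").select("customer_id", "email", "country")
shop  = spark.read.table("raw_webshop_customers").select("customer_id", "email", "country")

conformed = (
    crm.unionByName(erp).unionByName(shop)
       .withColumn("email", F.lower("email"))      # light standardization
       .dropDuplicates(["customer_id"])            # naive survivorship rule for the sketch
)

conformed.write.format("delta").mode("overwrite").saveAsTable("silver_dim_customer")
```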
So, it's been quite a long video — I've been talking for a while — but I think this is really important material, because people are making these decisions about which data stores to use in their organizations. Hopefully you now have a better understanding of the key features of each of these, some of the limitations, and the things you need to bear in mind with each of them, so that you can make good, informed decisions about when to use each one. In the next video we'll be focusing on data validation — one of my favourite topics: how do we introduce data validation into an architecture like this to build a system of data quality assurance at each stage of your data processing pipeline? Make sure you join us for that.
Info
Channel: Learn Microsoft Fabric with Will
Views: 8,568
Keywords: lakehouse, data warehouse, synapse, kql, kql database, fabric notebook, shortcut, power bi, microsoft fabric, power bi training, data analytics, data science, data engineering, workspaces, shortcuts, databases, medallion architecture
Id: Cs6cEeSSTVw
Length: 30min 18sec (1818 seconds)
Published: Thu Mar 14 2024