Learn Together: Use Apache Spark in Microsoft Fabric

Captions
[Music] Hello and good afternoon! How are you doing, Nikola? Hi Heini, good to see you, and glad to be here with you today. I'm doing well, and you? I'm doing really well too, and I'm really excited that we got the topic of how to use Apache Spark in Microsoft Fabric. Same here, because for me it's one of the parts of Fabric that I like using the most. I have to be honest, I'm still learning it, but the more I learn about Spark in Fabric, the more I love it, so I'm also very excited about this session. Exactly. I think it's one of those things where the start can feel a little tricky, but once you get going it gets a whole lot easier and you find out how powerful it is. I couldn't agree more.

Today we are presenting along a Learn Live module; the link and a QR code you can scan are on this slide, and this is the second session of the Learn Together series. I'm Heini Ilmarinen, I live in Finland and work for a Finnish company called Polar Squad. I work with DevOps topics, but I look at them from the data platform perspective, and I'm also a Data Platform MVP. With me today is Nikola, a fellow MVP. Thank you. I'm Nikola, originally from Belgrade, but for almost eight years I've lived in the beautiful city of Salzburg in Austria, where I work predominantly with Microsoft Fabric and Power BI. Because everything in Salzburg is in the sign of Wolfgang Amadeus Mozart, I took the nickname Data Mozart, so I'm trying to make music from data. That sounds amazing, and hopefully with this session people will also be able to make everything harmonious within Fabric. Let's hope so.

Today we are also joined by two amazing moderators behind the scenes. They will be interacting with you in the comment section if you have any questions and relaying those comments to us. We have K. Saer, a senior BI developer and fellow Microsoft MVP, and Sun Kumar, also a Microsoft MVP and a data manager based in Singapore; their LinkedIn contacts are on the slide. I'd say we're in good hands, because it's quite difficult to follow all the questions while being live, so it's really helpful to have the moderators behind the scenes helping us out. Absolutely, and as far as I can see there are more than 300 people currently watching the live stream, so I'm deeply grateful to our moderators for joining us today and helping handle the questions.

Great. Here are the links so you can follow along: we are following the Use Apache Spark in Microsoft Fabric Learn module, and we will go through the same flow as the module, with the chance to add some extra information along the way, so hopefully you get a little more out of this than just reading the Learn module itself. Exactly, and as Heini said, after you watch this you can go and practice on your own, and it will probably be much easier.
Of course we want this session to be very interactive, so please write in the chat: say hello, tell us where you're from and what your experience with Fabric is, and ask questions. That is the benefit of watching this session live; you may get those questions answered right here, so please do engage and let's make this as interactive as possible. For learning Fabric there is a really good resource shared here: the Fabric Career Hub. There is a lot of good content there you can use for your Fabric learning. For example, there is a 50% discount on exams, which is always very useful, and you can also find videos, career advice, and similar resources that are worth a look. What I wanted to add is that you are not alone on this journey. Learning Fabric can look like an intimidating task, and the Fabric Career Hub is a great place to start your learning path, but also to network with other peers who are looking to learn Fabric and take the DP-600 exam. Exactly, and there are specific details about the kind of role you could have in the context of Fabric, such as analytics engineer or data analyst, sample architecture diagrams, the skills you should have, and even an expected average salary (I'm not sure which region that applies to). You can start your certification journey from the Career Hub: join a Cloud Skills Challenge, find all the Learn Together series information, register for an exam, and quite soon there should be a practice exam available as well, and then you can schedule the exam. Sounds easy and straightforward. Exactly, very simple; the one remaining part is to accumulate the knowledge that is needed, and we are now on step two of the Learn Together series, with session two today and the exam cram later to help you prepare. Very true.

Here is also the information on how to apply for the 50% exam discount. Make sure to use it, since it is quite a substantial discount. You then have 30 days to complete the challenge, which means going through all the modules related to that specific exam. So if you join this session today, go through the module afterward, and do the same with the other sessions, you will have completed the challenge as well, so make sure to sign up for it. One pro tip, if you don't mind: there is an ongoing Cloud Skills Challenge which ends in two days, I think on the 19th of April, and that one gives you a 100% discount. So if you are fast enough, but please don't go now, go after this session, you can join that Cloud Skills Challenge, quickly complete those eight modules, and get the 100% discount. Yes, you have to be really, really fast. Very well. Here is also the link to the exam cram: after this Learn Together series there is an exam cram coming.
It will be on May 8th, from 4:00 to 6:00 Central European Time, so make sure to put that in your calendar. It is essentially a prep day for the exam itself, where you will also see some things about the exam format and how to be ready to take the exam. And with that we start to get towards today's content. We are at the second session of this series: yesterday's session was about how to get started with end-to-end analytics, and now we are going to zoom in on Apache Spark in more detail. The following sessions then zoom in on the other parts of Fabric, so you can build a lot of knowledge in those areas as well. We will take this journey one piece at a time so it's easier to keep up. Are you ready, Nikola? Yes, I'm eagerly looking forward to going through this session and showing all the things you can do with Apache Spark in Fabric.

Great. So what will we be covering today? We are going to look at how to configure Spark, that is, what you need to do to be able to run notebooks or jobs with Spark in a Microsoft Fabric workspace. Then we'll look at how to identify suitable scenarios for Spark notebooks and Spark jobs, the two flavors of using Spark within Fabric. Then we'll look at how to use dataframes to analyze and transform data, which is the traditional way of working with data in a Spark notebook. We'll also look at how to use Spark SQL to query your data; if you are more familiar with the SQL language, this can be a little more approachable than some of the other languages, and one benefit of Spark is the huge variety of languages you can use. And lastly, we might want to visualize data, so today we'll look specifically at how you can create visualizations within a Spark notebook. Sounds great to me.

Let's start with a little introduction and the question of what Spark actually is, because that is an important place to start. Spark is an open-source parallel processing framework. You might have heard the term parallel processing in other contexts; it means that we are able to split a specific task across multiple compute nodes so that the task runs in parallel, and since we are talking about data, it means we are able to process our data in parallel. Behind the scenes in the Spark architecture there is a control node, the one that orchestrates how a job is split and which cluster nodes it runs on, and then there are the worker nodes that actually run the queries. Luckily, when using Spark in Fabric, we don't have to worry about that. Thanks to this architecture, Apache Spark can handle really large-scale data processing and analytics. Sometimes it can almost seem like the performance is the same whether you run it on a hundred-row data set or a much larger one; it scales as the data volume scales, which can be a little unexpected if you haven't worked with Apache Spark before.
As I already mentioned briefly, there are many languages we can use for our coding: Java, Scala, PySpark, or even SQL if we wish. What is your favorite flavor, Nikola? My favorite flavor is definitely Spark SQL, because I'm coming from the SQL background, so whatever can be done in SQL, I do in SQL. Thanks for asking, and yours? For me it's PySpark. It's just so powerful within the Apache Spark framework; there are so many cool things you can do, and we'll show some of them today. Absolutely.

When you use Spark in Fabric, the nice thing is that you don't have to set up the control node and the workers, and you don't have to set up a lot of networking and compute to get going. Instead it is fully managed, and you just need to configure a few things that we will look at in a little bit. There is even a default Spark cluster set up for you, so you don't necessarily have to do anything to get started. Because Spark is open source, you could of course also run it in your on-premises environment or in your own cloud environment, anywhere you want, but then you would have quite a bit of management overhead to take care of. In the context of Fabric we don't need to do much setting up; it is ready to go for us. Each workspace you create has a Spark cluster assigned with some default configuration in place, and there is the possibility to edit it, which we will see in practice in a moment. I think the view has actually changed a little compared to this slide, but it still shows the core things there are to configure when setting up the Spark cluster for a workspace. It is the workspace admins who manage these settings in the workspace settings: you can choose the node family and size, set how many nodes the cluster can have, choose the runtime version if you want to direct people to a specific one, and set some other Spark properties as well. That is the gist of it, and we will see a little more as we go into the demo.

One of the most common things we need when running Apache Spark is something in our notebook that is not there by default, and in that case we might need libraries that we import into the notebook. Out in the Spark open-source ecosystem there are countless code libraries, almost a library for anything you might need: for example, handling coordinate systems, which is a very specific thing you might need with coordinate data, or more basic processing tasks. These libraries can be managed at the workspace level, where you say which library you need, which version, and where it is sourced from, and then those libraries are available to all of the users in that workspace. That makes management quite a bit easier than having to import everything and somehow manage it separately per user or per notebook.
And with that, I think we are ready to go to the demo to look at how this works in practice, so I will hand it over to Nikola. Thank you, Heini, that was a great introduction. It's nice when people hear about certain things, it's easier when they see them with their own eyes, and the final stage, I would say, is when they practice on their own. I'll do a quick tour of the Spark configuration that Heini mentioned in the previous few minutes. I'm currently in my Fabric-enabled workspace, and I sincerely hope you already know how to create a workspace; if not, go and watch session one from yesterday. Just in case someone here didn't join yesterday's session for whatever reason: you go to Workspaces on the left-hand side and simply create a new workspace. In this case I already prepared a workspace called Learn Live. It's fresh as a daisy, nothing is in there yet, and we will build some things in it today, but before we do, we need to make sure our configuration is set up properly. Depending on the resolution of your screen, you might see the Workspace settings tab listed next to Manage access; because I'm using a different resolution, it's under the three dots. Once I click on Workspace settings there are a bunch of different things we can set up and configure, and in this case we are interested in the Spark configuration. As of today this setting is all the way down at the bottom, under the Data Engineering/Data Science tab; as you saw on the slide, things change rapidly in Fabric, so it may get moved somewhere else. Expand that tab, and under Spark settings there are different configuration options you can adjust to your needs. For this demo today, and for preparing for DP-600, I would say you are good to go with the default settings, but in real life, in a production environment, you may want to change certain things: instead of using the starter pool that is provided by default, you might create a new pool, change the number of nodes, or adjust other options, such as turning the ability to customize compute configurations for items on or off. Then there are environments, which I would say are one of the more interesting parts of the Spark configuration. My understanding, and Heini, please correct me if I'm wrong, is that if you have different libraries, you can create multiple environments within one workspace and then provide access to specific libraries to certain groups of users. A very useful feature in Fabric. You can also change the runtime version; the one I'm currently using is the latest, not the experimental one, but you can play with that on your own. So that is where you set the default environment and where you set the runtime version.
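If a single notebook just needs an extra package for the current session, rather than a whole shared environment, Fabric notebooks also support in-line installation from a cell. A minimal sketch, where the package name is only an example (seaborn is used later in this session and may already be preinstalled):

```python
# In-line install, valid only for the current Spark session;
# the package shown here is just an example
%pip install seaborn
```

For libraries that several people or several notebooks need, the environment-level setup described above is the better fit.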
Then also, under high concurrency, I suggest you keep this option turned on, because then multiple notebooks can use the same Spark application, which reduces the start time, the time needed for the Spark cluster to get up and running. There is also automatic logging, which is helpful for automating machine learning experiments and models in the background. As I said, in real life you might want to change some of these settings, but for the purpose of our demos today and for your own DP-600 preparation you are good to go with the defaults, I would say. What do you think, Heini, does anything need to be changed at this point or not? I think we are good with these. Maybe just one comment: previously the slide showed a Libraries link right below the Spark settings, and this has now moved so that it is contained in the environment. When you go into the environment you create, you can also specify which libraries you want to install. As you can see, things move around at quite a fast pace, so it's good to keep a bit of an experimental mind: if you don't find something later on, the UI may have changed a little, so just click around and figure out where the specific setting has gone. This will probably keep happening for quite a while. Yes, and if you watch this session after a few months and can't figure it out, don't blame us, blame Microsoft. And of course the Fabric documentation always has the most recent instructions, so if you search for Spark library management you will find how, when creating an environment, you specify which libraries to install. So that was the short introduction to the Spark configuration before you start doing the real things. Handing back over to you. Thank you.

Once we have our compute set up the way we want, we of course want to run some Spark code, so we will go back to the slides for a moment to look at the basics and then come back to the demo. When we talk about running Spark code, we have two options to choose from: we can use notebooks, with whichever of the Spark languages we would like, or we can define a Spark job definition and schedule that job in whatever way we like. The nice thing about notebooks is the interactive environment for code and collaborative notes. We have the option of adding comments directly into a code cell, but we can also add a markdown cell that renders text. The idea of a notebook is to capture both your logic, in code, and the comments and descriptions that explain what the code does; if you use that well, it makes it much easier for others to understand your code. For me, when I was getting into Spark for the very first time some years ago, it was confusing how a notebook actually runs when you schedule it.
But it is really just the UI that makes the cells visible to us; the notebook is a file with a specific format, so it can also run on a compute cluster programmatically, without the interactive view we have on this slide. In a notebook we can go cell by cell, creating both comments and, of course, some logic. A Spark job definition is different: it is a non-interactive script. In a notebook you can go cell by cell and see the output each cell produces; for a Spark job you create a script and hand it over. You can see the fields that come up when you create a Spark job definition: you provide a main definition file, which at this time can be a Python file, uploaded from your local machine or from an Azure storage account; you can add reference files if your main file references them; if the script requires any command-line arguments, you can put them into the text box we have on the screen; and if it uses a specific lakehouse, you can specify that here as well. So you can think of a Spark job as a script you provide, whereas a notebook is where you work in cells of code and text; the biggest difference is that with notebooks you work interactively, whereas a Spark job definition is a non-interactive script.

If we start to get a little closer to the data itself and look at reading a file, one common format is of course CSV. If you have a file with a header row, Spark will try its best to infer the schema in use. With some file formats it does better than with others; some formats even carry the schema metadata with them, and we will talk about Parquet a bit more later. Even for CSV it is able to infer the schema to some extent, though it is not always perfect, and you might need to make some adjustments to the schema yourself if it is not exactly right. There are many file formats we can read into a dataframe, so there are quite a few options for source data. The other option is to set a schema explicitly when we read data into a Spark dataframe, in other words, to state what our data actually contains. With this same data set we can say we have a product schema with these fields: ProductID, which is an integer; ProductName, a string; Category, again a string; and ListPrice, a float. So we specifically say what the type of each column is, and then when we read that CSV file we use the spark.read.load command: we set the path of the file we are reading from, give the format, pass the schema we just created, and, since in this case the file has no header, set the header option accordingly. There are multiple options we can specify in the read command, the format being one of them, so for another format you would specify that differently, and the header information can be important, especially with the CSV format.
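As a rough sketch of what such a cell could look like in PySpark (the file path is only an illustration, and the spark session object is the one Fabric notebooks provide automatically):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

# Explicit schema matching the product data described on the slide
product_schema = StructType([
    StructField("ProductID", IntegerType()),
    StructField("ProductName", StringType()),
    StructField("Category", StringType()),
    StructField("ListPrice", FloatType())
])

# Read a headerless CSV file from the lakehouse Files area using that schema
df = spark.read.load("Files/data/products.csv",
                     format="csv",
                     schema=product_schema,
                     header=False)
display(df.limit(10))
```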
Then, of course, once we have read our data into a dataframe, we might want to transform it in some way. In most cases we don't just want to copy our data from one place and put it in the exact same shape in the next, and there are a few very common things we might need to do. With select we are able to create a new dataframe where we select specific columns from the original dataframe: we just provide a list of the column names in the select statement, and that gives us a new dataframe with maybe a bit less data in it. Another very common thing is filtering rows, and we might use a where clause to do that. In this example we are creating a bikes dataframe: we use the original dataframe to select three columns and then filter to rows where the Category column equals Mountain Bikes. It is good to notice that in a lot of cases there are alternative ways of doing things; for filtering we have the where method, but there is also a filter method, so once you start to use Spark a little more you will come across functions with similar capabilities and need to figure out which one to use. The last example is grouping and aggregation, in this case a groupBy clause: again we use the original dataframe and create a counts dataframe per product category by choosing two columns, grouping by the Category column, and applying count. What this produces is each category together with the count of rows in that category, very similar to the SQL world, where you could do a GROUP BY and a COUNT on a column to get the total count for each category.
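Put together, the three operations just described could look roughly like this, assuming the df loaded in the previous sketch:

```python
# Select a few columns and filter the rows (filter() behaves the same as where())
bikes_df = df.select("ProductID", "ProductName", "ListPrice") \
             .where(df["Category"] == "Mountain Bikes")

# Group by category and count the rows per category
counts_df = df.select("ProductID", "Category") \
              .groupBy("Category") \
              .count()
display(counts_df)
```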
I know this can feel a bit strange when you just see it on slides without any context, so that is why we are going to go to Nikola to show us in a demo what this actually looks like in practice. That's why I'm here. Thank you, Heini, that was a great overview and a great introduction of the things we are going to cover now. Again, we are starting from a completely empty workspace, and first we need to take, to be honest, not one but two steps back: in order to work with Spark we need some data, and we need to store this data somewhere, so we need a lakehouse to store the data in Microsoft Fabric. What I'm going to do now is click on the New tab at the top and create a new lakehouse. I'll give it the name Learn Live and wait a few seconds for Fabric to create this lakehouse, along with the additional items that go hand in hand with creating a lakehouse, the SQL analytics endpoint and the default Power BI semantic model, but you will learn more about those two in the next sessions. Once I'm in the lakehouse user interface, on the left-hand side I see the Explorer area with two main folders, Tables and Files. Tables is the managed area of the lakehouse, where Spark takes care of managing the tables in that folder, whereas Files is the unmanaged area, and you can store literally any type of file there, not just structured data like Parquet or Delta but also non-structured formats. In this case we'll keep it simple for this demo and use a bunch of CSV files. To bring those files into the lakehouse, which is now completely empty, I'll click the three dots next to Files and choose the Upload option. In real life you would probably not do it this way; bringing data into Fabric by uploading files manually would be really odd, and there are more convenient and efficient ways to do it, like using a pipeline to orchestrate your data ingestion process, but as I said, this is just for demo purposes, so let's keep it simple. I'll click Upload, then Upload folder, and find my folder, here it is, orders. I click Upload, and as you can see there are three CSV files included in this folder, one for each year: 2019, 2020, and 2021. This is now loaded into the lakehouse, and on the left-hand side I have the orders folder, which, once I refresh it, contains the three CSV files. I can quickly preview the data for each of them to see what it looks like, so now we have data in our lakehouse. That was the first step, and now we can start leveraging all the things Heini mentioned in the previous few minutes.

While I am in the lakehouse area I can create a new notebook directly from here. You can of course do it from the workspace user interface as well, but when you do it directly from here, the lakehouse is automatically attached to the notebook, and you can immediately start writing your PySpark, Spark SQL, or whatever language you prefer in this notebook. I'll click Open notebook, the down arrow, and choose New notebook. There you go: a very basic notebook with just one cell at the moment, but I promise we will have a lot more during the demo. The cool thing about notebooks, and what I love about them, is that you don't just write code inside them; you can also use markdown text. If I want a cell to be markdown instead of code, I simply click the M icon with the down arrow and it converts the cell to markdown text, and of course I can edit the text and write, say, "we all love Fabric". From there I can add new cells, both code and markdown, depending on what I need. Let's start with loading some data into a dataframe; the dataframe is essentially the basic structure that exists within Spark. Before we proceed, one more thing, because Heini already mentioned that you can use different languages within the notebook: the default one is PySpark, but if I click here at the top I can choose between Scala, Spark SQL, and SparkR, and we will use Spark SQL in one of the next demos. So let's start loading our data into a dataframe. I'll expand orders and start with the 2019 file: if I click the three dots next to it and then Load data, I can select to load the data into Spark, and this creates a code cell with the code automatically generated for me. Let me just close this pane so we have a bit more space here.
Now let's run this cell. I'll click the arrow on the left-hand side, which executes just this single cell; you can of course run them all from the top, but let's start with this first cell and see what we get once it completes. We are doing this live, so hopefully the demo gods are with us; you never know what will happen. Especially when you run it for the first time, it needs some time to start the session, for the Spark cluster to get up and running; usually that takes less than ten seconds, but today it's a little slower. Now it has started, and it shouldn't take long. One neat thing: when you are running the cells, if you have clicked into a cell you can just press Shift+Enter, and that runs that specific cell as well, so once you get really fast and going, it's very convenient. I'll keep that in mind, thank you.

So this looks great, right? Or maybe not. I would like to ask the people in the chat what is wrong with the results of this dataframe, and because we don't have too much time to wait for all the answers: I assume you noticed that our column headers look a little strange. Those are actually records from our CSV. I would say so too; it looks like it took some of our data as the header row. Exactly, and that's because our CSV files don't have headers, while the generated code left the header option set to true when creating this dataframe. If I change this to false and rerun the cell, fingers crossed, it looks better. It looks better, but I'm still not happy: now we have column names like _c0, _c1, _c2, which are still not user friendly, so let's go and fix this as well. Remember when Heini told you about creating an explicit schema for your dataframes; that is exactly what we are going to do now. I'll create a new code cell and paste the code here. This time I'm importing the Spark SQL types and explicitly defining the schema for my dataframe: I want to call my columns SalesOrderNumber, SalesOrderLineNumber, OrderDate, CustomerName, and so on, and I'm specifying the data type for each of these columns. Finally, I'm reading the data from the CSV file, loading what is stored in the 2019 CSV file within the orders folder under the Files section of the lakehouse. Let's see how it looks now. It seems our audience is awake, because they also noticed that the data typing wasn't right in the first version. That's true, and now we have everything properly set: you can see the data types here, for example this one is a string, SalesOrderLineNumber is an integer, the order date is a date, and so on. This looks good now, I don't know if you agree with me. Yes, definitely. The only thing is that we just started, so we have data only for 2019, but remember we have three CSV files, for 2019, 2020, and 2021, and I want to load all of that data into this dataframe, not just 2019.
So what I'm going to do is replace the explicit 2019 value with a wildcard symbol, a star, and once I run this I hope to get data from the other years as well. Yes, here it is: you see data from 2021. You can use this wildcard symbol to load data from multiple files at once, so you don't need to go through every single file one by one. In this case the output is just a subset of rows, so I can't confirm on screen that there is also data from 2020 and 2019, but it's obviously not 2019 only anymore. Now I would say we are ready to do some data exploration. Now that we have brought the data into the lakehouse and loaded it from the files into our dataframe, it's a good moment to do some basic data exploration within the dataframe using PySpark, and I promise I'll cover all the things Heini mentioned on the slides: filtering, grouping, aggregating, and everything else. Let's start with filtering, and I'll add a markdown heading like a real professional: filter data. Very good. So let's filter our data. I'll create a new code cell, run this code, and then explain what we have here. You can filter data in multiple different ways. If we want to vertically filter the data, that is, get just some of the columns from our dataframe, I can specify those columns in the dataframe definition; in this case I'm interested only in the customer name and email, nothing else, so I can omit all the other columns. I can also perform some simple aggregate functions: here I count the total number of rows in this dataframe, and you see there are 32,718, but I can also perform a distinct count. If you're coming from the SQL world or the Power BI world you are familiar with the difference between count and distinct count, and you see that there are roughly 12,500 distinct customers, along with the list of those distinct customers. That is simple vertical filtering, where we keep just the columns we need for our data exploration. What else can I do? I'll create a new cell and run it; this time I'm enriching the previous cell with a where clause, and I'm returning all the customers who bought this product, Road-250 Red. You see that there are 133 distinct customers, plus the total count of customers who bought this product, so you can include a where clause very much like what you would do in the SQL language. The next topic is aggregating and grouping the data in the dataframe, so let me create another markdown heading, aggregate data, and I'm becoming a professional at this. I'll create a new code cell and paste the code; to quickly explain what we have here: we are again creating a dataframe by running the select method, and we want to group the data, so let me zoom in a little. We are grouping the data by the item and then running a sum over a numeric column, which is Quantity in this case, so this query returns all the items that exist in our dataframe along with the sum of the quantity for each of those items. It looks very simple.
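A compact sketch of those exploration steps, assuming an explicit order_schema defined like the earlier one and column names that follow the module's sample data (CustomerName, Email, Item), both of which are assumptions here:

```python
# Read all three yearly files at once with a wildcard
orders_df = spark.read.format("csv") \
    .schema(order_schema) \
    .option("header", "false") \
    .load("Files/orders/*.csv")

# Vertical filtering: keep two columns, then count rows and distinct customers
customers = orders_df.select("CustomerName", "Email")
print(customers.count())
print(customers.distinct().count())

# Add a where clause: customers who bought one specific item (the product name is illustrative)
road250 = orders_df.where(orders_df["Item"] == "Road-250 Red, 52") \
                   .select("CustomerName", "Email")
print(road250.count(), road250.distinct().count())
```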
My friendly advice to you: if you plan to take DP-600, pay attention to these methods and to the order in which the methods are run in the Fabric notebook. The next thing I want to show is how to change the names of columns, which is also part of data exploration. Again I'll use the Spark SQL functions for this example. In this case I'm essentially extracting the year from the order date; remember the OrderDate column, if I scroll back up, here it is. The idea is to extract the year from it, and that's exactly what this does: it extracts the year from my OrderDate column and gives it an alias called Year, which becomes the name of the column in our dataframe. On top of this I want to group the data by year, count the total number of records we have for each year, and finally sort by year, so in the result you see how many records we have in our dataframe for each year. If you have a SQL background, the logic is similar: you start with a select, then you do a group by, and then an order by, and in case you need a where clause, it goes before the group by. So this is reminiscent of regular SQL.

Additionally, what you can do with PySpark and Fabric notebooks is transform data files, and I would say this is a very common task when you work with data: the raw data you get is probably not of the appropriate quality, and you need to enrich your data set with some additional logic, additional columns, and so on. You can do that easily with a Fabric notebook, so I will show you how to transform a dataframe and add some additional columns. Let me paste the whole thing and then go through it one by one, because you only see the results once everything is in place. Again I'm using the Spark SQL functions. In the first step, line three here, I'm adding an additional column called Year, and as in the previous example I'm using the year function to extract the year from the OrderDate column. I'm doing the same for month: I create an additional column called Month, which uses the month function to extract the month from the OrderDate column. There is similar logic for creating FirstName and LastName fields from the full name of the customer: remember we had the full customer name, and I want to split that column on the blank space between the first and last name, take everything on the left of that blank space as the first name, and take everything on the right of it as the last name. Finally, I can reorder the columns: let's say I want to see Year and Month straight after the order date, which makes sense, and the first and last name before the customer's email. So I can reorder the columns within the dataframe, and in this case I'm displaying the top five rows of this transformed dataframe. As you see, here are our new columns: Year, Month, FirstName, and LastName. It's very straightforward and easy once you get your head around it.
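A sketch of that transformation, with the column list assumed to follow the module's sample schema:

```python
from pyspark.sql.functions import year, month, split, col

# Derive Year/Month from OrderDate, split CustomerName into first and last name,
# then reorder the columns
transformed_df = orders_df \
    .withColumn("Year", year(col("OrderDate"))) \
    .withColumn("Month", month(col("OrderDate"))) \
    .withColumn("FirstName", split(col("CustomerName"), " ").getItem(0)) \
    .withColumn("LastName", split(col("CustomerName"), " ").getItem(1)) \
    .select("SalesOrderNumber", "SalesOrderLineNumber", "OrderDate", "Year", "Month",
            "FirstName", "LastName", "Email", "Item", "Quantity", "UnitPrice", "Tax")

display(transformed_df.limit(5))
```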
And it's nothing to be afraid of: you can use the full power of the Spark SQL library to transform the data, by filtering rows, deriving, removing, and renaming columns, and applying any other modifications and adjustments your business case needs. I think that's it for this demo. I'll show you how to save this transformed data in one of the next demos, but before that I'm handing back over to Heini. Great, thank you, that was quite a bit of transformation and reading happening in one go. Maybe at this point we can take one of the questions from the chat. There was a question about what happens behind the scenes when you run different languages in a notebook: are there, for example, separate VMs running Python and a separate one running R, and how do you pass data from one language to another? You can definitely run different languages in different notebook cells, even within one notebook. There is a magic syntax for this, using the percentage sign; for example, if you wanted to run SQL in a PySpark notebook you could use %%sql, so you can always change the language for a single cell in your notebook. It all runs on the Spark cluster, so there is no difference there, and on the Fabric side we are not managing VMs at all, although somewhere behind the scenes the compute is of course running even if we don't see it on the surface. Within Spark there is the Spark core engine, and the different languages are built on top of that, so that is what happens behind the scenes when you run those different languages. How you pass the data depends a little bit, but for example if you run a cell of SQL code in your notebook and do a select on your data, then, I believe, there is something like _sqldf that you can reference as a dataframe holding the output of that SQL, so you can continue processing that data in the next cell, for example with PySpark again. There is sometimes a little bit you need to do to transfer seamlessly between different cells and languages.

With that, we can continue on to how to save a dataframe, and go back to the slides. Once we've done all the filtering and transformations we want for our data, we want to save that dataframe somewhere, and today, on the Fabric side, we are really working in the Files section of the lakehouse Explorer, so we have that data in the OneLake layer. When you are ready with your dataframe, you use the write function to write that data wherever you would like. There are several modes you can use: you can overwrite, and there is also an option to append if you just want to keep adding data. If you are working with a full data set that always gets entirely refreshed, you would want to overwrite; it's different if you receive a new data set that needs to be added to what you already have, so you always need to consider that. In this case we are writing a Parquet file, and in the parquet call we specify where it should go: the Files section of the lakehouse, a subfolder called product_data, and we save the dataframe as bikes.parquet.
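In code, that write could look roughly like this, reusing the bikes_df from the earlier sketch:

```python
# Overwrite any previous output; use mode("append") to add to existing data instead
bikes_df.write.mode("overwrite").parquet("Files/product_data/bikes.parquet")
print("Data saved")
```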
Then, if we have a whole lot of data, we might need to partition it in some way. What this does behind the scenes is divide the data set into different folders: if you have, say, a data folder, it creates a subfolder per value of the partition key you are using. In the command on this slide we are partitioning by Year, which means we get a different folder per year, for example Year=2020 and Year=2021, and when we later query that data further, we can specify which of the partitioned parts we want to read. Again we use the overwrite mode and Parquet, and everything goes into the data folder; we don't need to specify the partition folders ourselves, because that is done by the partitionBy method. With that, I think we are ready to see how this looks in practice as well. Happy to show you; this will be a short demo, since we are basically showing how to save the transformed data. Remember that in the previous step we already transformed the data as we wanted by adding additional columns, and I want to save this transformed version of the data so people can reuse it for their workloads. As we saw on the slide, the writing mode here will be overwrite, so if there is a file with the same name, this code will overwrite it; you can also choose to append if necessary. We specify Parquet as the format. Parquet is preferred for data files, I would say, because it is very suitable for analytical workloads: it is a columnar format, it contains metadata, it is very performant, and it compresses data well, so it is the de facto standard in today's analytics world when working with files. In this case I will save the dataframe as Parquet within the Files area of the lakehouse, in a transformed_data folder, under orders, and finally, if everything goes well, I want to print out the message "Transformed data saved". I'll run this, it takes a few seconds, and there it is: transformed data saved. If I go back here, don't pay attention to this other item, because I was testing something previously, and refresh the Files area, you see transformed_data with my orders folder, and Fabric automatically created three Parquet files for this dataframe. I'll close this and also show you how to partition the data. In this case we don't have too many rows, 30,000 or so, but if you are dealing with huge amounts of data, partitioning can be a lifesaver, so it is not a bad idea to partition the files when you write data to a lakehouse. In this case I'm partitioning the data by year and month, again overwriting, and everything else stays the same.
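For reference, the partitioned write from the demo would look something like this (the target folder name is illustrative):

```python
# Spark creates one folder per partition value, e.g. Year=2019/Month=1/...
transformed_df.write.partitionBy("Year", "Month") \
    .mode("overwrite") \
    .parquet("Files/partitioned_data")
```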
Ah, I skipped one step, sorry about that: first I need to read this data back into a dataframe, and once it is in the dataframe I can write it out. Now it runs successfully: transformed data saved. Again I will refresh the files here and go to orders... no, that's the same one; it should show different months, so maybe I skipped something. Let me check the code quickly, and if not, we won't waste too much time: year, month... ah, the files go under partitioned_data. Maybe you just have to refresh; I hadn't seen the partitioned data yet. Yes, sometimes it's good to go one step back and check what you wrote. So under partitioned_data you see that I have a different folder for every year, and under each year the data is also partitioned by month; if I click into 2019, there is one Parquet file for each month. That's how you partition the data, again very helpful in scenarios where you deal with large amounts of data in files. And that's all I had for this demo, handing back over to you, Heini, and moving on to my favorite part, to be honest.

Great, so we are now going to look at Spark SQL, which, as you might guess from that comment, is Nikola's favorite part. There was actually a question in the chat about the difference between SQL and Spark SQL. Remember that Apache Spark is a distributed system, both in the sense that we have all those worker nodes and in the sense that compute is separated from storage: in Fabric our storage layer is OneLake, and the compute layer is separate from it, whereas on the relational database side those are in most cases coupled. There are of course many different flavors of SQL, but compared to those, Spark SQL is specifically optimized for scenarios where we have distributed data sets and this distributed computing model, and it is able to handle both structured and semi-structured data. Those are some of the differences, but the syntax is quite similar, which is nice if you have been using SQL previously. One way we might work with Spark SQL is to use the metastore to define tables and views. Here we first create a temp view, which means it is available to our notebook in the Spark session we have going; the first line creates the products_view temporary view in the metastore. When we create a new table in the metastore, we can use saveAsTable with the Delta Lake format. Delta is an abstraction built on top of Parquet: in addition to the Parquet-formatted files we will see in the folder, we get a separate folder with metadata related to the Delta format, and that is what enables us to have, for example, ACID transactions with this file format. We are not using a relational database here; we actually just have separate files in a folder. With the second line here we create a table named products. It is also possible to create an external table, where we specify the path in the Files section of the lakehouse where the data for the table is saved: again we take the dataframe we want to save, use the write method, use the Delta format, and save the table with saveAsTable so we give it a name, but in addition we give it a path where it is stored in the Files section. With all of these, we will now see something more appear in the lakehouse Explorer than just the Files section, and we will see in practice in just a moment what appears when we use these commands.
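A sketch of those three variants in PySpark; the table names and the external path are only illustrative:

```python
# Temporary view, visible only for the current Spark session
transformed_df.createOrReplaceTempView("orders_view")

# Managed Delta table registered in the metastore (appears under Tables)
transformed_df.write.format("delta").saveAsTable("sales_orders")

# External Delta table: the metadata lives in the metastore,
# but the data is stored at a path you choose in the Files area
transformed_df.write.format("delta") \
    .option("path", "Files/external_data/orders") \
    .saveAsTable("external_orders")
```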
On the other hand, we might also want to use Spark SQL to query data. Behind the scenes there is the Spark SQL API that we can call from PySpark, for example: we are still running in a cell that uses PySpark, and there we can simply reference spark.sql and give the query we want to run inside the parentheses. There was also a question in the chat about wrapping code across lines, and this example shows how to do that: you can use a backslash to split your code onto separate rows. We are also able to use Spark SQL directly, using the magic command we see here, to run our SQL query just as we normally would. With that, let's move on to the demo. Thank you. We are moving back to our notebook; I already have sixteen cells, and you see this one is called Use SQL. Finally! In this cell I'm creating a new table, which will be called sales_orders, and I want to save it as Delta; Heini already explained what Delta means, so let's go and create this table. The other command, which starts at line five, essentially gives me a nice description of my newly created table in terms of column names, data types, and so on. Let's be patient for a few more seconds while it runs... nice. So this is the description of my table: all the columns, all the data types, and other details I might find useful. And if I refresh my Tables folder, finally I see a Delta table here. You can recognize it by the small triangle icon in the bottom-right corner, which means this table is stored in Delta format. That's the first thing I wanted to show you. The next thing is how you can query the data: I'll go to sales_orders, Load data, and load it into Spark, so I can then manipulate this data using Spark SQL. In this case I'm just returning all the columns and the top thousand rows from my sales_orders table. You see the naming here: this essentially enables you to combine data from different lakehouses, or from a lakehouse and a warehouse. If I wanted to use this table in conjunction with data coming from a warehouse, a three-part name would be required, the name of the lakehouse, the name of the schema, and then the name of the table, and then I can combine data from lakehouses and warehouses. Also, as Heini already mentioned, there is the magic command: when I write the double percentage sign and then the language, %%sql, that indicates that the Spark SQL runtime should be used to run the code in this specific cell instead of PySpark, which is chosen as the default language. If I run this one, those of you who are familiar with SQL probably know what to expect: we are extracting the year from the order date and doing a sum over the result of multiplying unit price by quantity and adding the tax value on top.
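As a sketch, the same kind of query run through the Spark SQL API from a PySpark cell might look like this (the column names are assumed from the sample data, and the backslashes continue the statement across lines):

```python
# spark.sql() returns an ordinary DataFrame you can keep working with
yearly_sales = spark.sql("SELECT YEAR(OrderDate) AS OrderYear, \
                                 SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue \
                          FROM sales_orders \
                          GROUP BY YEAR(OrderDate) \
                          ORDER BY OrderYear")
display(yearly_sales)
```

The same statement placed in a cell that starts with %%sql runs as Spark SQL directly, without the spark.sql() wrapper.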
At this stage you can write Spark SQL, which from my experience is about 98% the same as ANSI SQL, so almost all the features and functions work the same way. With that I won't spend any more time talking about SQL; I'm handing back over to you, Heini, for what I think is the last part of our session for today.

Yes, the last bit we want to talk about is visualizing the data. We're not going to the Power BI side of Fabric at this point, but there are actually ways to visualize data in a notebook itself. You might have spotted in some of the notebook outputs Nicola has been showing that there is not only a Table tab but also a Chart tab, and that is a very easy way to create simple graphs. You have a few different chart types to choose from, and you typically need to choose some kind of key and values for the graph: in this sample the key is the item, so the different slices of the pie chart are the different items, and their size is determined by the sum of the quantity. So this is a really easy way to do charts, but of course there are also graphics libraries or packages we can use to plot our data within a notebook, for example the matplotlib package, so that you can build more complex visualizations and have a lot more flexibility. Then again, you need to be familiar with the specific code: how to set the different variables and other things that are needed, so you have to know a bit more about that particular package. Both are options, and you can really get started with simple visualizations within the notebook itself; of course, if you run a notebook programmatically you don't get the same immediate visibility as in the interactive mode we have been working in here. And with that I think we are ready for the last demo.

Yes, that's true, we are ready for the last demo: some simple visualizations directly in a Spark notebook, so no Power BI today, and no pie charts. Well, maybe one, let's see. So I run this very basic SQL query, select star from salesorders, and here I can switch between the Table and Chart view. I will select Chart. "This looks very interesting and I can quickly get some insights from it." Of course I'm kidding, but I can customize this chart and then it will be much easier to understand what's going on. In this case I will switch the key from the sales order number to the item, for the values I will select quantity instead of the sales order line number, and finally I don't want averages, I want a sum, and I will click Apply. Now you see this makes much more sense. Of course this is not something you would show to your C-level management, but for quick data exploration and a quick understanding of the data distribution it's a great starting point: in literally 30 seconds you can understand what's going on. What else I wanted to show are some of the libraries that PySpark offers specifically for data visualization. One of them, and I think the most popular one, and Heini, please correct me if I'm wrong, is matplotlib.
matplotlib is commonly used for data visualization tasks. In this case I'm creating a simple DataFrame and then plotting it with matplotlib in the next code cell; let me show you and then we'll walk through it. First I convert the DataFrame to pandas using the pandas library, then I create a simple bar chart, defining what should be displayed on the x-axis and what should be displayed in the bars, and finally I show it. So this is a basic example of how you can use the matplotlib library. For more advanced things you can also customize the look and feel of these visuals in the Python notebook: in this case I'm including grid lines, I changed the color of my bars, and I added a title to my visual, and so on. It gives you a lot of flexibility in what can be done. One more thing I want to show you is something called a figure, again using matplotlib; in this case I'm returning revenue by year, and this is a figure generated by matplotlib. One other thing before I move to another library: here I am clearing the plot area, and yes, we do get a pie chart in this example, because I can also place two different visuals within one result, in this case revenue per year and orders per year, again specifying different things such as the legend for the pie chart and so on. Of course this is not a deep dive on data visualization with PySpark; we just wanted to give you a hint of what is possible and how quickly you can visualize data coming from a notebook. Finally, while the matplotlib library lets you create really complex charts of multiple types, it can sometimes require more complex code, and over the years many new libraries have appeared that are built on top of matplotlib to abstract its complexity and enhance its capabilities. One of them is called Seaborn, and I'll show you how you can use Seaborn to visualize the data. Here I'm importing Seaborn, and this time, I promise, it's not a pie chart, it should be a bar plot. ("I was worried, because you promised no pie charts.") Yes, and I like this one: different colors, different everything. Again I can enhance this code with additional customization options, adding grid lines, changing the opacity, and so on; there are many possible options. We encourage you, if you're interested in going into data visualization with Python in more depth, to find a proper session or proper learning resources for it. As I said, it doesn't give you the rich possibilities of a dedicated data visualization tool like Power BI, but it still provides a quick insight into what is going on with your data in the lakehouse. And I think that was everything I wanted to show you with these inline visuals; I'm handing back over to you, Heini.
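As a rough sketch of the kind of matplotlib and Seaborn code walked through in this demo, assuming the same salesorders sample data and the hypothetical column names used earlier (the exact aggregations and styling in the recorded demo may differ):

import matplotlib.pyplot as plt
import seaborn as sns

# Aggregate with Spark SQL, then bring the small result set to pandas
# so the plotting libraries can work with it.
df_sales = spark.sql(
    "SELECT YEAR(OrderDate) AS OrderYear, "
    "       SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue, "
    "       COUNT(DISTINCT SalesOrderNumber) AS OrderCount "
    "FROM salesorders "
    "GROUP BY YEAR(OrderDate) "
    "ORDER BY OrderYear"
)
df_pd = df_sales.toPandas()

# Basic bar chart with a few of the customizations mentioned:
# bar color, title, axis labels and grid lines.
plt.clf()                                   # clear any previous plot area
plt.bar(x=df_pd["OrderYear"], height=df_pd["GrossRevenue"], color="orange")
plt.title("Revenue by Year")
plt.xlabel("Year")
plt.ylabel("Revenue")
plt.grid(color="#95a5a6", linestyle="--", linewidth=1, axis="y", alpha=0.7)
plt.show()

# A figure with two visuals in one result: revenue per year and orders per year.
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].bar(x=df_pd["OrderYear"], height=df_pd["GrossRevenue"])
ax[0].set_title("Revenue per Year")
ax[1].pie(df_pd["OrderCount"], labels=df_pd["OrderYear"])
ax[1].set_title("Orders per Year")
fig.suptitle("Sales Data")
plt.show()

# The same bar chart with Seaborn, which builds on top of matplotlib.
sns.set_theme(style="whitegrid")
sns.barplot(x="OrderYear", y="GrossRevenue", data=df_pd)
plt.show()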
Great, thank you so much. We saw many different options there for visualizing the data. Before we head on to the summary, since we are starting to run out of time, there was one really good question: what is the difference between a Delta table and an external table? Actually, both were Delta tables. One was a managed table, where we let the lakehouse decide where the data is stored behind the scenes, and with the external table we specified where the data is stored, and then we can also see it in the Files section of our Lakehouse Explorer. They are both in Delta Lake format, so in that sense they are the same; only how the location is managed differs. We also got a very good comment that another option for wrapping code is to use a pair of brackets, so you don't have to use backslashes when you change rows. I've heard different preferences from people on which one they like, as usual.

Well, with that we get to the summary of this session. First we looked at how to configure Spark in Microsoft Fabric; then at using both notebooks and Spark jobs and a bit about the difference between those two; then at Spark DataFrames and how to use them to analyze, transform and write data; then at using Spark SQL to query data in tables and views; and lastly at the visualization side. So quite a few topics, and there are a lot of details to dig into in each of them. What will definitely help is going hands-on with the exercises in the learn module, and the learn module also includes a knowledge check section with questions, so you can challenge yourself on whether you can pick the right answer from the options given there. No prizes, but you will feel good when you get the green check marks and can say, okay, I learned something.

So one more time, all the references: here is the link to the learn module we went through today, and you can use the QR code or the link at the top. Remember that there is a next session tomorrow, Work with Delta Lake Tables in Microsoft Fabric, so you can continue this learning process. That session goes into more detail on Delta Lake and how you define tables and so forth, not with us specifically but with the next speakers, and it will be happening tomorrow. It builds very much on what we went through today, and you will continue to see Spark appear once in a while there as well.

And as a last reminder, remember the 50% exam discount that is out there. As we are starting to wrap up, now is the last chance to get your questions in; we will chat away here for a few more moments so we can check whether there are any last questions in the YouTube chat. I'd also like to thank our moderators, who did an amazing job in the chat, thank you so much to both of you, and of course thanks to the production team in the background. Definitely, it makes all of this so much easier, and even when presenting on a topic like this you always get to learn so many new things along the way. Absolutely. All right, just checking the chat to see if there are any last questions; we're seeing quite a few thank-yous out there. Thank you for joining! And thank you, Heini, for being an amazing co-presenter with me today. Thank you, it was a really nice chance to present with you, because we've been to quite a few events together and this was our first time presenting together, so always a pleasure.
That's true, always a pleasure. It seems there are no more questions out there, so I think we are ready to wrap up for tonight, or this afternoon; is it night or afternoon? Not sure anymore. But thank you everyone, thank you Nicola, it was a pleasure, and please do join the next session as well and continue this Fabric journey. Thank you so much for joining. Thank you, Heini, thank you everyone for joining, and have a good evening.
Info
Channel: Microsoft Power BI
Views: 4,543
Keywords: microsoft learn live, fabric, data-analyst, data-engineer, intermediate
Id: BejPbIMYWko
Length: 87min 1sec (5221 seconds)
Published: Wed Apr 17 2024