Learn Together: Use Apache Spark in Microsoft Fabric

Captions
Hey everyone, welcome to today's live stream on using Apache Spark in Microsoft Fabric. I'm super excited to be here. My name is Shabnam Watson, I'm a Microsoft Data Platform MVP and independent consultant, and today I am joined by Matias Fent. Matias, let's do a high five, and please introduce yourself. Yes, I'm Matias Fent, I live in Switzerland, I'm the CTO at Corporate Software, where we build great data products. I'm also a Microsoft MVP, and I'm really enjoying being here.

I also want to give a shout-out to our wonderful moderators in the chat right now, two fantastic Microsoft Data Platform MVPs, Kai and Osama Wahab Khan. This is a live session, so if you have any questions, post them in the chat and they will answer them. Matias and I are not going to see the chat; if there are questions they cannot answer and there is time at the end, they will pass them along to us.

A couple of slides before we get started. This is part of the Microsoft Learn series on learning Microsoft Fabric. If you want to follow along with the examples we show later, you can go to this link; the lab and everything you see here comes from that module. If you are planning to take DP-600, the newest Microsoft certification for Microsoft Fabric to become an analytics engineer, make sure you check out the Cloud Skills Challenge. There are discount codes available there; currently there is a 50% discount code you can take off the price of the exam. Even if you don't intend to take the exam, these challenges are a good way to get started, because they give you a very guided path to get familiar with Microsoft Fabric without getting lost in Microsoft Learn. There is also an exam cram session coming up for DP-600 that will go over the kinds of questions, how to manage your time, and some tips and ideas for the exam. And finally, don't forget about the Microsoft Fabric Career Hub. It's a great website for learning everything about Fabric; I have actually made it my homepage, because it has links to everywhere else I want to jump to, for example the Microsoft Learn section dedicated to Microsoft Fabric, which is what I use it for.

This is the second session of the Learn Together series. Make sure you attend as many of these as you can, especially if you're planning to take the exam; it's very important that you know everything included in all of these modules. If you already have Power BI in your organization, you can enable Microsoft Fabric on your existing Power BI capacities. If you don't have Power BI at all, you've never worked with it, and you want to get started with Microsoft Fabric, there is a link we'll show later, "upgrade to Fabric", that lets you sign up to use Fabric for 90 days at no charge whatsoever, even if you have never used Power BI before.

So, in this session we're going to talk about Spark: how to use Spark to ingest data, process it, analyze it, and visualize it.
We're going to talk about PySpark and Spark SQL, and the code we show here is common to all Spark implementations, even outside of Fabric, but our focus today is the Lakehouse in Microsoft Fabric.

So what is Spark? Apache Spark is an open-source parallel processing framework that lets you throw data processing jobs at a group of nodes working together, called a cluster. Every time you work with Spark you're working with a cluster: there is a main node that takes your job and distributes it across the other worker nodes. This process of distributing the work is abstracted away from you in most implementations of Spark. For example, in Microsoft Fabric, when you work with Spark all you have to do is create a notebook and run it; it gets attached to a cluster with Spark running, and that cluster gets the data processing job done for you. Spark supports many languages, including Java, Scala, Python (PySpark), and SQL, and in Microsoft Fabric it is fully managed for you.

In Microsoft Fabric, the way you work with Spark is that you create a lakehouse, which Matias is going to show later, and then you attach a notebook to it. Every workspace in Microsoft Fabric comes with a Spark cluster, and the workspace admins can manage the cluster settings under the Data Engineering/Data Science section of the workspace settings. They can specify things such as the node family, which is the type of virtual machine each node runs on; in most cases Memory Optimized is the right choice. They can choose the Spark runtime version and set more advanced Spark properties from there. Administrators can also add libraries: when you start with a workspace in Microsoft Fabric, most of the commonly used Spark libraries are already loaded, but if there are additional, more specialized libraries you want to work with, your admin can go into the library section and load those for you.

All right, over to Matias, who is going to show us how to create a workspace, a lakehouse, and some of the settings I just talked about.

Yes, let's go, thank you. We are here on the Fabric home page. You have the different sections on the left side, and at the bottom left you can change the view; we are currently in the Data Engineering view, which is fine for us. We have the workspaces on the left, where we can switch workspaces or create new ones. For this demonstration I have already created our workspace so everything runs smoothly, but for demonstration purposes I will create a new one. There's a big New Workspace button; I hit it and give the workspace a name, My Demo Workspace. You can add a description, you can assign the workspace to a data domain (if you follow the data mesh concept, this is really nice), give it a specific image, and attach it to a license mode. In our case, as Shabnam pointed out, we have a trial, so I run it on the trial license mode. If you already own a Fabric capacity or a Power BI Premium capacity, you can also select that one to run the workspace on, and you can switch the license mode of a workspace later if needed.
So that's basically it: we apply, and the workspace is created. Initially it is empty, but we have the New button, which is pretty long, and from time to time something new is even added to it. We have different options; usually you create a lakehouse or a warehouse first, and in our case a lakehouse is what we need. When we hit Lakehouse we need to give it a name, Demo Lakehouse, and hit create, and this creates our lakehouse right in our workspace in Fabric. You see we have the Files and Tables sections; both are empty. We will upload files later (or rather, I've already uploaded them for demo purposes) and we will create tables. When we go back to our workspace we see our Demo Lakehouse, and also that a default semantic model was created, which is basically the Power BI model, and the SQL analytics endpoint, where I can run SQL queries directly against the lakehouse, is also automatically generated. You see them in this graphical representation as child items of the Demo Lakehouse.

I can also create a notebook; we need a notebook to run our Spark code. I hit Notebook, and this time I'm not asked for a name, which is a slightly different experience: I simply land in the notebook view. The notebook has a default name, Notebook 1, and I can rename it at the top. So these are the basics of creating a workspace, a lakehouse, and a notebook. Back to Shabnam for some more theory, and then we will demonstrate again in our workspace.

Okay. Matias, can you also show the workspace settings for Spark, and how to find them? Yes, that's a great point, thank you for pointing that out. On my demo workspace I open the workspace settings, and we have this Data Engineering/Data Science section with the Spark settings. We have the default pool, a starter pool, assigned there, and we can create a new pool if we have different needs. This one is sized from 1 to 10 nodes, node size Medium, memory optimized, which is a great choice for most workloads, but if you have different needs you can create your own. I don't have the library settings in my demonstration tenant here, so I cannot show you that, but you have seen it on the slide. You have further options with the environment, where you can define a default environment, select the runtime version, and choose whether to use high concurrency, which means multiple notebooks can share the same Spark application, so the start time for each session is faster. And you have automatic logging, which is enabled by default, so metrics, parameters and so on are logged automatically. Those are the workspace settings; back to the theory.

Okay. When you want to run Spark in Microsoft Fabric, you can create a notebook and put your code in there. Notebooks give you the ability to run your code interactively against the cluster attached to the workspace. A notebook has multiple cells; it can contain comments and images, so it's a very nice environment for interacting with the data, and you can add notes in there for your team to look at and collaborate on.
So notebooks are a fantastic tool for looking at data and analyzing it. But you can also use a Spark job definition to run Spark code on a schedule. In other words, you can run Spark code interactively in a notebook or schedule it with a Spark job definition, and you can also run notebooks from a pipeline, which we're not going to show today; just know that it's another option.

The main data structure you work with in Spark is called a DataFrame. Think of a DataFrame as a table in memory: you load your data into a DataFrame, and from that point on you can further analyze it, display it, and transform it if needed. What you see on the screen is an example. Say you have a CSV file in the lakehouse and it has a header, for example ProductID, ProductName, Category, and ListPrice, followed by the actual values. You can load this into a DataFrame in your notebook with the read method: you give it the path to the file, the format of the file, and whether or not it has a header. This will use the first row as the header and load your data into a DataFrame. In this scenario Spark is actually inferring the schema; we didn't give it the actual schema, but if you need to specify one, you have that option too.

Now let's assume you have a very similar file, but this time without any headers. Before you load it into a DataFrame, you can define a schema. The code you see on the screen is very simple: you're basically just giving it the column names and data types, and this time when you load the file into the DataFrame you specify the path, the format, and the schema, and that there is no header. This will load the data into the DataFrame for you. One thing I want to mention is that because we are working in a Microsoft Fabric lakehouse environment, we can reference the path like you see on the screen: "Files" is a relative path that takes us to the Files section of the lakehouse. Every lakehouse comes with a Tables and a Files section, and that's what this is pointing to.

Once you have the data in a DataFrame, you can apply further transformations to it. For example, you can limit the number of columns: you can say I only want the ProductID and ListPrice columns, and assign that to another DataFrame. The first example here creates a price-list DataFrame from the DataFrame we just loaded. You can also chain these methods: the second example selects a couple of columns and then applies further transformations, in this scenario a where clause that limits the rows to mountain bikes. There is a huge number of these methods; for the full collection you can refer to Microsoft Learn or the Spark documentation. One of the other useful ones to know when you're getting started is groupBy, which is very similar to GROUP BY in SQL: we select a couple of columns, group by one of them, and apply an aggregate function after the group by, in this case a count, so this counts the number of products in each category.
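For reference, here is a minimal PySpark sketch of the operations just described, as you might run them in a Fabric notebook cell. The column names follow the product example on the slides; the file path is an assumption, and `spark` and `display` are provided by the notebook session.

```python
# Read a CSV that has a header row and let Spark infer the schema
df = spark.read.load("Files/products/products.csv", format="csv", header=True)

# Keep only two columns in a new DataFrame
pricelist_df = df.select("ProductID", "ListPrice")

# Chain methods: select some columns, then filter rows with a where clause
bikes_df = (df.select("ProductName", "Category", "ListPrice")
              .where(df["Category"] == "Mountain Bikes"))

# groupBy with an aggregate, similar to GROUP BY in SQL
counts_df = df.select("ProductID", "Category").groupBy("Category").count()
display(counts_df)
```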
Now we'll go back to Matias, and he's going to show some of this code in the notebook. Yes, let's have a look at these code snippets. I'm still in my Fabric window; I switch to my demonstration workspace, and you see I have the Demo Lakehouse here and my notebook. When we look at the Demo Lakehouse, you see that I have already uploaded three files, 2019.csv, 2020.csv, and 2021.csv. These are orders, and when we look into those CSVs you can see that there are no column headers in these files.

So let's go back into the notebook. A notebook is basically made of cells, and there are two kinds. The first one you see here is written text: it's in Markdown format, so it's very simple. If you use a hash you write a header, add a second hash and it's a heading 2, you can make text bold and so on, but you have to hit Edit to change that kind of cell. Then we have the other type, code cells. These you can edit directly: just place the cursor and write or delete whatever you want, and you also get help with the code, as you can see. You can convert a cell from one type to the other; if, for example, you have a Markdown cell but need a code cell, you simply hit that button, and vice versa. The main benefit is that you can run a code cell directly. I ran the whole notebook in advance, first because of time, and second so that we can jump around and everything is there and works. But basically, if I hit Run Cell, the code is executed and the result is displayed just below that cell. So when I create a notebook, write my Markdown, write my code, and run it, I end up with three kinds of content in the notebook: text, code, and the results of my code.

On the left-hand side I have the resources in my lakehouse, so you see the orders folder again with all three CSVs, and some other folders that we will create by code later; as I said, I've already executed the code, so you can see them right now.

Let's go a little bit into the code. First we read the CSVs, in this case the 2019 CSV, with the option header set to true, which is not correct here, because as you can see in the result the first row doesn't contain a header, it contains data. So the first data row has become our header, which is bad. Correcting that is very simple: we set the header option to false, and I also changed 2019 to an asterisk, which means we read all three CSVs, not just 2019. That looks better, we now have all the data, but we still have a problem: the column headers are not really helpful, _c0, _c1 and so on, which are auto-generated column names. As Shabnam showed before, if you have data formatted like this you need to provide a structure, so we define the columns: SalesOrderNumber, SalesOrderLineNumber, OrderDate, CustomerName, Email, Item, Quantity, UnitPrice, and Tax, with the appropriate types, so some strings, integers, a date, and two floats.
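A minimal sketch of that schema definition and the wildcard read, assuming the field names as read out in the demo and an orders folder under the lakehouse Files section:

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DateType, FloatType)

# Explicit schema for the header-less order files
order_schema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType()),
])

# The asterisk reads 2019.csv, 2020.csv and 2021.csv in one go
df = spark.read.load("Files/orders/*.csv", format="csv", schema=order_schema)
display(df)
```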
Then we read the CSV again, and instead of telling it how to handle the header we apply the schema to the data, and now it really looks great: we have all our data with the correct column names, SalesOrderNumber, SalesOrderLineNumber, OrderDate and so on.

To look at the data we use the display function, and the data itself is held in DataFrames; the DataFrame is the variable type that holds the data in PySpark. We display the DataFrame and get, in this case, the same result as before. So you see the data is in place, all three CSVs, and we can check that simply by the dates: you see 2021 here, and if I scrolled down I would see 2020 and so on.

Now we can explore the data in the DataFrame. Exploring means we want to get to know the data, understand it, and play around with it a little. The first thing I do is say: I only want to see two columns, CustomerName and Email. That's my table, customer name and customer email, but I want some numbers to get an idea of this data, so: please print me the customer count. Oh, there's an error, which shouldn't happen; maybe it's related to our backend pool, so let's execute it again. I want to print the customer count and the distinct count, which tells me how many customers I have in the table and how many distinct customers. And that's a bit weird; I think we have to go from the start. Well, at least you know this is really live.

One thing I really like and wanted to point out: even in your first cell, before you ran it, the Spark cluster had previously started in nine seconds, and this time I think it was around five seconds. The reason it can happen so fast is that Microsoft has these Spark clusters ready to go, and the moment you run the first cell in your notebook it gets attached to one of them; that's why it comes back up so quickly.

Exactly. And now our cluster is running perfectly: you see I have the customer count here and the customer distinct count. There are more customers in the table than distinct customers, which is what we would expect, and we see the table with CustomerName and Email displayed as we wanted. By the way, the error happened because I didn't delete the output folders; I will do that later while we're in the theory part and then rerun the notebook, so it doesn't affect us now.

Now we can play around with the data a little more. We can select the customer name and email and apply a where clause, which works very much like a WHERE clause in SQL: we say Item equals 'Road-250 Red, 52', so all customers that bought that item in that size. Then we again print the count and the distinct count, and in this case the numbers are the same, so these are unique customers, and we can display the customers that bought this amazing item.
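A short sketch of those exploration steps, under the same assumed column names; the item string follows what was read out in the demo:

```python
# Project two columns, then count rows vs. distinct rows
customers = df.select("CustomerName", "Email")
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())

# Filter with a where clause, SQL-style, then count again
customers = df.select("CustomerName", "Email").where(df["Item"] == "Road-250 Red, 52")
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())
```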
We can do even more. For example, we can get the product sales: we take the Item and the Quantity, group by Item, and sum up the Quantity. When we display that, you can see the quantity for each item; it's not ordered by quantity, but you get an idea of how many of each item we have sold.

When we move along a little, oops, that's a bit odd, a "not found" error, so I need to re-execute the notebook, and as Shabnam pointed out, this comes back really fast. Then we can keep exploring the data. We already have our data back, so let's move on and explore the DataFrame; you see everything is there, including the sum we just looked at.

Now we can go a little deeper and look at yearly sales: we select the OrderDate, but we extract the year from it, naming the column Year, which makes sense, group by Year, and get the count. The result is that we have the years 2019, 2020, and 2021 with the count of all sales in each year. That's pretty nice, and we can now start to modify the view: for example, not just create a Year column but also a Month column, so we extract the year and the month from the OrderDate. We even split the name field into first and last name: CustomerName contains both, and we split it by the space. Then we filter and reorder the columns, meaning we write them in the order we want to see them, and finally we display the DataFrame limited to the first five rows. You see the first five rows: SalesOrderNumber, SalesOrderLineNumber, OrderDate, Year, Month, FirstName, LastName, Email, Item, Quantity, UnitPrice, and Tax, all there.

So this is basically one way to explore your data. You have many more capabilities with Python, of course, but I think you get an idea of what you can do, how you can transform your data, play around with it, and build the understanding of the data you need for your work.
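A compact sketch of those transformations in PySpark; the functions come from pyspark.sql.functions, and the column order below simply mirrors what the demo read out:

```python
from pyspark.sql.functions import year, month, split, col

# Aggregate: total quantity sold per item
product_sales = df.select("Item", "Quantity").groupBy("Item").sum("Quantity")
display(product_sales)

# Yearly sales: count of orders per year
yearly_sales = (df.select(year(col("OrderDate")).alias("Year"))
                  .groupBy("Year").count().orderBy("Year"))
display(yearly_sales)

# Derive Year/Month, split CustomerName, then filter and reorder the columns
transformed_df = (
    df.withColumn("Year", year(col("OrderDate")))
      .withColumn("Month", month(col("OrderDate")))
      .withColumn("FirstName", split(col("CustomerName"), " ").getItem(0))
      .withColumn("LastName", split(col("CustomerName"), " ").getItem(1))
      .select("SalesOrderNumber", "SalesOrderLineNumber", "OrderDate", "Year", "Month",
              "FirstName", "LastName", "Email", "Item", "Quantity", "UnitPrice", "Tax")
)
display(transformed_df.limit(5))
```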
Well, thanks for that, and back to Shabnam for some more theory. Okay, thank you for showing us all that good stuff. By the way, if anybody's planning to take DP-600: again, it's super important that you learn everything in this module, including the PySpark and Spark SQL syntax and those transformations. These are things you're expected to know as a Fabric analytics engineer, so make sure you pay attention to this module for the PySpark and Spark SQL syntax.

Once you have the data in a DataFrame, you can write it back to the lakehouse. Often you want to persist it for further downstream processing, or for other downstream reporting tools to pick up and analyze. You do that with the write method on the DataFrame: you provide the mode, in this case overwrite, meaning we want to write over everything and start from scratch, and the format, in this case parquet, which is a standard format when you're dealing with large amounts of data, and then you give it a path. Notice that the path we are providing here is a relative path; it goes to the Files section of the lakehouse that Matias was showing you before.

Now, when you're working with large data, one of the things you can do to optimize downstream reporting and usage of that data is to leverage partitioning. Partitioning is a performance optimization method that splits your data into multiple folders, so downstream products or applications that reference the data only hit the folders they need. For example, what you see here is that we are writing the DataFrame partitioned by the Year column, we are overwriting everything, the format is parquet, and we're providing a relative path, again into the Files section of the lakehouse, although you could use a fully qualified path here. What you see in the picture is that this creates a folder called data, and underneath it you get subfolders named after the column you partition by, in this case Year, and its value. Spark looks for the distinct values of that column and partitions by them.

Once you have the data in the lake in the format you want, you can analyze it with SQL, provided you write it to the Spark metastore. Let's see how that works. You can create objects in the Spark catalog, which means they show up as tables and views in your lakehouse, and that enables users who prefer SQL to analyze the data. We are still working with a DataFrame: from a DataFrame you can create a temporary view, which is what the first piece of code here shows, taking a DataFrame and creating a temporary view of the products data. Note that this does not show up under the views in the lakehouse; it is only alive during your notebook session, so once you close your notebook it's gone. It's just temporary. The next one is what will actually persist: you can write the DataFrame as a table, and that shows up under the Tables section of the lakehouse. So far in the demo Matias has been focusing on the Files section, but you're going to see that when you write a DataFrame with save as table, it goes into the Tables section, and the format is Delta.

You can also write a table to an external path, which is the last example on this screen. The difference is that this is called an external table. When you write a DataFrame to the Tables section of a lakehouse, you are creating what's called a managed table: Spark has the definition, and the actual data also lives in the lakehouse in your workspace. When you create an external table, only the definition is in the catalog, and the actual data sits somewhere else. One difference you need to be aware of is that if you delete a managed table, the data is deleted as well, but if you delete an external table, the underlying data does not get deleted. Each of these has different use cases that you can use together to complete a data processing and transformation job.
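A minimal sketch of those save options, assuming the same orders DataFrame; the table names and the external path are illustrative placeholders, not the exact ones from the lab:

```python
# Save as parquet files under the lakehouse Files section
df.write.mode("overwrite").parquet("Files/transformed_data/orders")

# Save partitioned by Year: creates Year=2019, Year=2020, ... subfolders
df.write.partitionBy("Year").mode("overwrite").parquet("Files/partitioned_data")

# Temporary view: queryable with SQL, but only for this notebook session
df.createOrReplaceTempView("orders_view")

# Managed table: definition and data both live in the lakehouse, in Delta format
df.write.format("delta").saveAsTable("salesorders")

# External table: only the definition goes into the catalog; the data stays
# at the path you give (here a Files path purely for illustration)
df.write.format("delta").saveAsTable("external_salesorders", path="Files/external_orders")
```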
Once your data is in the Spark catalog in the form of tables, the nice thing is that you can use Spark SQL to analyze it. You can do that in notebooks in two different ways. You can use PySpark, the first example on the screen, and what we call the spark.sql API to pass a SQL statement to it and get the results back. The other way is to change the language in your notebook, either for the entire notebook or just for the cell you're working in, to Spark SQL, and then you can put SQL directly in the cell without passing it through the API. You'll notice the one at the bottom is a lot easier to read. The %%sql at the top is called a magic command, and it changes the language of one individual cell; so if your entire notebook is PySpark but in a few cells you want to use Spark SQL, you can do this, and vice versa. I think we're going back to Matias now to continue with the demo. Matias, back to you.

Yes, let's continue with using Spark to transform data files. You remember we stopped with this transformation: we created Year and Month columns, first and last name, reordered the columns, and displayed the first five rows; this is the code, and this is the result. Now we can take this DataFrame and write it to a file in parquet format. I already executed this cell, and when I refresh the Files section you see the transformed data folder, and within it an orders folder; when I expand it, you see three parquet files were created out of my data. So this means: write the data as parquet, with mode overwrite, which in this case doesn't matter because the files didn't exist before, and we print a success message that the transformed data was saved.

So this is our transformed data. The idea of what we've done in this notebook is: we loaded the data, explored it, transformed it, and stored it in a different location. This is a little bit like the medallion structure, right? You get some data in, you elevate it to the next level, silver or so, and then you store it somewhere. Just to prove that everything went fine, we can read exactly the data we just saved and display it, and unsurprisingly it looks exactly like the data we displayed before, because that's what we wrote into the file.

Now, you can also decide not to write the orders just as plain parquet but to store them in subfolders; this is called partitioning. I executed that already, and as you can see from the code, I partitioned by Year and Month, again with mode overwrite, into the partitioned data subfolder in the Files section of our lakehouse. Let's have a look at how that turns out: there is a folder Year=2019, Year=2020, Year=2021, and below each of them a folder for each month, because I partitioned by Year and Month, so 10, 11, 12 and so on, and inside each month folder there is a parquet file.

Why would we do that? First of all, you get some structure, some ordering, in the files, which is nice. The second thing is that we can now read the data taking advantage of that structure: we read it mostly the same way as the transformed data, but we can say, please pull out only the year 2021, any month. It's like a where clause directly in the data load, which means that if you select an intelligent pattern for partitioning your data, you can filter the data as you load it, and that makes things faster, smaller, and more responsive, because you filter the data directly when you load it into the DataFrame.
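A sketch of reading that data back, assuming the folder layout just shown (Year=... and Month=... folders under Files/partitioned_data):

```python
# Read back the transformed parquet files to verify the write
orders_df = spark.read.format("parquet").load("Files/transformed_data/orders")
display(orders_df)

# Read only one slice of the partitioned data: year 2021, any month.
# The partition folders act like a where clause applied at load time.
orders_2021_df = spark.read.format("parquet").load("Files/partitioned_data/Year=2021/Month=*")
display(orders_2021_df)
```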
When we display the data, not surprisingly it all equals year 2021, with every month included; we see November here, and if I scrolled down there would be a lot more, all from 2021. So this works, and it's really cool. We now basically have the same data living three times in the Files section: first in the orders folder, our raw or bronze data, then the transformed data, and then the partitioned data; the content of the last two is the same here, because it's just for demonstration.

So far we've used PySpark, which is great for loading, displaying, and transforming data, but we can also use SQL. SQL is a very well-known language; maybe some or all of you can write SQL, and I can too, so it's great to use it if you already have the command in your head. First of all, we can describe the table, which gives you the metadata definition of the table. It looks a little odd here because I increased the zoom factor for the demonstration so you can read it better, which breaks up the nice table layout. If we execute just standard SQL, selecting all columns from the salesorders table in the demo lakehouse with a limit of 1000, you see the result, and as Shabnam already explained, you can run it with the spark.sql method and load the result directly into a DataFrame. That's cool, but even cooler is using the %%sql magic command, because after that command you can just write plain SQL, a little bit as if you were in SQL Server Management Studio or the SQL query tool of your choice. All the great standard SQL constructs are there: select the year of OrderDate as OrderYear, the sum of UnitPrice times Quantity plus Tax as GrossRevenue, from salesorders, group by the year of OrderDate, order by OrderYear. The query is even a little longer than the result, but the result is highly aggregated: the order year with the gross revenue per year, produced by SQL with a magic command. So that's a capability you have with PySpark and SQL, either by embedding SQL like the first example or by running SQL with the magic command like this one; the result, of course, is the same.
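A sketch of that aggregation as it would look through the spark.sql API; the table name salesorders is assumed to match the saved table:

```python
# Passing SQL through the spark.sql API returns a DataFrame
sql_df = spark.sql("""
    SELECT YEAR(OrderDate) AS OrderYear,
           SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue
    FROM salesorders
    GROUP BY YEAR(OrderDate)
    ORDER BY OrderYear
""")
display(sql_df)

# With the %%sql cell magic you would instead put the same statement
# directly in its own cell, with no Python wrapper around it.
```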
So let's get some new theory from Shabnam. Okay, thank you for showing that part. By the way, in the cells Matias was showing, you can either use the magic command or the dropdown at the bottom-right corner of each cell to change the language you're working with in that cell.

All right. Once you have the data in a DataFrame, you can visualize it using different methods. First of all, there is a built-in method in the notebooks themselves: every time you display a DataFrame you see the results, and you can also switch to a chart tab where you get a very simple chart of the data. There is some flexibility there, with some boxes that let you change the format of the chart, what goes in the rows and columns, and things like that. But if you want to do more advanced visualizations, you have access to all of the Python libraries that are open source and available; for example, one of the common libraries people use for visualization is Matplotlib, Seaborn is another one, and with them you can create some really fantastic-looking visuals. So, Matias, back to you to show us some of these visualizations in the demo.

Yes, I will, and as you pointed out, we have this small dropdown here to switch the language. Well, I just talked about SQL and the magic command, so let's do some SQL: select everything from salesorders. This is our data, we're a little used to it by now, and I can also display the result as a chart. If you hit the chart item instead of the table, it's displayed as a chart, and it looks nice but not very useful in this default form, so we need to adjust it a little. We leave it as a bar chart, set the key to Item, the values to Quantity, the series group is fine, and the aggregation is Sum. Apply that, and that looks much better, much more useful. You see the different items with the sum of the quantity, and you can even save that as an image, JPEG, PNG, or SVG, so if you need it for a PowerPoint or something like that, it's really handy. That's the built-in visualization; you can switch back and forth and compare, and if you have questions about the data you can go back and look at the raw data.

Now let's get a bit more out of it. We have the SQL query here, year of OrderDate as OrderYear, sum of UnitPrice times Quantity plus Tax, you know it by now, and this is a very raw table visualization with the order year and gross revenue. Let's display that data in a nicer bar view. For that we can use Matplotlib. Matplotlib is pretty much the standard; it's a little more complicated than other libraries, and we'll look at a simpler one, Seaborn, later. First of all we need a pandas DataFrame, not a Spark one, which is simple: we take the Spark DataFrame and convert it to pandas, and then we can go ahead. We define the axes: the x-axis is the order year, and for the values we use the gross revenue, so we get 2019, 2020, and 2021 with a scale on the x and y axes. Now we can customize the chart: we add a title, Revenue by Year, so everybody knows what's in the chart, we label the x and y axes with Year and Revenue, and we adjust the color, the line style, the line width, and the alpha, which makes it a bit transparent. This is the result, and it looks much more appealing, I think. We can do even more: here you see the revenue with a small rotation on the 2019, 2020, 2021 labels, so you can compare them. All of that is possible with Matplotlib, and we can even create more than one chart: here we have a pie chart together with a bar chart. With Matplotlib this is called subplots, and you describe what you want, with the titles and the chart types, a pie chart as you can see here and a bar chart, and then add a title to the whole figure: Sales Data, with Revenue by Year and Orders by Year.
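A sketch of that Matplotlib flow, assuming the aggregated query from above; df_sales is an illustrative variable name, and the colors and figure sizes are arbitrary choices, not the demo's exact values:

```python
import matplotlib.pyplot as plt

# Matplotlib works on pandas data, so convert the Spark result first
df_sales = spark.sql("""
    SELECT YEAR(OrderDate) AS OrderYear,
           SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue
    FROM salesorders
    GROUP BY YEAR(OrderDate)
    ORDER BY OrderYear
""").toPandas()

# A single customized bar chart
plt.clf()
plt.bar(x=df_sales["OrderYear"], height=df_sales["GrossRevenue"], color="orange")
plt.title("Revenue by Year")
plt.xlabel("Year")
plt.ylabel("Revenue")
plt.grid(color="#95a5a6", linestyle="--", linewidth=2, axis="y", alpha=0.7)
plt.xticks(rotation=45)
plt.show()

# Two charts in one figure with subplots: a bar chart and a pie chart
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].bar(x=df_sales["OrderYear"], height=df_sales["GrossRevenue"], color="orange")
ax[0].set_title("Revenue by Year")
ax[1].pie(df_sales["GrossRevenue"], labels=df_sales["OrderYear"])
ax[1].set_title("Revenue share by Year")
fig.suptitle("Sales Data")
plt.show()
```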
Now you may ask yourself: hey, I have really great tools like Excel, why should I do this here? Several reasons. First of all, it's a really handy way to get into your data and to see changes in your data, because sometimes it's much easier to understand things when you visualize them. The second reason is that the graphics are created each time the notebook is run, and as Shabnam explained, you can also run it on a schedule, in the background, so if you need current visualizations you can just pick them up here and put them into a PowerPoint or Word document or whatever. And since everything is described by code, there's a high level of consistency: the charts will always look the same, so it's easy to make, say, a year-to-year comparison between charts. That's quite different from Excel, where a lot of people can edit the workbook; a code-based notebook is aimed at a slightly different audience.

I mentioned that Matplotlib is quite complex; Seaborn is a simpler library you can use. You see we have essentially two lines of code, or one if you skip the show, to create this bar chart: x is the order year, y is the gross revenue, data is the sales DataFrame, that's it, and we have this nice bar chart for 2019, 2020, and 2021. If you compare that to the Matplotlib code, this is really nice to read. We can do more, as you can imagine: for example, we apply a theme with a white grid. Compare it: we didn't have a grid before, and now we do, so it's easier to compare the bars to each other. And we can create a line chart, of course. It's really simple to use and really simple to create different types of charts. It depends on your background: if you're used to Matplotlib, that's great; if you're new to the topic, maybe Seaborn is easier to get into. So that's the visualization part of the demonstration; I hope you enjoyed it.
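The Seaborn equivalent, again as a sketch against the same pandas DataFrame from the previous example:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One line for the chart itself
sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()

# Apply a theme with a white grid, then draw a line chart instead
sns.set_theme(style="whitegrid")
sns.lineplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()
```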
And back to Shabnam. Thank you so much for showing that. By the way, you didn't actually have to install Seaborn or Matplotlib; all you had to do was the import, because they're pre-installed, so that's all you need in order to use them. We didn't show this in the module, but you can also embed a Power BI report right there in your notebook, which is really cool. And like Matias said, these visualizations help you as you are preparing the data, so that you understand your data better, and you can also use them for exploratory analysis if that's the goal.

All right, we're going to do a little knowledge check to see how much you were listening, and hopefully we'll see mostly correct answers. If you scan this QR code, or type in this link, which I believe our moderators are also putting in the chat, you can get to the poll and participate.

The first question: you want to use Apache Spark to explore data interactively in Microsoft Fabric; what should you create? A, a Spark job definition; B, a Data Factory pipeline; C, a notebook. The keyword here is "interactively". We'll give it a couple of seconds to see the answers come in. Okay, we are getting a mix of answers. We want to run the code interactively, not on a schedule, and that should help you choose between A and C; interactively means we're not going to schedule it in any way. So let's see what the answer is: the answer is C. You use a notebook to write code and explore data interactively; you can schedule a notebook to run as part of a pipeline, or you can schedule Spark code with a Spark job definition.

Next question: you need to use Spark to analyze data in a CSV file; what's the simplest way to accomplish this goal? Again, the keyword is "the simplest way"; there are many ways, but we're looking for the easiest. A, load the file into a DataFrame; B, import the data into a table in a warehouse; C, convert the data to parquet format. Some answers are coming in, so let's see what the answer is: the answer is to load the data into a DataFrame, like Matias was showing us. You could also load it into a table in a warehouse, but that's not the simplest way. Again, these are examples of the things you need to pay attention to if you take the exam: the clues in the question point you to the right answer.

All right, awesome, we got some correct answers there. The last question: which method is used to split the data across folders when saving a DataFrame? A, splitBy; B, distributeBy; C, partitionBy. This was in one of the demos Matias showed as well. We're getting some votes in, and I'm going to reveal the answer: awesome, everybody got this right, great job. It is partitionBy. It's an optimization method, so that your reads further on use less IO and are more efficient and faster.

That was the last of our questions. To summarize what we talked about: we covered how you can use Spark in a lakehouse in Microsoft Fabric, how you can run code interactively or on a schedule, we showed you some PySpark code, some Spark SQL, and finally some visualization techniques. If you're planning to review what we showed, make sure you go back to the Learn Live link on the screen to finish the exercises on your own, and don't miss the next session in the Learn Live series: Work with Delta Lake tables in Microsoft Fabric. That's tomorrow, at about the same time this session started, so make sure you continue with it, because it focuses on the tables and builds on this session.

And this shows us tomorrow's session again, plus one more slide, and here we go. As a reminder, if you do not have access to Microsoft Fabric at work, you can go to this link and sign up. This gives you a development environment if you don't have anything else to work with, which is actually what Matias and I were using to do the demos; we used this exact same method to create our demos for you.

That brings us to the end of this session. Were there any questions we need to answer? We have a few minutes left. By the way, I want to give a shout-out to Matt Clemens; he's been our wonderful producer, keeping us on track, and he's been in the chat this whole time, helping and posting links.
So take a moment to say hello to Matt Clemens, and also a huge thank you to our moderators, Kai and Osama. Yes, thank you so much. That's it, Matias, we did it; let's do the high five. All right, awesome.

Okay, looks like we're getting one question. Matias, are you going to read it, or do you want me to? "Interesting to know that you can embed Power BI reports in your notebooks; how do you do that, by calling a specific library?" Yes, there's a specific library you can use. I don't have the link handy to show you, but it's super easy to do. And not to mention that the whole workspace concept in Fabric is like the evolution of Power BI workspaces, so Power BI is really tied into the whole Fabric idea.

Next question: "Is there any course you can recommend to become a PySpark ninja?" That's a difficult one; I'm not a PySpark ninja myself. The beauty of Microsoft Fabric is that people who are very familiar with SQL can easily work with Spark SQL, so the amount of PySpark you need to learn to get started with Microsoft Fabric is not a lot. Really, what you see in the Microsoft Learn modules will get you started, and then you can gradually build from there. That's how I would approach it: as the need arises, look up the method and add it to my knowledge base. Otherwise, Microsoft Learn has a lot of PySpark code examples too, especially in the data science section. Yes, I would emphasize that: on Microsoft Learn you have a huge amount of resources where you can start, learn, and go deeper into all of this, Python, PySpark, and the whole Fabric thing. It's really good content.

We've got even more questions: "What are the things to focus on in this module for the exam?" Everything, everything in this module. The exam, by the way, is not easy; it's difficult, and you need to study for it before you go into it. Make sure you at least go over all the modules. The exam is open book, so you can have Microsoft Learn open on the side in the exam environment; you just need to know how to navigate to things, so make sure you remember how to get to, for example, this module if you get a question that involves syntax you're not sure about. Pretty much everything you see in this module, questions about writing a DataFrame, saving a DataFrame, the data transformations, can come up; you can get questions on any of those topics.

Another question: "How do directories work in a Fabric notebook? What is the default directory when I upload a file without specifying a lakehouse file directory?" Well, the default is the Files section, as we have seen. Usually you create subfolders to keep things a little bit organized, but you don't have to. And I think the question was about upload, not download: if you upload, files end up in the Files section. Yes, uploading files is one of the methods to get data into a lakehouse, and when you do that, it brings up the Files section that you upload into.

One more question, let's see: is there a difference with Spark between a lakehouse in Fabric, Synapse, and ADF?
One big difference, maybe not a difference so much as a performance optimization that I personally love every single day I start a Fabric notebook, is that the pools start much faster. Currently, if you work with Azure Synapse Analytics, to get a notebook started you may have to wait two, three, sometimes four minutes, but when you're working in Fabric it's often less than ten seconds, so that's much more of your day back to you. Other than that, the version of Spark I believe is the same, and I believe even with Azure Synapse Analytics, when you create those pools, you can choose the latest version, right, in your experience? Yes, and we have seen that in the workspace settings, where you can select the runtime version. Yes, exactly.

Let's see if there are more questions. "Is it possible to run Python in notebooks? Is it possible to combine Python with PySpark?" I'm not entirely sure about that; it looks like Matias is looking it up. I think the version of the Python language that you work with in Spark is called PySpark, so maybe that clarifies it. Yes, exactly, and if we have a look at the language switcher, we indeed have Scala, Spark SQL, R, and PySpark, so we stick with PySpark here. And on that note, notice that the SQL is also not T-SQL but Spark SQL, so there are some differences there. For example, you cannot do SELECT TOP 100 *; you say SELECT * FROM and then use the LIMIT keyword at the end. So there are some minor differences here and there.

Do we have even more questions? Oh, this is a good one: "If I pass the beta exam, should I retake it when it is generally available?" No, if you pass it, you don't need to retake it. Basically, what happens with the beta is that it has more questions than will be in the actual GA exam, and once the exam goes GA, those questions are finalized, so you are evaluated on the same set of questions that people will take during GA. Based on that subset of questions, if you have reached the 700 minimum, you pass the exam. Some of it depends on which questions stay and which ones are removed during the beta evaluation, so fingers crossed, we'll all pass. Good luck.

Okay, let me bring back the summary slide; this is where Matt showed me how to go into the index of the slides and jump to the summary, so I'm trying to do that. Here we go. Everybody, thank you so much for joining us. Make sure you go to the Microsoft Fabric Career Hub, check all the links that are there, and check the rest of the sessions in this series. This session is recorded, so if you need to rewatch it, you can see the link at the bottom and watch it on demand later. That's all; again, thank you everyone for joining us, and hopefully this session helped you. We'll see you later. Yes, enjoy the learning path. All right.
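To pin down the T-SQL versus Spark SQL point from the Q&A, a minimal sketch; the table name salesorders is assumed, as in the demo:

```python
# T-SQL would use SELECT TOP 100 *; Spark SQL uses LIMIT at the end instead
first_100 = spark.sql("SELECT * FROM salesorders LIMIT 100")
display(first_100)
```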
Info
Channel: Microsoft Power BI
Views: 6,018
Keywords: microsoft learn live, fabric, data-analyst, data-engineer, intermediate
Id: bqeiOFhUGrA
Length: 70min 29sec (4229 seconds)
Published: Wed Jan 24 2024