Process Excel files in Azure with Data Factory and Databricks | Tutorial

Video Statistics and Information

Reddit Comments

!remindme 1 week

👁︎ 2 · 👤︎ u/fml86 · 📅︎ Jul 21 2020 · 🗫︎ replies

I have been watching your videos on Azure, and your videos were of great help. Thank you so much for doing this.

👁︎ 1 · 👤︎ u/ravikarrii · 📅︎ Jul 23 2020 · 🗫︎ replies
Captions
Hello guys, this is Adam, and today I want to answer one of the most commonly asked questions when moving your data to the cloud: how do you process Excel files in Azure? If you want to find out, stay tuned.

Today we will focus on building ETL processes and leveraging out-of-the-box functionality of Azure to process Excel files. But let's start with the business justification: why Excel files, and why do we want Excel processing capabilities in our ETL solutions? The reason is fairly simple. Excel is one of the formats most commonly used by business users, because of its ease-of-use features and its collaboration options, and because it is supported as an input source by many platforms. It also lets you connect to many data sources and process and visualize your data, which, together with its rich analytical capabilities, makes it one of the most widely used analytical tools on the market. It is also one of the most common formats for extracts from external tools and systems: if you are using a tool provided by an external company and you want to export your data, you will very often get it as an Excel file.

While business users love Excel files, developers don't, because Excel poses a lot of challenges. Excel is a free-form editing tool, so it allows users to put data anywhere, in any format. In addition, when your users extract data from external systems, those systems sometimes do not follow the standard Excel formatting, which gives you extra trouble when you try to parse the files. When processing Excel files you should be ready to handle the different internal formats: XLS, which is the Excel 97 format; XLSX, which is the Open XML version; XLSB, which is the binary format; and XLSM, the macro-enabled format. If you are building your own Excel parsing solution, be ready for all of them. Most libraries on the market parse XLS or XLSX, or both, but very rarely XLSB or XLSM.

When it comes to Azure, quite a few services can parse Excel files. Data Factory recently got native functionality to ingest Excel files in your pipelines. You can also import SSIS packages into Data Factory, so if you are building your entire ETL in SSIS you will be able to parse Excel files out of the box, although if you just need a simple extraction, SSIS packages might be quite a big overhead. Additionally, there are Databricks and HDInsight, big data technologies that let you import external libraries to parse Excel files, and because you write the code in those solutions yourself, you can extend them with custom logic. Beyond that, any service in Azure that allows you to develop custom code in Java, .NET, or Python - App Services for web applications, Functions for serverless programming, any container service, or the machine learning service - lets you leverage custom libraries to parse Excel quite effectively; those are probably best when used with real-time web applications. Finally, there are Logic Apps, which have standard connectivity to SharePoint and OneDrive and can parse Excel files located there out of the box.

Our focus for today is Data Factory and Databricks. Those services give you some quick wins, parsing Excel files quite effectively without much time investment. Before we move on, one personal recommendation: if you can, avoid Logic Apps at this time. Logic Apps have native connectivity to SharePoint and OneDrive and can leverage Excel Services Online to work on Excel files, but they are quite quirky - I think that is the best word to describe it. You will get some quick wins, but if you need to extend the functionality you will definitely run into development challenges that take a lot of time to fix. At this moment I would say: if you can avoid Logic Apps for parsing and working with Excel files, avoid them.

Let me start with Data Factory. Data Factory is one of those quick-setup solutions: within just a couple of minutes we will have it provisioned, and we will be able to parse Excel files using no code at all - I think it is the only no-code solution of the ones I listed. Because Excel parsing in Data Factory is quite a new feature, it is currently only supported as a source, so you cannot use Data Factory to export Excel files at this time. Of course, as this video ages, check whether that is still the case. Keep in mind that Data Factory targets simple extraction: if you have very complicated Excel files, Data Factory might not be right for you, but for simple extraction you can use it quite effectively. When it comes to supported data sources, most of the standard connectors for flat file systems are supported, whether that's Azure, Amazon, Google Cloud, or plain file systems like FTP. As for data flows - if you are using data flows to create a true ETL solution in Data Factory - you first need to move your data to either Blob Storage or Data Lake.

With that said, we can move to our first demo. Let me switch to the Azure portal and show you what I have. I have three services provisioned for today: a storage account, Data Factory, and Databricks. If you have never used any of these, feel free to check my introductory videos on them; I'll link them in the description below and on the screen. Let me also show you the files we will be working with today. I have a few files for the demo, all called "data", because they are exactly the same file saved with different extensions: XLS, XLSB, XLSM, and XLSX. Opening the file, you can see what's inside: four tabs - a Cars tab with some car data, a Movies tab, a Planes tab, and one tab that is a bit more complicated, containing two tables in the middle of the sheet - because I also want to show you how to handle the scenario where you have multiple tables in a single sheet.

Let's go back to the Azure portal and start by uploading the data to the storage account. Inside the storage account go to Containers; I already created a "demo" container. Inside the demo container select Upload, click the icon to pick the files from your drive, and upload all four files. Now that the files are uploaded we can start working. Back in the portal, select Data Factory and open Author & Monitor, the standard Data Factory editing tool. In Data Factory, always go to the Manage tab first, because the first thing you need to do is create a new connection under Connections > Linked services.
You need a connection to Blob Storage: select New, choose Azure Blob Storage, call it anything you like - in my case "myblob" - and pick your storage account from the subscription. If the connection test succeeds, hit Create. Once the linked service to Blob Storage is established, go to the Author tab, pick Datasets, and select New dataset. Select Azure Blob Storage again, to match the dataset with the linked service, and hit Continue. On the Select format tab you can now pick the format, and the option that was recently introduced is Excel. Select Excel, hit Continue, give it the name "myexcel", select the linked service (the connection to Blob Storage), and choose the path - I'll use the file picker: select the demo container and the data.xlsx file, then hit OK. If everything worked correctly, the Sheet name dropdown should list all the sheets in that workbook: Cars, Movies, Persons, and Planes. For now I'll select Cars, and since my data has headers I will tick "First row as header". Because I will be reusing the same dataset for multiple sheets, I'll set Import schema to None and hit OK.

Once the dataset is created we can close the properties panel on the right-hand side to get a bit more real estate, and I'll also hide the panel on the left-hand side so we get as much space as possible. We are now opening the data.xlsx file on the Cars tab. To verify the connectivity and proper Excel formatting, hit Preview data. If the file is formatted properly you should see a preview of the data from the sheet, which means everything is correct and you can start using the file immediately. If you want, you can select a different tab, like Movies, and preview that data as well.

One thing I want to show you: switch to the Persons tab. If you remember, the Persons tab in my Excel is the more complicated one, with two tables. If you hit Preview here, you will see that it more or less grabbed the data - it was able to detect the table in the middle of the sheet - but it grabbed a few more rows than necessary, which is not ideal. In the end, Data Factory targets very simple extraction, so with these more complicated cases you may run into issues. Here you can use the Range property: the range parameter lets you specify exactly where your data is, which in this case is from C9 to E13. If you type C9:E13, the table is parsed very nicely, without any extra rows. You can see how easy it is to parse Excel files using Data Factory - I literally didn't have to write a single line of code.

Let me close that tab and, just for the proof, finish this example by creating a small pipeline in which I copy my data. I'll use the Move & transform > Copy data activity. As the source I select "myexcel", and as the sink I create a new dataset - one that outputs the data back to Blob Storage. As you can see, Excel is no longer on the list, because Excel output is not supported right now, so I will use a CSV file instead. Hit Continue and call it "mycsv".
Select the same linked service, tick "First row as header" because we want headers in the output, select the demo container, and call the file adfoutput.csv. Hit OK, and remember not to import the schema here either. Select OK and run Debug. Once it's running you don't really have to wait for it to finish - you could go to Blob Storage immediately to review the result - although it's good to wait and see whether the extraction succeeded. Once you see the success message, go back to Blob Storage: open the resource group, the storage account, the demo container, and you should find adfoutput.csv. If you click it and hit Edit, you will see your Persons data exported as a CSV. That was fairly simple.

Let me go back to the presentation and start talking about Databricks, because Databricks is a good option when you need the more complicated scenarios handled - Data Factory covers only the simple cases and cannot parse all the files. For instance, if I go back to Data Factory for a second, open the "myexcel" dataset, change the file from data.xlsx to data.xls, select the Cars sheet again, clear the range (remember, it detects where the data is automatically), and hit Preview, it works flawlessly - we can still parse the file. But if you change from .xls to .xlsb and hit Preview data, it fails: even when you click the sheet dropdown, no sheets are detected. The same goes for the macro-enabled file (.xlsm) - it will not detect the sheets, because those formats are not currently supported in Data Factory.

Because Data Factory targets simple extraction and simple Excel features, I decided to add Databricks to the list. With Databricks you can import custom libraries and handle the more complex Excel cases. The setup is still quite quick: you can provision Databricks within a couple of minutes, and with just a couple of minutes more you create a cluster, add a library, and start working on your Excel files immediately. One of the best things about Databricks is that it is commonly used for ETL on the market right now - whether on Azure or any other cloud, Databricks is really becoming the standard for ETL these days. It also allows you to extend things with code, so you can handle Excel files with more difficult formats. And lastly, there are plenty of libraries: this is a multi-language tool, so if the library I'm about to show you doesn't work for you, nothing stops you from importing another one and trying it out. For today I prepared a library called crealytics (spark-excel), available for Scala 2.12 and 2.11, and that's our demo for today.

For that purpose let's go back to the Azure portal, open the Databricks service, and launch the workspace - this is where you manage your Databricks environment. Once in the workspace, the first thing we need to do is create a new cluster. Simply click the New cluster icon under common tasks and give the cluster a name; I'll call it "demo-cluster". I'll pick Standard mode and leave pretty much everything at the defaults, including the 6.5 runtime, which is Scala 2.11. I'll disable autoscaling to keep costs down.
I will also change Terminate after to 60 minutes to save some money, set the workers to one - because I want to save on the virtual machines running underneath - and change the virtual machine size to the smallest available, which is F4s. This way I can save some money while using Databricks. After about five minutes the cluster is created. I can select it, go to the Libraries tab, and install a new library: in this case we're sourcing from Maven and passing the full library package id. Hit Install and wait a couple of minutes - probably one or two - for the package to install.

Once the package is installed we can go to the workspace and start creating our notebook. Go to your user and create a new notebook - this will be a Scala notebook called "demo" - and hit Create. Inside the notebook, the first order of business is mounting the Blob Storage so we can use it as a local drive. I prepared a sample for this, which is basically a copy-paste from the documentation: we need to provide the container name, the storage account name, and a key in order to mount the Blob Storage under the /mnt/myblob path. The container name can be grabbed from the storage account under Containers - in this case it's "demo" - so paste it in. We also need the storage account name, which is the name right here, and lastly the key, which is the storage account access key. Remember to later put that key inside Key Vault and secret scopes - don't leave it as plain text like this - but for demo purposes this is fine. Once this is done, press Shift+Enter to run the block and mount the storage. After about 20 seconds the storage should be mounted and we can verify whether it's working. I already have a command prepared for that: I'll insert a new cell and paste it in - I usually just use the dbutils file system utilities to list the files stored on the blob. The previous command finished in about 25 seconds and returned true, so it was successful; running the listing command to verify connectivity, we see all our files on the blob, so the connection is established and we can start working.

Let me go back to the presentation and talk about the functionality of this library. We always start with reading, and the reading capability is fairly simple: you call spark.read, pass the format parameter with the full path to the library, add one option - header, which is required and has to be either true or false - and then call the load function with the path on the blob. Let's see how this works. Back in the Databricks notebook, open a new cell and paste in the exact command you've seen on the screen. I already have a sample prepared: spark.read, format com.crealytics.spark.excel, option header (always required; in my case the tables have headers, so it's true), and load pointing at my mounted storage and data.xlsx - exactly the file we've been working with this entire session. Once you load this into a data frame called df, display that data frame to confirm everything is working. When we run this, we should see our Excel data on the screen.
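For reference, here is a minimal sketch of the mount and the basic read described above. The storage account name and the account key are placeholders (in a real setup the key should come from a secret scope); the container, mount point, and file name follow the demo.

```scala
// Minimal sketch, assuming a hypothetical storage account "mystorageaccount",
// the "demo" container, and the crealytics spark-excel library installed on the cluster.
val containerName = "demo"
val storageAccountName = "mystorageaccount"
val storageAccountKey = "<storage-account-key>" // placeholder; use a secret scope in practice

// Mount the blob container as a local path.
dbutils.fs.mount(
  source = s"wasbs://$containerName@$storageAccountName.blob.core.windows.net/",
  mountPoint = "/mnt/myblob",
  extraConfigs = Map(
    s"fs.azure.account.key.$storageAccountName.blob.core.windows.net" -> storageAccountKey
  )
)

// Verify the mount by listing the uploaded files.
display(dbutils.fs.ls("/mnt/myblob"))

// Basic read: the format is the full package name, and the header option is required.
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .load("/mnt/myblob/data.xlsx")

display(df)
```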
As you can see, this is our Cars data - the first tab in the Excel file. What you have probably noticed is that we never specified a sheet name: by default, if you don't specify one, the first sheet saved in the file is the one that gets loaded. Additionally, if you import the crealytics namespace, you can use a slightly cleaner notation to load your files. This is purely syntactic sugar, so it's your choice whether to use it. It is very similar to what we had just now, except we first import the crealytics package and then call spark.read.excel - we no longer pass the format, it is now the function name, the same way you would call csv or json, but this time it's excel. The header option becomes a parameter of that function, and we call load as before, passing the path on the blob. Display the data frame and you see exactly the same data, which means the two notations are equivalent; it's up to you which one you like more.

Let's go back to the presentation. When reading, you may also want to control how you read and which tabs you read. For that you use references: a reference lets you read a specific sheet and a specific range. You do this through an option called dataAddress, where you specify the sheet name and range. Let's see that in action. Add another cell, grab the same code as in the previous example - spark.read.excel as before - and the new thing we add is one more option (options are chained next to each other): dataAddress, with the sheet name "Planes" and A1 as the starting cell for where the data lies. If we look back at the Excel file and go to the Planes sheet, you can see the data starts in the upper-left corner, in cell A1 - that's why we pass A1 as the starting cell. Hit Ctrl+Enter or Shift+Enter, and as you see, we loaded the data from the Planes tab very easily. So using dataAddress you can point at a specific range in a specific sheet. If you don't know the sheet name, you can also read by sheet number: pass the index of the sheet within the file, like 0 for the first sheet, along with the cell where your data starts. It's as easy as going back to our script and changing "Planes" to 0 to read the first sheet of the workbook - in this case, our Cars data.

Going back to the presentation, we can also reference data by table name. This is a pretty cool feature, because in the end we have those more complicated sheets with multiple tables. If I go back to my Excel file and open the Persons tab, notice there are two tables. The first one is an Excel table: if you go to Table Design you will see it is called "Persons", and the second one is called "DemoTable". If you use any Excel function - say, VLOOKUP - and click on that entire table, notice what the formula bar at the top shows: Persons[#All]. That means I can use this as a reference.
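A short sketch of the two read notations and the dataAddress option covered above, assuming the demo workbook with its Cars, Movies, Planes, and Persons sheets and the /mnt/myblob mount from earlier:

```scala
// Sketch of the crealytics "syntactic sugar" notation and the dataAddress option.
import com.crealytics.spark.excel._

// Sugar notation: .excel(...) instead of .format("com.crealytics.spark.excel").
// No dataAddress -> the first sheet saved in the file (Cars) is loaded.
val dfDefault = spark.read
  .excel(header = true)
  .load("/mnt/myblob/data.xlsx")

// Read a specific sheet and starting cell via dataAddress.
val planes = spark.read
  .excel(header = true, dataAddress = "'Planes'!A1")
  .load("/mnt/myblob/data.xlsx")

// Read by sheet number, as shown in the video (0 = first sheet in the workbook).
val firstSheet = spark.read
  .excel(header = true, dataAddress = "0!A1")
  .load("/mnt/myblob/data.xlsx")

display(planes)
```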
So if I copy this reference and go back to Databricks, instead of passing a sheet name and a range in dataAddress, I can use the table name. When I do that and hit Shift+Enter, notice that I read the table without having to specify where it starts - I just used the table reference, which is a very neat feature. If I want to read the second table, called DemoTable, I simply grab its name and put it there - DemoTable, all rows, with headers - Shift+Enter, and we're reading the other table. That is how easily you can use table references and parse multiple tables within a single sheet using this library. Please note, though, that not all file formats support this feature: XLSX and XLSM both support referencing tables, but with XLS it will fail. With XLS you can still use the other notations, like the sheet notation 'Planes'!A1, to read the file. So this library supports multiple formats, except the binary one - unfortunately XLSB is currently not supported.

If you want to run the same demo as we did in Data Factory, simply change the file to .xlsm or .xlsx and output a CSV. To do that I take the data frame, use repartition(1) to combine all the partitions into one, and write in the com.databricks.spark.csv format, with overwrite mode and headers, as planes.csv. This outputs the planes CSV onto our blob storage. If we go back to the storage account, open Overview and the demo container, we can now find planes.csv. This is the HDFS-style output, so normally there would be a lot of part files, but since we repartitioned there is a single part containing the entire CSV export. That easily you can grab the data out of Excel files and export it using Databricks. Let's close Data Factory, since we no longer need it.

What else do we have? The crealytics library gives us a WorkbookReader class to get some metadata about the Excel file itself. For instance, we can use it to read the list of sheets in the file, which allows for more dynamic approaches where we export all the sheets at once - something you couldn't do in Data Factory - and it's fairly easy, because you only pass the path. Let's use that in Databricks: open a new cell and paste the command to read all the sheets - a WorkbookReader over the mounted data.xlsx - and read the sheet names. Run it, and we see the Cars, Movies, Planes, and Persons sheets. As a small example of what you can do with this, you can then write a short foreach over all the sheets, read the file by dynamically passing the sheet name, and then repartition and output each one as a CSV file. When you run this nice little loop, it iterates over all the sheets and exports them all at once with a simple script. This way you can convert your Excel file into multiple CSV files. If we go back to Blob Storage, close the preview of the previous file, refresh, and go back up, we have a databricks-output folder - the path we specified in the script - with one <sheet name>.csv per sheet. Opening it, we now have all four sheets as separate extracts, each containing one part file with all the data inside. That easily we can use a dynamic approach and the WorkbookReader to parse our Excel files and export all the sheets to CSVs.
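A sketch of the dynamic "export every sheet" loop described above, assuming the demo file at /mnt/myblob/data.xlsx and a hypothetical /mnt/myblob/databricks-output target folder; the WorkbookReader call follows the crealytics documentation:

```scala
// Enumerate the sheets of the workbook and export each one as a single-part CSV.
import com.crealytics.spark.excel.WorkbookReader

val path = "/mnt/myblob/data.xlsx"

// WorkbookReader exposes workbook metadata, such as the list of sheet names.
val sheetNames = WorkbookReader(
  Map("path" -> path),
  spark.sparkContext.hadoopConfiguration
).sheetNames

// Read each sheet by name and write it out, repartitioned to one part file.
sheetNames.foreach { sheet =>
  spark.read
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("dataAddress", s"'$sheet'!A1")
    .load(path)
    .repartition(1)
    .write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .mode("overwrite")
    .save(s"/mnt/myblob/databricks-output/$sheet.csv")
}
```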
You can of course do a lot more with that, but this is a simple example. Back to the presentation: what else does crealytics give us? You also get schema support: you can create a StructType that lets you enforce a schema on the Excel data. Because Excel is a free-form editing tool, you will very often see issues with the data inside, and one way to guard against that is to define a static schema before you read the file; once defined, you attach it using the schema option. In Databricks, you do a couple of small imports - Spark SQL and the SQL types - and create a StructType. In this case we create a movie schema, because we'll be importing the Movies sheet with its three columns: MovieID, Title, and Genres. MovieID is a DoubleType, a numeric type, and the other two are simple StringType text fields; the fields are non-nullable, so we expect all the data to be filled in. Once the schema is defined, you use the same code as previously - spark.read, the format, the sheet name, the header - but additionally you pass the schema you defined in the script, and then load the data. Displaying it doesn't change much on screen except the column names - MovieID, Title, and Genres - but we enforced the types and did some initial validation of the file, which is a pretty neat feature.

The last thing I want to show you is how easy it is to write using crealytics: it's as simple as reading, except you use the write function and, at the end, save to a path instead of load. Let me go to the Azure portal for the very last demo. I'll need a new script, and in it I'll do a small aggregation. Let me first read the Planes data again, so our data frame df now holds the planes. Then we add one more line using the Spark SQL functions to create a data frame that groups by maker and calculates the count, and display that aggregation - it counts how many planes there are per maker. Once we have the aggregation in the aggregated data frame and the original data in df, we can start exporting. Open yet another cell and put in aggregated.write, pass the Excel format, include the headers in the output, set the dataAddress option to the name of the sheet to export, and set the mode to overwrite in case there are any existing files; save it to output.xlsx on our blob storage. Additionally, I want to show that you can write multiple sheets: after saving the aggregated sheet, take df - our original data - call write again with all the same parameters, but change the sheet name, in this case to "Original", and remember that the mode here should be append: the first write creates (or recreates) the file, and append adds to the existing one, so we end up with output.xlsx containing two sheets. Press Ctrl+Enter, and in just a couple of seconds our Excel file is created.
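A sketch of the schema enforcement and the two-sheet Excel export described above. The column names (MovieID, Title, Genres, Maker) and the output path are assumptions based on the demo workbook:

```scala
// Enforce a static schema on the Movies sheet instead of relying on inference.
import org.apache.spark.sql.types._

val movieSchema = StructType(Array(
  StructField("MovieID", DoubleType, nullable = false),
  StructField("Title", StringType, nullable = false),
  StructField("Genres", StringType, nullable = false)
))

val movies = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("dataAddress", "'Movies'!A1")
  .schema(movieSchema)
  .load("/mnt/myblob/data.xlsx")

// Read the Planes sheet and aggregate: number of planes per maker (column name assumed).
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("dataAddress", "'Planes'!A1")
  .load("/mnt/myblob/data.xlsx")

val aggregated = df.groupBy("Maker").count()

// Write two sheets into one workbook: overwrite creates the file, append adds a sheet.
aggregated.write
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("dataAddress", "'Aggregated'!A1")
  .mode("overwrite")
  .save("/mnt/myblob/output.xlsx")

df.write
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("dataAddress", "'Original'!A1")
  .mode("append")
  .save("/mnt/myblob/output.xlsx")
```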
Once this is done - it took just seven seconds - let's go back to our Blob Storage, go one level up, close the preview, go to the top level, and we can now find output.xlsx. If we download the file and open it, we can review the contents: we have an Aggregated sheet with our aggregated data and an Original sheet with the original data it was based on. This is how easily you can use crealytics to export Excel files as well, across multiple sheets.

Let's go back to the presentation. I think that's it for today, but I want you to remember that crealytics gives you a few more options that we didn't talk about (a sketch of these options follows below). First, treatEmptyValuesAsNulls lets you control whether empty values in the Excel file should be treated as nulls. inferSchema controls whether the schema is picked up automatically based on the data, and excerptSize lets you choose how many rows inferSchema works on. timestampFormat lets you specify the default format of timestamp data within the Excel file. maxRowsInMemory lets you specify how many rows you want to process at a time, which is very important for read performance: this is a stream-based operation, and for XLSX you can read quite a lot of rows quickly, but this library lets you control that as well. And if your workbook has a password, there is an additional option you can use to work around that, which is pretty cool.

As you can see, processing Excel files in Azure is not that hard - we have a lot of tools and a lot of options, and it's up to us as architects or data engineers to decide on the right approach. For today, that's it. If you like my work, support the channel by subscribing, liking, and commenting, and see you in the next episode.
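For reference, a minimal sketch of those extra options; the option names follow the crealytics spark-excel documentation, and exact availability may vary by library version. The timestamp format and password are placeholders.

```scala
// Sketch of the additional crealytics read options mentioned at the end of the video.
val tuned = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("dataAddress", "'Cars'!A1")
  .option("treatEmptyValuesAsNulls", "true")        // treat empty cells as nulls
  .option("inferSchema", "true")                    // infer column types from the data
  .option("excerptSize", "10")                      // rows used for schema inference
  .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // default timestamp parsing format
  .option("maxRowsInMemory", "20")                  // stream the file, keeping few rows in memory
  .option("workbookPassword", "<password>")         // only needed for protected workbooks
  .load("/mnt/myblob/data.xlsx")
```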
Info
Channel: Adam Marczak - Azure for Everyone
Views: 47,583
Keywords: azure, crealytics, spark-excel, spark, excel, xls, xlsx, xlsm, xlsb, data factory, data, etl, databricks
Id: pc8Kv-lRD8k
Length: 34min 14sec (2054 seconds)
Published: Tue Jul 21 2020