Azure Databricks Tutorial | Data transformations at scale

Reddit Comments

Very cool! Thank you!!

— u/sebastianatmicrosoft, Aug 19 2019
Captions
Hey there! This is Adam again, and in this video I'm going to be talking about Azure Databricks, one of the leading technologies for big data processing. It's fast, it's scalable and it's easy to use, and in this video I'm going to show you why, so stay tuned.

So, Azure Databricks. What is Databricks? I think the easiest way to explain it is that it's the big data technology that Microsoft brought in as one of the services in Azure. It's a very cool platform based on Apache Spark, and what makes it cool is that it was created and designed by the same people who created Apache Spark. Since Apache Spark is one of the leading big data technologies on the market, it really promises fast transformations in the cloud.

Because it's based on Apache Spark, the key features you get are, first of all, Spark SQL and DataFrames, a library that lets you work with your structured data pretty much as tables, like in any other system we've worked with. Additionally, there are services for streaming data, so if you're building IoT or live-event applications, this is a great example of how you can perform transformations on a live system. You also have the machine learning library, which lets you do machine-learning-style work, prepping data and training models using Spark itself. You also have GraphX, so if you're building social-media-style graph applications, this is a great place to do so. Everything is based on the Spark Core API, which means you can use R; you can use Spark SQL, which is a little different from normal SQL (more limited, but still very powerful), so if you already know SQL that can be a very good option without needing to learn another language; you also have Python and Scala, the two main languages you'll be using when developing in Databricks; and you have Java if you need it.

Databricks as a platform has a lot of features of its own besides being Apache Spark based. It has a runtime that combines all those features into a single platform which gives you workspaces, places where you can collaborate with your friends and colleagues on your scripts. If you have multiple scripts you can combine them into workflows; workflows can be nested scripts, scripts calling other scripts, basically a simple ETL. You also have DB I/O, the Databricks input/output library, which lets you easily connect to multiple services both in Azure and beyond, like Apache Kafka and Hadoop. Databricks also has something called Databricks Serverless: when you work with Databricks you just specify what kind of server you want, how powerful it is, how many of those servers you want and which runtime you want on them, and that's it.
Databricks as a platform will manage the creation and handling of those clusters for you, without you needing to manage them at all. And lastly there is something called enterprise security: Databricks is integrated very well with Azure and Azure Active Directory, so access, credentials and authorization are all based on Azure AD, and you can just use your corporate credentials and identity to use Databricks itself.

There are a lot of storage solutions it can connect to, but the five main ones with native connectivity are Blob Storage, Data Lake (both version 1 and 2), SQL Data Warehouse, Apache Kafka and Hadoop; we already mentioned some of those. There are also several kinds of applications you can use Databricks for. The most common ones are machine learning scenarios, streaming scenarios, data warehousing (your typical ETL, prepping the data) and Power BI, which has become a very common case recently, but there are many other applications as well.

Since this is a collaborative platform, it's really easy for users to work with. There's a UI which is, I would say, very simple once you know the platform, and because there's a UI you don't really have to be technically savvy to use it, so your typical data scientist, engineer or analyst, once they learn the platform, finds Databricks very easy to use as well.

The typical scenario you'll see Databricks in is the prep and train stage for machine learning, or the typical prep that is part of an ETL. You normally have an ingestion layer, either Data Factory, Kafka, IoT Hub, Event Hub or something else gathering your data from external systems and putting it on a blob or data lake. This is where Databricks comes in: usually Databricks will grab the data from the blob or data lake, transform it (or train the models if it's a machine learning scenario) and put it into some sort of database. That can be SQL Database, Cosmos DB, a data warehouse, maybe even Analysis Services, or you can of course put it back on blob storage if you want; that's up to you, and since this is a scripting platform you can design this logic as you see fit.

So without further ado, let's go into the portal and start doing some demos. In Azure we will need a couple of things. First of all, we need to create a Databricks workspace. To do that, hit Create a resource, type "databricks" and hit Create when you find the template. Just give it a name; I'm going to call it a4edbintro. You also need to provide a resource group, which is where your Databricks workspace will reside, so I'm going to pick the resource group I created previously. I'm going to leave the location as it is, since that's the closest data center to me, and in the pricing tier I'm going to select Trial. Please note that with the 14-day free trial, only the DBUs, the licensing cost for Databricks, are free; you're still going to pay for the virtual machines themselves. I'm going to leave the last option, deploy to a virtual network, as No, because that's not part of today's training, and hit Create.

The second resource we're going to need today, because we're going to be transforming data and, as we saw during the presentation, we need to ingest it from somewhere, is a blob storage account, where I'll be uploading files for our processing purposes. So I'm going to hit plus and quickly find Storage account.
I'm going to provide a name, but first I need to select a resource group, the same resource group as previously. I'm also going to leave North Europe as the location, provide the name a4edbintro, and leave everything else as default. Of course you can change the replication, but since this is just the ingestion layer I'm going to leave locally redundant storage; it's the cheapest and fastest storage we can get. So let's hit Review + create and then Create. Provisioning of the resources takes about a minute or two, so I'm going to speed this up.

The provisioning finished, so we can go to our Databricks resource. Let's open our resource group; these are the two resources we just created. I'm going to click on the Databricks resource and hit Launch Workspace. This brings me to a separate portal, the portal where you're actually going to be doing all the work; there's literally almost nothing Databricks-related you can do in the Azure portal itself other than some virtual network connectivity. This initialization takes about a minute or two as well, but I've seen rare cases where it takes up to an hour, so if that happens for you, just wait patiently or come back later.

In the Azure Databricks portal you can do a couple of things, but the most important ones for today are running scripts and creating clusters. Clusters are the sets of servers that will run your workloads and execute your scripts. So what we need to do right now is create a new cluster: hit New Cluster and provide the details. I'm going to call it "demo" because the name isn't important to me right now, but if you're collaborating, always pick a meaningful name. Next is the cluster mode, Standard or High Concurrency: if multiple users are working on the same cluster it is advised to use High Concurrency; if this is a cluster for your ETL transformations, just pick Standard. Next you have Pool, which I'm going to leave as None, and the Databricks runtime version; for the demo leave this as the default, but as you can see there are a lot of Databricks runtimes, so if you need Scala or Spark in a different version you can always find it here. I'm going to leave runtime 5.4 and Python version 3.

The next two options are very important. First, Enable autoscaling: if your workloads sometimes need more and sometimes less power, you can enable autoscaling, and here it would scale between 2 and 8 workers for your processing needs. In our case I'm going to disable this and change the number of workers to 1. The reason is that our scripts are very, very small, and if you take more servers you will not only pay more, the execution will actually take longer, because the time it takes to split the tasks across two servers and combine the results is larger than the actual work that needs to be done. So for training purposes I would always advise a non-autoscaling cluster with only one worker. There's also a very good feature here called Terminate after: if you finish processing your workloads and go home, the cluster will automatically be deleted for you, so you don't pay anything at all. This is why, even though you pay an additional license for Databricks itself, you usually end up paying less than for other services that transform the data. I'm going to set this to 30 minutes, so if I don't use the cluster for 30 minutes it gets deleted. Finally I'm going to pick the smallest server available, which I think is the F-series F4s, and hit Create Cluster.
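As a side note, the same cluster can also be created programmatically through the Databricks Clusters REST API instead of the portal UI. The sketch below is only illustrative: the workspace URL, the personal access token and the exact runtime/node-type strings are assumptions, not values shown in the video.

```python
import requests

# Illustrative placeholders - use your own workspace URL and personal access token
DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "demo",
    "spark_version": "5.4.x-scala2.11",   # Databricks Runtime 5.4 (assumed version string)
    "node_type_id": "Standard_F4s",       # smallest F-series node, as picked in the video
    "num_workers": 1,                      # autoscaling disabled, a single worker
    "autotermination_minutes": 30,         # "terminate after" 30 minutes of inactivity
}

# Ask the Clusters API to create the cluster; the response contains the new cluster's ID
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```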
After doing so, your server will be up probably within one to five minutes, so again I'm going to speed this up. Now that our cluster is created and running, notice first of all that you have two nodes. There are two nodes because there is always a master and a worker: the worker is the virtual machine that runs the scripts you create, and the master is the virtual machine that orchestrates the execution of those scripts and splits them across multiple workers if you have more than one.

So let's go to the workspace and start creating our scripts. Click on Workspace, go to Users, click on your username; this is where I'm going to be doing my work, my personal workspace where I can create notebooks. I'm going to right-click, hit Create Notebook; a notebook is basically your scripting environment. I'm going to pick Python and type "demo 1", which will be the name of the notebook where I'll write my scripts. Of course this is not a demo of how to write Python or Scala, so I prepared all the scripts beforehand.

The first demo we're going to run, in Python, was prepared by Microsoft; Microsoft provides sample data that you can load very quickly to see how Databricks works, so I'm just going to copy-paste it. First of all I'm going to create four variables: the blob account called azureopendatastorage, the container, the relative path and the SAS token. The SAS token is the key we will use to authorize against this container and pull the Boston data. To run this block you simply hit Ctrl+Enter, or hit the run cell button here. As you can see, our command executed very, very fast. A block is a series of commands written in Python, and you can combine multiple blocks into a single notebook. To create a new block you go here, insert a new cell and paste the next script. This one is nothing to worry about; I would say it's configuration that needs to happen anyway, just for your information: it's the path to the blob storage. It's not very user-friendly in terms of how to put it together, but the documentation specifies very nicely how it should look, so usually you just grab it from an example and replace some values. We just need to run this, and that means our remote blob path uses the wasbs (blob) protocol, the city data container on azureopendatastorage, and Safety/Release/city=Boston.

So let's grab the data from this container. I'm going to grab the next script, and it says spark.read, so we're telling Spark to read data; then parquet, which is the format of the data (just like CSV or JSON files, there are also Parquet files); read it from the path from the previous step; then print some information and create a view. Views are used if you want to use SQL. Let's hit Ctrl+Enter and run this; in about five to seven seconds we should get the data loaded. It was actually a bit faster, under four seconds. Now that this works, we can go to the next block and display something about this data; maybe I'll count the rows. I'm going to hit Ctrl+Enter and run the job, and we have about 127,000 rows of data that we read and can already work on. And since we created a view called source, we can also do one of those powerful things I was talking about, which is switching languages. To switch languages you just type %sql, and once you do that you can start typing SQL commands, like this: select star, grab all the rows from the view called source, and only take the first ten rows. Hit Ctrl+Enter and see the results.

It's really cool that we were able to connect to an external blob storage, use SQL to query the data and display it on the screen, but what's more, if you're an analyst doing data analysis, you can grab your data here using Download CSV and maybe pass it along to someone, or hit this button and visualize it as a bar chart or maybe a pie chart. For the pie chart you have additional plot options, just like Excel pivot tables; you can use those here and change the values, maybe change the aggregation to count, and hit Apply, and this is our chart. Pretty cool, isn't it?
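Collected into one place, the cells from this first demo look roughly like the sketch below. It is based on Microsoft's public Boston safety open dataset as described in the video; the account, container and path values are the ones mentioned there, while the SAS token is a placeholder for whatever the sample page provides.

```python
# Connection details for the public open-dataset account (values as used in the video)
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Boston"
blob_sas_token = r"<sas-token-from-the-sample>"   # key used to authorize against the container

# Register the SAS token so Spark can read from the container over wasbs://
wasbs_path = "wasbs://%s@%s.blob.core.windows.net/%s" % (
    blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
    "fs.azure.sas.%s.%s.blob.core.windows.net" % (blob_container_name, blob_account_name),
    blob_sas_token)

# Read the Parquet files and expose them to SQL as a temporary view called "source"
df = spark.read.parquet(wasbs_path)
print("Remote blob path: " + wasbs_path)
df.createOrReplaceTempView("source")

# Count the rows, then query the view; in a notebook the query can also live
# in its own %sql cell as "SELECT * FROM source LIMIT 10"
print(df.count())
spark.sql("SELECT * FROM source LIMIT 10").show()
```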
So that was the Python demo. I don't feel this demo really shows what we can do, because we only used an external storage and a few quick commands and got some results; I really want a more tangible, hands-on scripting experience. So we're going to create one more script, this time in Scala. Let's go to the workspace, right-click, create a new notebook, call it demo 2, and this one will be in Scala. What we're going to do here is first upload our own data: I'm going to upload these 25 rows of data saved as a JSON file, a small radio JSON data set, which I've already downloaded. What I need to do now is go back to Azure, go to my resource group and open the storage account I created; this is the reason we created it, because I want a place where I can put my own data, transform it and put it back. I'm going to go to the Blob service, create a new container called staging, leave it as private, then go into staging and upload my file, the small radio JSON. It uploaded successfully, so we now have the small radio JSON file in our container, and that's all we need to do in Azure; let's go back to scripting.

This time we're writing code in Scala, and I prepared some of the code already. First of all, for this code to work we need a container name, a storage account name and a SAS token, the same kind of token that Microsoft gave us for the sample; we're going to generate our own in a second. Our container is called staging, so let's copy-paste that and put it here. We need the storage account name, so let's go back and grab it. And to get a SAS token, go to the Blob service, open the Shared access signature blade and select the services you need access to. I only need access to blobs, so I'm going to leave it at Blob and leave everything else as default, which means the access I'm granting is only valid until 4:00 a.m. tomorrow morning.
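For orientation, the setup the next few cells perform, defining those three values and mounting the container, looks roughly like this. The video does it in Scala; this is an equivalent Python sketch (dbutils.fs.mount takes the same arguments in both languages), and the storage account name and SAS token are placeholders.

```python
# Placeholders - use your own storage account name and the SAS token generated in the portal
container_name = "staging"
storage_account_name = "<your-storage-account>"
sas_token = "<sas-token-from-the-portal>"   # short-lived: regenerate it if it has expired

# Mount the blob container so it can be used like a local path under /mnt/staging
dbutils.fs.mount(
    source=f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/",
    mount_point="/mnt/staging",
    extra_configs={
        f"fs.azure.sas.{container_name}.{storage_account_name}.blob.core.windows.net": sas_token
    },
)
```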
So I'm going to generate the SAS, copy it to the clipboard and paste it here. This is important: it's a very short-lived token, so if I ran the script tomorrow it would fail; remember that. I'm going to hit Ctrl+Enter, which initializes those variables, then hit plus to add a new cell, and this next part is very important: I'm using the Databricks utilities to mount a storage. Just like you map a normal drive on your computer, on Windows or Linux, we're doing the same thing here, except instead of mapping a drive we're mounting a blob storage. This is a very, very cool feature, because using normal Linux shell commands you can copy data to this mount and it will land on the blob storage. I'm going to hit Ctrl+Enter... sorry, let's name the mount point differently; let's call it staging. The first run failed because I already had a "demo" mount from when I was testing this scenario, so let's call it staging and run again. This command takes about 23-24 seconds to mount, and from this point onwards we can use this mount point to load our data. After about 22 seconds our staging mount is ready.

So let's grab another line of code, one that reads the small radio JSON from /mnt/staging, and I'm going to display this data and hit Ctrl+Enter. As you can see it was really fast, and what's cool here is that we used it as if it were a normal file on our drive, except it lives on a blob storage, and yet we were able to connect, load the data and display it. Very, very cool, to be honest.

Next we'll create another block and paste in a select statement. This is similar to a SELECT in SQL: we take our data frame and select only a few columns, first name, last name, gender, location and level, and display the resulting data frame to see the results. Hit Ctrl+Enter; everything looks fine, and that was the first transformation we did using Databricks. Let's do the second one by pasting the next command, which takes the result from the previous step, the data frame with the specific columns, and renames one of the columns, level, to subscription type. Hit Ctrl+Enter and again see the results: the column was renamed. What else can we do here? The same as before: let's create a view called "renamed"; Ctrl+Enter, and the view was created.

How about some simple aggregation, maybe counting how many of each subscription type there are? I'm going to type %sql and select a count grouped by subscription type, hit Ctrl+Enter, and we get ten free subscriptions and fifteen paid ones. If you want to save this result in a data frame for later, you just wrap the select statement in spark.sql and assign it to an aggregate variable; hit Ctrl+Enter, and now the aggregate variable holds the aggregation we created a second ago. Since we have that, we can go further and save it to blob storage, so let's grab one more line: take the aggregate result from the previous step and write it with mode "overwrite", meaning that if any results already exist on the blob storage they will be overwritten, and write it as a JSON file; that's the format specification, and it's as easy as changing this to CSV to get a CSV file instead. So let's hit Ctrl+Enter, and it finished.
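The transformation cells just described correspond roughly to the following sketch. The video uses Scala, but the DataFrame API calls are the same in Python; the file name, column names and output path follow what the video describes and are illustrative rather than exact.

```python
# Read the uploaded JSON file through the mount point
df = spark.read.json("/mnt/staging/small_radio_json.json")
display(df)

# First transformation: keep only a few columns
specific_columns_df = df.select("firstname", "lastname", "gender", "location", "level")

# Second transformation: rename "level" to "subscriptiontype"
renamed_df = specific_columns_df.withColumnRenamed("level", "subscriptiontype")
renamed_df.createOrReplaceTempView("renamed")

# Aggregate with SQL and keep the result as a DataFrame
aggregated_df = spark.sql(
    "SELECT COUNT(*) AS total, subscriptiontype FROM renamed GROUP BY subscriptiontype")

# Write the result back to blob storage via the mount;
# switching .json(...) to .csv(...) would produce CSV output instead
aggregated_df.write.mode("overwrite").json("/mnt/staging/output/aggregate")
```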
If everything finished and we didn't mess anything up, we can go back to our blob storage: go to the Overview tab, then the Blob service, open staging and open the output aggregate folder. We have a file that is empty, zero bytes, because it's just a metadata file, and then the aggregate itself, which is actually a split CSV file: this one contains the ten free subscriptions and this one the fifteen paid ones. Why is it split? Because this is how Spark works: it splits data into partitions for more effective processing. Of course you can control that; if you want a single file you can always combine the data with simple commands, but I wanted to leave it brief. What's cool is that Databricks and Spark itself know how to use those files and combine them into a single data frame. The other files you have here are the success file, which is the status of the processing, the started file, which says when the processing started, and the committed file, which says how many files were created during the processing; as you can see it says we created two, pointing at these two files, and this is how Spark knows which files were produced and how to process them further.

So let's go back to the presentation and summarize what we learned. First of all, in Azure, in your resource group, you have the Databricks workspace; this is the resource that exposes some features of the service, but only very limited ones, because in reality there is a separate portal, azuredatabricks.net, with a region and a workspace, and you do pretty much everything Databricks-related in that portal. What you can do there is, first of all, work with workspaces, shared ones or per-user workspaces; you can manage clusters and manage jobs (job clusters are clusters created to run a job and deleted immediately after it); you can have machine learning experiments and much more. One interesting fact is that whenever you create a cluster, another resource group is created in Azure, hidden from you if you're not an administrator, and that is where your cluster resides. As you know we had a two-node cluster, which means there is a resource group in Azure right now with two virtual machines in it: that is basically our cluster, and those virtual machines are what you pay for.

And we're done. In just a couple of minutes we were able to transform data from blob storage back into blob storage, and we can be sure this will scale up to gigabytes or even petabytes of data, because Azure Databricks is Spark-based big data technology. So that's it for today. If you liked this video, like it; if you really liked it, leave a comment or subscribe to see more, and see you next time.
Info
Channel: Adam Marczak - Azure for Everyone
Views: 180,822
Keywords: Azure, Databricks, Python, Scala, tutorial, big data, machine learning, azure databricks tutorial
Id: M7t1T1Q5MNc
Length: 28min 34sec (1714 seconds)
Published: Mon Aug 19 2019