Hey there! This is Adam again, and in this video I'm going to be talking about Azure Databricks, one of the leading technologies for big data processing. It's fast, it's scalable, and it's easy to use, and in this video I'm gonna show you why that is, so stay tuned.

So, Azure Databricks. What is Databricks? I think the easiest way to explain Databricks is that it's the big data technology that Microsoft brought in as one of the services in Azure. It's a very cool platform based on Apache Spark, and what makes it cool is that it was created and designed by the same people who actually created Apache Spark. Since Apache Spark is one of the leaders among big data technologies on the market, it really promises fast transformations in the cloud.

Because it's based on Apache Spark, the key features you get are, first of all, Spark SQL and DataFrames, a library that allows you to work with your structured data pretty much as tables, like in any system we've been working with already.
Additionally, you have services that allow you to stream data, so if you're doing IoT or live event applications, this is one of the great examples of how you can perform transformations on a live system. You also have a machine learning library, which allows you to do machine-learning types of transformations, prepping and training models using Spark itself. You also have GraphX, so if you're building social-media-style graph applications, this is also a great place to do so. And everything is based on the Spark Core API, which means you can use R; you can use Spark SQL, which is a little bit different from normal SQL, it's more limited, but still very powerful, so if you know SQL that can be a very good feature for you to use without needing to learn another language; you also have Python and Scala, which are the two main languages you'll be using when developing in Databricks; and you also have Java if you need it.
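Just to make that concrete, here is a minimal PySpark sketch of the DataFrame-plus-Spark-SQL idea; the file path and column names are made up for illustration, and spark and display() are what a Databricks notebook gives you out of the box.

    # Minimal sketch: load structured data, register it as a view, query it with Spark SQL.
    df = spark.read.json("/mnt/demo/events.json")      # hypothetical path; 'spark' is the notebook's SparkSession
    df.createOrReplaceTempView("events")               # expose the DataFrame to Spark SQL
    top_users = spark.sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id ORDER BY cnt DESC")
    display(top_users)                                 # display() renders the result as a table or chart in the notebook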
Databricks as a platform has a lot of features itself. Besides being based on Apache Spark, it also has a runtime. The runtime combines all those features together into a singular platform which delivers workspaces, places where you can collaborate with your friends and colleagues on your scripts. If you have multiple scripts, you can combine them into workflows; workflows can nest scripts, with scripts calling other scripts, basically a simple ETL. You also have DB I/O, the Databricks input/output library, which allows you to easily connect to multiple services, both in Azure and beyond, like Apache Kafka and Hadoop. Databricks also has something called Databricks serverless. What this really means is that when you work with Databricks, you just specify what kind of server you want, how powerful it is, how many of those servers you want, and what runtime you want on them, and that's it; Databricks as a platform will manage the creation and handling of those clusters for you, without you needing to manage them at all. And lastly, there is something called enterprise security: Databricks is integrated very well with Azure and Azure Active Directory, so handling all those accesses, credentials, and authorization is based on Azure AD, and you can just use your corporate credentials and identity to use Databricks itself.
There are a lot of storage solutions it can connect to, but the five main ones with native connectivity are Blob Storage, Data Lake (in both version 1 and 2), SQL Data Warehouse, Apache Kafka, and Hadoop; we already mentioned some of those. There are also several kinds of applications you can use Databricks for; the most common ones are machine learning scenarios, streaming scenarios, data warehousing (your typical ETL, prepping the data), and Power BI, which is a very common case recently, but there are many other applications you can use Databricks for. Since this is a collaborative platform, it is really easy for users to work with. There's a UI, and I would say it's very simple once you know the platform, and because there is a UI you don't really have to be tech savvy in order to use it, so your typical data scientist, engineer, or analyst, once they learn the platform, will find it very easy to use Databricks as well.
The typical scenario you would see Databricks in is during the prep and train phase for machine learning, or your typical prep, which is part of ETL. You normally have an ingestion layer, so either Data Factory, Kafka, IoT Hub, Event Hub, or something else gathering your data from external systems and putting it on a blob or data lake. This is where Databricks comes in: usually Databricks will grab the data from the blob or data lake, transform it, or train the models if it's a machine learning scenario, and put it in some sort of database. That can be SQL Database, Cosmos DB, Data Warehouse, or maybe even Analysis Services, or of course you can put it back on blob storage if you want; that's up to you. Since it's a scripting platform, you can actually design this logic as you see fit.
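As a rough sketch of that read-transform-write shape in a notebook, something like the following would do; every path and column name here is hypothetical, and a JDBC write to a database would follow the same pattern.

    # Hypothetical prep step: read what the ingestion layer landed, clean it up, write it onward.
    raw = spark.read.json("/mnt/datalake/ingested/events.json")               # data landed by Data Factory / Event Hub
    prepped = raw.dropDuplicates().filter("event_type IS NOT NULL")           # example clean-up transformations
    prepped.write.mode("overwrite").parquet("/mnt/datalake/curated/events")   # or a JDBC sink for a database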
So without further ado, let's go into the portal and start doing some demos. In Azure we will need a couple of things. First of all, we're gonna need to create a Databricks workspace. To do that, hit 'Create a resource', type 'databricks', and hit create when you find the template. Just give it a name; I'm gonna call it a4edbintro. You also need to provide a resource group; this is where your Databricks workspace will be residing, so I'm gonna pick the resource group that I created previously. I'm gonna leave the location on North Europe, since this is the closest data center to me, and for the pricing tier I'm gonna select trial. Please note that with the trial, the 14-day free DBU only covers the licensing cost for Databricks, which you're not gonna pay for, but you're still gonna pay for the virtual machines themselves. I'm gonna leave the last option, deploy to a virtual network, as No, because that's not part of the training for today, and hit create.

The second resource that we're gonna need today, because we're going to be transforming data and, as we saw during the presentation, we need to ingest it from somewhere, is a blob storage account, where I will be uploading files for our processing purposes. So I'm gonna hit plus and quickly find a storage account here. I'm gonna provide a name, but first I need to select a resource group, the same resource group as previously. I'm also going to leave North Europe, provide the name a4edbintro, and leave everything else as default. Of course you can change the replication, but since this is just the ingestion layer, I'm gonna leave locally redundant storage; it's the cheapest and fastest storage we can get, and gives us good enough performance here. Let's hit review and create, and hit create. Provisioning of the resources takes about a minute or two, so I'm gonna speed this up. The provisioning has finished, so we can go to our Databricks resource.
Let's go to the resource group, open it, and there are the two resources that we just created. I'm gonna click on the Databricks resource and hit 'Launch Workspace'. This is gonna bring me to a separate portal, the portal where you're actually gonna be doing all the work; there's literally almost nothing related to Databricks that you can do here in the Azure portal itself, other than some virtual network connectivity. So let's go to the Databricks platform. This initialization takes about a minute or two as well, but I've seen rare scenarios where it sometimes takes up to an hour, so if this happens for you, just patiently wait or maybe come back later.
In the Azure Databricks portal you can do a couple of things, but the most important ones that we're going to be doing today are running scripts and creating clusters. Clusters are the sets of servers that will be executing your scripts and running your workloads. So what we need to do right now is create a new cluster; you can quickly hit 'New Cluster' here and provide the details of your cluster. I'm gonna call it demo, because the name isn't important for me right now, but in case you're collaborating, always pick a meaningful name. What about the cluster mode, standard or high concurrency? If multiple users are working on the same cluster, it is advised to use high concurrency; if this is a cluster for your ETL transformations, just pick standard. Next you have pool; I'm gonna leave this as none. Then we have the Databricks runtime version; for the demo purposes leave this as default, but as you can see there are a lot of Databricks runtimes, so if you need Scala or Spark in a different version you can always find it here. I'm gonna leave runtime 5.4, and I'm gonna leave the Python version as 3.
The next two options are very important. First of all, you have 'Enable autoscaling': if you're doing workloads that sometimes need more and sometimes less power, you can enable autoscaling, and this will pretty much scale from 2 to 8 workers depending on your processing needs. In our case I'm gonna disable this and change the number of workers to 1. The reason for doing that is that our scripts are very, very small, and if you grab more servers, not only are you gonna pay more, but the execution will actually take more time, because the time it takes to split the tasks across two servers and combine the results back together is larger than the actual work that needs to be done. So for training purposes I would always advise to just grab a non-autoscaling cluster with only one worker. There's also a very good feature here called 'Terminate after': it means that if you have a cluster, you've finished processing your workloads and you went home, once you stop processing it will automatically terminate the cluster for you, so you don't pay anything at all. This is why, even though you pay an additional license fee for Databricks itself, you usually end up paying less than for the usual services that transform the data. I'm going to set this to 30 minutes, so if I'm not using the cluster for 30 minutes, it's going to be shut down. And I'm gonna pick the smallest server available; I think the F-series F4s is the smallest server that you can use. Then hit 'Create Cluster'. After doing so, your server will be up in probably between one and five minutes, so again I'm going to speed this up.
Since our cluster was created and it's running, notice first of all that you have two nodes. There are two nodes because there's always a master and a worker: the worker is the server, a virtual machine that runs all the scripts that you create, and the master is the virtual machine that orchestrates the script execution and splits it across multiple workers if you have more than one. So let's go to the workspace and start creating our scripts. Hit 'Workspace', go to 'Users', hit your username, and this is where I'm gonna be doing my work right now; this is my personal workspace where I can create notebooks. I'm going to right-click and hit 'Create Notebook'; a notebook is basically where you write your scripts. I'm gonna pick Python and type 'demo 1'; this is going to be the name of the notebook where I'll be writing my scripts. Of course, this is not a demo of how to write Python or Scala scripts, therefore I prepared all the scripts beforehand.
The first demo in Python that we're gonna run is prepared by Microsoft; Microsoft provides data that you can query very quickly to see how Databricks works, so I'm just gonna copy-paste it. First of all, I'm gonna create four variables: the blob account called azureopendatastorage, the container, the relative path, and the SAS token. The SAS token is the key we'll use to get authorized to this container and pull the Boston data. In order to run this block, you simply need to hit Ctrl+Enter, or hit this 'Run Cell' button here. Our command, as you see, was very, very fast.
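That first cell looks roughly like this; the values below come from Microsoft's public Azure Open Datasets sample described in the video, and the SAS token itself is elided.

    # Variables pointing at Microsoft's public Boston safety dataset (SAS token elided):
    blob_account_name = "azureopendatastorage"
    blob_container_name = "citydatacontainer"
    blob_relative_path = "Safety/Release/city=Boston"
    blob_sas_token = ""   # paste the read-only SAS token from the sample here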
So we executed this block; a block is a series of commands written in Python, and you can actually combine multiple blocks into a single notebook. To create a new block, you go here, insert a new cell, and paste the new script; oh sorry, that's the same script, so I'm gonna copy the next one. This one is nothing to worry about; I would say it's configuration that needs to happen anyway, so this is just for your information. This is the path to the blob storage. It's not very user-friendly in terms of how to put it together, but the documentation specifies very nicely how it should look, so usually you just grab it from an example and replace some values. We just need to run this, and that tells us our remote blob path uses the blob (wasbs) protocol, the citydatacontainer container on the azureopendatastorage account, and the Safety/Release/city=Boston path.
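That configuration cell, roughly, assuming the variable names from the previous cell:

    # Build the remote blob path and hand the SAS token to Spark:
    wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
    spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
    print('Remote blob path: ' + wasbs_path)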
Right, so let's grab data from this container. I'm gonna grab the next script, and it says spark read; basically we're telling Spark to read the data. Next we're gonna say parquet; this is the format of the data (just like there are CSV or JSON files, there are also Parquet files), and read it from the path from the previous step, then print some information and create a view. Views are used if you want to use SQL. So let's hit Ctrl+Enter and run this; in about five to seven seconds we should get the data loaded. It was actually a bit faster, under four seconds.
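That read cell looks roughly like this, again assuming the variables defined earlier:

    # Read the Parquet data and register it as a temporary view called "source":
    df = spark.read.parquet(wasbs_path)
    print('Register the DataFrame as a SQL temporary view: source')
    df.createOrReplaceTempView('source')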
Since that worked, right now we can go to the next block and maybe display something about this data. What I'm gonna do is count the rows of the data, so I'm gonna hit Ctrl+Enter and run the job. We have about a hundred and twenty-seven thousand rows of data that we read and can already work on. And since we created a view called source, we can also do one of those powerful things that I was talking about, which is switching languages. To switch languages, just type %sql, and when you do that you can start typing your SQL commands, like this: select star, grab all the rows from the view called source, and only grab the first ten rows. Hit Ctrl+Enter and see the results.
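Those two cells, roughly; the first is plain Python:

    df.count()   # returns the number of rows; the video shows roughly 127,000

and the next cell switches language with the %sql magic:

    %sql
    SELECT * FROM source LIMIT 10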
It's really cool that we were able to connect to external blob storage, use SQL to query the data, and display it on the screen. But what's more, if you're an analyst doing data analysis, you can grab your data here using 'Download CSV' and maybe pass it along to someone, or hit this button and visualize it using a bar chart, or maybe a pie chart. Of course, with the pie chart you have additional options, just like in Excel pivot tables; you can use those here and change the values, maybe change the aggregation to count, and hit apply, and this is our chart. It's pretty cool, isn't it? So that was the demo for Python. I don't feel like this demo really shows what we did, because we only used external storage and a few very quick commands and got some results, so I really want to do something more tangible, a more hands-on experience with scripting. We're going to create one more script, this time in Scala.
Let's go to the workspace, right-click, create a new notebook, call it 'demo 2', and this one's gonna be in Scala. What we're gonna do here is first upload our own data. I'm gonna upload these rows of data; there are 25 rows saved here as a JSON file, a small set of radio-listener information in JSON, and I already downloaded this file. What I need to do right now is go back to Azure, go to my resource group, and go to the storage account that I created; this is the reason we created it, because I want to have a place where I can put my own data, transform it, and put it back. So I'm gonna go to the blob service, create a new container, call it 'staging', and leave it as private, then go to staging and upload my file called small_radio_json; it uploaded successfully. Right now we have the small radio JSON file in our container, and that's all we need to do in terms of Azure, so let's go back to scripting.
This time we're writing code in Scala, and I prepared some of the code already. First of all, for this code to work we need a container name, a storage account name, and a SAS token, the same kind of token that Microsoft gave us before, except this time we're gonna generate it ourselves in a second. First we need the container name; our container is called staging, so let's copy-paste that and put it here. We need the storage account name, so let's go back and grab our storage account name. And we need the SAS token. To get a SAS token, go to the storage account, go to the 'Shared access signature' blade, and select the services that you need access to; I only need access for blobs, so I'm gonna leave it at Blob, and I'm gonna leave everything else as default. This means the storage access I'm granting is only valid until tomorrow 4:00 a.m. So I'm going to generate the SAS, copy it to the clipboard, and paste it here. It's very important to note that this is a very short-lived token, so if I ran this script tomorrow it would fail; remember that. I'm gonna hit Ctrl+Enter, which is gonna initialize those variables.
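The notebook in the video is written in Scala; a PySpark equivalent of this setup cell would look roughly like the sketch below, where the storage account name matches the one created earlier and the SAS token is just a placeholder.

    # PySpark equivalent of the Scala setup cell (account name from this demo; token is a placeholder):
    container_name = "staging"
    storage_account_name = "a4edbintro"
    sas_token = "?sv=..."   # the short-lived SAS token copied from the portal; it expires, so regenerate as needed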
I'm gonna hit plus and add a new line, and this is very important. What I'm doing right now is using Databricks utilities to mount a storage: just like you map a normal drive on your computer, on Windows or Linux, we are doing the same thing here, except instead of mapping a drive we're mapping a blob storage. This is a very, very cool feature, because using just normal Windows shell... oh sorry, Linux shell commands, you can copy data to this mount and it will land on the blob storage. I'm gonna hit Ctrl+Enter... sorry, let's rename it; it failed because I already had 'demo' mounted before, from when I was testing this scenario, so let's call it staging instead. This command takes about 23 to 24 seconds to mount, and from this point onwards we will be able to use this mount point to load our data. After about 22 seconds, our staging mount is ready.
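A rough PySpark version of that mount command (the video uses dbutils from Scala, but the call has the same shape), reusing the variables from the previous sketch:

    # Mount the "staging" container under /mnt/staging using the variables above:
    dbutils.fs.mount(
        source = "wasbs://%s@%s.blob.core.windows.net/" % (container_name, storage_account_name),
        mount_point = "/mnt/staging",
        extra_configs = {"fs.azure.sas.%s.%s.blob.core.windows.net" % (container_name, storage_account_name): sas_token}
    )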
Let's grab another line of code; this one will allow me to read from /mnt/staging/small_radio_json, and then I'm gonna display this data and hit Ctrl+Enter. As you see, it was really fast, and what's cool here is that we used it as if it were a normal file on our drive, except it was on a blob storage, and yet we were able to connect, load the data, and display it. It's very, very cool, to be honest.
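Roughly, and assuming the uploaded file kept the name small_radio_json.json:

    # Read the uploaded file straight through the mount point and show it:
    df = spark.read.json("/mnt/staging/small_radio_json.json")
    display(df)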
Next, we're gonna create another block and paste code in; here we're gonna use a select statement. This is similar to a SELECT in SQL: we're taking our data frame containing the data and selecting only a few columns, first name, last name, gender, location, and level, and then displaying that data frame to see the results. Hit Ctrl+Enter, and everything seems fine.
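A sketch of that select, with the column names as the video describes them (the exact casing in the sample file may differ):

    # Keep only the columns we care about and show the result:
    specific_columns_df = df.select("firstname", "lastname", "gender", "location", "level")
    display(specific_columns_df)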
That was the first transformation we did using Databricks, so let's do the second one by pasting this command: it takes the result that I saved in the previous step, the data frame with the specific columns, and renames one of the columns, level, to subscription_type. Let's hit Ctrl+Enter and again see the results: the column was renamed. What else can you do here? The same as previously, maybe let's create a view, called renamed; Ctrl+Enter, and the view was created.
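Roughly, in PySpark:

    # Rename "level" to "subscription_type" and expose the result as a view for SQL:
    renamed_df = specific_columns_df.withColumnRenamed("level", "subscription_type")
    display(renamed_df)
    renamed_df.createOrReplaceTempView("renamed")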
So how about we do some simple aggregations, maybe count how many rows of each subscription type there are? I'm gonna type %sql and select the count grouped by subscription type, hit Ctrl+Enter, and we got ten free subscriptions and fifteen paid ones. If you want to save this result in some data frame for later, you just encapsulate this select statement in spark.sql and assign it to an aggregate variable; hit Ctrl+Enter, and now the aggregate variable holds the aggregation that we created a second ago.
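Those two cells, roughly; the alias cnt is just an illustrative name:

    %sql
    SELECT subscription_type, COUNT(*) AS cnt FROM renamed GROUP BY subscription_type

and then capturing the same result in a data frame:

    aggregated_df = spark.sql("SELECT subscription_type, COUNT(*) AS cnt FROM renamed GROUP BY subscription_type")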
Since we have that, we can already go further and maybe save it to the blob storage. Let's grab one more line: it takes the aggregate result from the previous step and calls write, with mode overwrite, so if there are any results already existing on the blob storage, they get overwritten, and writes it out as a JSON file. That last part is the format specification; it's pretty much as easy as changing it to CSV to get a CSV file instead. Let's hit Ctrl+Enter, and it finished.
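A sketch of that write; the output folder name here is just an example:

    # Write the aggregation back to the mounted blob storage, overwriting any previous output;
    # swapping .json for .csv would produce CSV files instead.
    aggregated_df.write.mode("overwrite").json("/mnt/staging/output")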
If it did finish and we didn't mess anything up, we can go back to our blob storage, go to the overview tab, go to the blob service, open staging, and open the output. For the aggregate CSV we have a file that is empty, zero bytes, because it's just a metadata file, and then the aggregate itself, where we have some information, but it's actually a split CSV file; sorry, this one, see, this is the CSV file, ten free ones and fifteen paid ones. Why is it split? Because this is how Spark works: it splits data into partitions for more effective processing. Of course you can control that; if you want a single output file, you can always combine the data using simple commands, but I just wanted to keep this brief. What is cool is that Databricks, and Spark itself, know how to use those files and how to combine them into a single data frame. The other files that you have here are _SUCCESS, which is the status of the processing, the started file, which says when the processing started, and the committed file, which says how many files were created during this processing; as you see, it says we created two, pointing to this file and this file, and this is how Spark knows which files to pick from what was processed and how to process them further.
So if we go back to our presentation and talk about the very last thing, let's just summarize the information that we learned here. First of all, you have Azure, and in Azure, in a resource group, you have the Databricks workspace. This is the resource that surfaces some features of the service, but only very limited ones, because in reality there is a separate service with its own portal, azuredatabricks.net, with the region and the workspace again, so it's a separate portal and you do pretty much everything in that portal with regards to Databricks. What you can do there is, first of all, have workspaces, shared ones or per-user workspaces; you can manage clusters and manage jobs (job clusters are basically clusters that are created to run a job and deleted immediately after the job); you can have machine learning experiments; and many, many more. One interesting fact is that whenever you create a cluster, another resource group is created in Azure, hidden from you if you're not an administrator, and that's where your cluster resides. As you know, we had a two-node cluster, so that means there's a resource group in Azure right now which has two virtual machines in it, which is basically our cluster, and this is what you're gonna pay for: those virtual machines. And we're done. In just a couple of minutes we were able to transform data from blob storage into blob storage, and we can be sure this will scale up to gigabytes or even petabytes of data, because Azure Databricks is Spark based, so it's a big data technology. That's it for today. If you liked this video, like it; if you really liked it, leave a comment or subscribe to see more, and see you next time!