Azure Databricks Tutorial | Azure Databricks for Beginners | Intellipaat

Video Statistics and Information

Captions
[Music] Hello everyone, welcome to this session on Azure Databricks. Azure Databricks is an Apache Spark based big data analytics service from Microsoft Azure, and in this session we are going to see how to use it to solve data science and data engineering problems. Before we start, if you haven't already, please subscribe to the Intellipaat channel and click on the bell icon so you never miss an update from us.

Let's go over the agenda. First we will understand what Azure Databricks is and why we actually need it. Then we will look at how Azure Databricks works and at the various Databricks utilities you can use to implement your big data workloads. Finally, we will integrate Azure Databricks with Azure Blob Storage and implement a small project in the hands-on section.

What is Azure Databricks? Before we get to Azure Databricks we need to understand Databricks itself, because they are two separate things. Databricks was founded by the original creators of Apache Spark and was developed as a web-based platform for running Spark. Normally you would run Spark on a locally hosted cluster; with the Databricks software you interact with the Spark framework through a web-based UI. It provides automated cluster management and IPython-style notebooks. Cluster management is completely automated, so you do not need to manage any aspect of the cluster manually, and the notebooks will feel familiar if you have used Jupyter notebooks in data science or data analytics work. They come pre-installed with Databricks: you create a cluster, launch a notebook on top of it, and whatever code you execute in that notebook uses the full capacity of the cluster as long as you use the Spark libraries. So the two features to focus on are automated cluster management and coding notebooks.

Now that we know what Databricks is, we can move on to Azure Databricks. When Databricks is offered as a cloud service, in this case on Microsoft Azure, it is called Azure Databricks. It is the jointly developed data and AI cloud service from Microsoft and Databricks for data analytics, data science, data engineering and machine learning. Essentially, for any data-related activity, especially with big data, you can use Azure Databricks: combine Databricks with the Microsoft cloud platform and you get the Azure Databricks service.

Looking at the architecture Azure Databricks follows: because it is a cloud service, it lets you set up and use a cluster of Azure instances, or Azure virtual machines, all combined together with Apache Spark installed on top of them.
The cluster functions much like any big data cluster, following a master-worker dynamic similar to a locally installed Hadoop or Spark cluster. In the diagram, on the left-hand side we have the remote client: the device we use to access the Azure Databricks service. From there we initiate remote access to the Azure Databricks console, say through a web browser, and that is where all of our interaction happens. We never see the actual servers and we don't deal with any of the complexity of maintaining the instances or nodes of the cluster; Azure Databricks manages all of that for us. When we send commands to the Azure Databricks service, it forwards them to our Spark cluster and our use cases get resolved. The main advantage of a cloud service is abstraction, which is why the cluster is shown with the "invisible" icon in the diagram: we never deal with the cluster directly.

Why Azure Databricks? Why use it rather than any other means of analyzing big data? Since Azure Databricks is a Microsoft-managed cloud service, it has several advantages over traditional Spark clusters, whether locally hosted Spark clusters or locally hosted Databricks clusters. Some of the benefits:

Optimized Spark engine: data processing with auto-scaling, and Spark optimized for up to 50x performance gains. If you host a Spark cluster in your own office or server room, you don't get features like auto-scaling or engine optimization automatically; you have to adjust many things manually and add or remove nodes from the cluster yourself. With a managed Microsoft cloud service, the cluster can scale up or down depending on the workload you are currently running, which reduces cost and matches the cluster hardware to the complexity of the job at hand.

Machine learning: pre-configured environments with frameworks such as PyTorch, TensorFlow and scikit-learn already installed. We will come back to this with the Databricks utilities, but the point is that you get these libraries easily instead of downloading and installing them into your environment manually. There is also MLflow: track and share experiments, reproduce runs, and manage models collaboratively from a central repository.
MLflow helps you collaborate with the teammates in your data analytics or data science team: you can share experiments, reproduce runs with the models that have been created, manage those models, and push and pull them to a central repository so your teammates can access them as well.

Choice of language is one of the best features of Databricks in general, not just Azure Databricks. When you launch a notebook after setting up the cluster, you can start it in your preferred language, such as Scala, R, Spark SQL or .NET, whether you use serverless or provisioned compute resources. The best thing about this is that you can get started with your use case very quickly: you are not doing a lot of setup or administrative work, you are only dealing with the use case itself, whether that is analyzing, processing or visualizing data.

Collaborative notebooks: quickly access and explore data, find and share new insights, and build models collaboratively with the languages and tools of your choice. Notebooks can simply be shared; your teammates can read your code, execute it, and modify or optimize it, all inside the cloud environment, without anything being shared manually.

Delta Lake: bring data reliability and scalability to your existing data lake with an open-source transactional storage layer designed for the full data lifecycle. When you use Databricks with Delta Lake, your data can already be indexed for faster transactions.

Integration with other Azure services: as with AWS and Google Cloud, Azure offers many other services, and when you create your Azure Databricks cluster and notebook you can integrate with them, for example Azure Data Factory, Azure Data Lake Storage, Azure Machine Learning for complex machine learning models, and Power BI for business intelligence.

Interactive workspaces: easy and seamless coordination between data analysts, data scientists, data engineers and business analysts. People from a data background, a coding background or a business analyst background can collaborate smoothly because the workspaces are interactive and accessible to one another.

Enterprise-grade security: another good thing about cloud services is that you get security by default.
The providers are legally liable for any data breach, data loss or attack, because we are paying them to protect our data, so enterprise-grade security is already available with an Azure Databricks workspace. The native security provided by Microsoft Azure ensures protection of data within the storage services and private workspaces, whether you are using an Azure storage service or the workspaces in Databricks itself.

Production ready: you can easily run, implement and monitor heavy, data-oriented jobs. When you run big data jobs on your notebook or cluster, a managed service like Azure Databricks lets you track the job-related statistics very easily.

How does Azure Databricks work? Using it is actually not that hard, because you only deal with a few UI elements on the Azure portal: you enter some details and, based on those, your cluster and workspace are started. We will see that in the hands-on section, but first we need to cover Databricks utilities (dbutils). This is a Databricks-exclusive set of commands that comes pre-installed with your Databricks setup, outside of your normal Python or Scala Spark setup, used for things like installing libraries and packages or integrating other services. For example, if a file lives in a storage service in your Microsoft Azure account and you need to import it, you cannot do that directly from plain Python with Spark; you need something extra attached to it, and that is where dbutils comes in. dbutils helps us perform a variety of powerful tasks, including efficient object storage access, chaining notebooks together (if you have multiple IPython notebooks you can chain them and run their functionality in a sequence), and working with secrets. Secrets are authentication material: passwords, certificates, or keys in general, such as public and private key pairs. You configure these through dbutils so you can authenticate to another Microsoft Azure service and import your data from there, which we will see in the hands-on. One thing to note is that dbutils is not supported outside notebooks: you can only use it once you have launched a Python or Scala notebook on the Spark cluster you created. A reconstructed sketch of a few dbutils calls follows below.
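To make the dbutils points above concrete, here is a minimal Scala sketch of the kinds of tasks mentioned: discovering the utilities, reading an authentication secret, and chaining notebooks. The secret scope name, key name, notebook path and parameter are hypothetical placeholders, not values from the session.

    // Show the utilities available in this notebook environment
    dbutils.help()

    // Read an authentication secret (e.g. a storage key) from a secret scope;
    // "demo-scope" and "storage-key" are made-up names for illustration
    val storageKey = dbutils.secrets.get(scope = "demo-scope", key = "storage-key")

    // Chain another notebook: run it with a 60-second timeout and one parameter
    dbutils.notebook.run("/Users/me/cleanup-notebook", 60, Map("date" -> "2020-09-06"))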
Integrating Azure Databricks with Azure Blob Storage. Azure Blob Storage and Azure Databricks are two separate services provided by Microsoft Azure, but as long as you use the same resource group you can combine them. Why would you want to? Microsoft Azure provides a multitude of services, and it is often beneficial to combine several of them to approach a use case: the advantage is that you don't need to engage your local hardware at all. For example, I am using my laptop to go over this presentation; maybe I want to keep it clean, maybe it has a modest hardware configuration, maybe I don't have enough disk space. In that case I would rather have cloud services handle all my use cases: an Azure storage service for my big data storage needs, and Azure Databricks to process that big data, so everything is solved by the online services Microsoft Azure provides.

If you want to integrate two services such as Databricks and Blob Storage, the workflow or architecture looks like this. You, the user, interact with a coding notebook, an IPython/Jupyter-style notebook that Azure Databricks creates for you. You interact with the Azure Databricks console, launch the coding notebook, and type commands into it; those commands are sent to the Databricks service, which sends them to your Azure cluster, the one created under your Databricks workspace and on which the notebook runs. Then, depending on the authentication the cluster has been given for your Blob Storage account, the authentication details are sent to the Blob Storage service, the data is fetched from the desired directory and brought back inside the cluster, the data is processed, and you see the output in your coding notebook. It works both ways: you can upload data to your Azure storage service first, access it from your notebook through Azure Databricks, process it and get your results, and then store the output back into your Blob Storage space. All of this is handled easily and integrated seamlessly.

Now that we have discussed the architecture, let's implement it in the hands-on section, where we will integrate Azure Databricks with another Azure service, Azure Blob Storage. Go to the Azure Databricks service on your Azure portal, click on it, and click on Add. You will get a set of options: the subscription will be pay-as-you-go; create a new resource group, and be sure to remember its name. I could type something like "databricks-resource-group", but for now let's use the "databricks practice" resource group that has already been created. Name your workspace anything you like, say "databricks workspace one", and select the location you prefer; it is general practice to select the location closest to you, in this case West India. Then, in the pricing tier, go for the trial tier.
Since you are only practicing and not implementing any organizational workloads, the trial version is enough. Click on Review and create, check that everything you selected is correct, and click on Create. Once the deployment is underway, wait until it is complete, then go to the resource and launch the workspace you need. Clicking on Launch Workspace signs you into the Databricks portal, and from there you have the familiar options: create a cluster, create a notebook, and get started with Databricks.

A participant mentions being confused with an AWS service. Note that in Databricks we don't actually use HDFS; we use DBFS, the Databricks File System. In practice it behaves much like HDFS, except that DBFS has a networked element that lets it work across cloud instances. There is no Hadoop involved here, and Hadoop is not required. The participant asks how that is possible, since they had learned Spark could only use Mesos or Hadoop. In fact, Spark is not dependent on Hadoop; it can use any distributed file system, there is no compatibility limitation, and that is part of why Spark is so flexible. You can use the local file system, an online file system, or a distributed file system; you have probably used local files yourself while practicing Spark. The participant adds that in production the data has to be distributed to all nodes, which they thought required Mesos or Hadoop. Here it works the same way: once we ingest the data, it is automatically divided into parts and replicated across the nodes, depending on the configuration, and then we process it with our commands (a quick sketch of looking at DBFS from a notebook follows below).
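As a quick illustration of the DBFS point, this is a small Scala snippet you could run in a notebook cell, assuming a running cluster is attached; the paths it prints are simply whatever happens to exist in your own workspace.

    // List the DBFS root; dbutils.fs.ls returns file metadata with path, name and size
    dbutils.fs.ls("/").foreach(f => println(s"${f.path}  (${f.size} bytes)"))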
So our deployment is complete. Click on Go to resource, then on Launch Workspace; this logs us into the Databricks platform, a conveniently wrapped-up console where a lot of options will look familiar. We could create a new notebook, but not yet, because we haven't created a cluster and we need something to run the notebook on. So the first thing to do is create a new cluster and specify its details. Let's call the cluster "databricks session". The cluster mode can be High Concurrency, which is meant for many people using the cluster at the same time: the scheduling is optimized so many jobs can share the available resources. If you are only using it for practice, or your company's jobs are infrequent, you would use Standard mode. The Pool option keeps a defined number of ready instances on standby to reduce cluster startup time; it essentially pre-starts resources so the cluster comes up faster. Next we pick the Databricks runtime, that is, the Scala and Spark version; there are many options, and we will use the stable one, 7.2. We can also enable autoscaling: as discussed in the slides, we specify the minimum and maximum number of workers. For the least demanding jobs the cluster uses the minimum, and as the workload grows it scales somewhere between, say, two and eight workers, or whatever limits we set, with the price increasing accordingly. If you are only practicing you don't need autoscaling; disable it and specify a single worker so you pay practically nothing. You can also have the cluster terminate after a set number of minutes of inactivity. If you are not running any commands and have forgotten that the resource is running, you don't want to be charged: even though Azure doesn't charge compute when idle, every instance in the cluster reserves some storage, and that storage has a nominal cost. With one worker it is negligible, but with 150 workers it adds up. So specify the minutes of inactivity after which the cluster automatically terminates, for example 60 or 120 minutes. Then choose the instance type; as in AWS, there are many instance types in Azure. These are the types for the workers, and the driver is basically the master. A participant asks whether the listed memory is what gets allocated: yes, memory means RAM, and you can also see how many cores each type has. Let's go with the default worker type and set the driver to "same as worker", the bare minimum, the weakest possible instance. If you have a heavier workload you can pick high-memory instances, up to 56 GB of RAM, or 256 GB and more if you click More, although some of these are not available in the West India region because that hardware isn't in Microsoft's Mumbai data center; you can opt for them in another region. Now let's check everything once and click on Create.
This will also take some time to deploy, so we wait. The cluster is currently in a pending state because it is being set up: the nodes are being initialized, Spark is being installed on them, and the nodes are being added to the same cluster. All of that happens automatically in the background on Microsoft Azure's servers, and once the resources have been reserved and everything is installed, the state changes to Running and we can use the cluster for our workloads.

A participant asks whether pay-as-you-go means there is no monthly subscription charge and you pay only for what you use. Yes: the cost depends on the workload, on the number of nodes you use, and on the quantity of storage you reserve for the cluster. That is one of the key advantages of cloud services. If you are a business owner with big data needs, you have two options: with cloud services you only pay for what you actually use, but if you own or rent hardware locally in your own office you pay for the whole thing, plus electricity, maintenance, and a server administrator to maintain the server, add new devices or hardware, and upgrade it. All of that is managed by Azure in this case.

The cluster is now in the running state and has two nodes, the driver node and a worker. We can now go back to Azure Databricks and create a notebook; it asks which cluster we want to use, and we select the cluster we just created, which appears in the options. We also select the language we want to start the notebook in; I'll select Scala. Name the notebook anything, say "databricks notebook", and click on Create. It starts almost immediately and works just like a Jupyter notebook: you press Shift+Enter to run each command, and you can add new code blocks as you go.

Now I will show you how to upload files and integrate this notebook with a storage platform, a different Azure service. We are going to use Azure Blob Storage, which is like Amazon S3: Azure provides a storage service through Azure storage accounts, on which you can create data containers and then use them across your various Azure services, not just with Databricks. So let's go to the Azure dashboard again, since this part is not done inside Databricks. (There is a UI element from the screen-sharing tool covering my tabs, so let me open the portal again.) Back in the portal, from where all the services can be launched, click on Storage accounts, because we will need to store something. There are accounts here already created by peers, some of them blob storages, created for whatever everyone else is doing.
You can create your own: click on Add. As discussed before, the pay-as-you-go subscription stays as it is, since that is the only option, and you don't have to deal with any permission issues: when you drop down the resource group menu, just pick the resource group you created earlier, and all the permissions are already tied together. You don't have to create some sort of IAM role like you would in AWS to integrate the services. So select the databricks practice resource group, or whatever it was called. You can name the storage account anything you like; let's call it "databricks session storage". For the location, it is better to keep it the same for everything, for latency reasons. You can keep it different if you want, but it will take longer: if you are analyzing a lot of data and need immediate, low-latency transfers, having one service in the US and another in India adds roughly a couple of seconds to every request and response, which is not a good situation. For account kind you can choose anything; we will use General Purpose v2. Payment is not an issue here because it depends entirely on how much data you upload, and I am only going to upload about a megabyte, which costs essentially nothing. For the supported replication, set it to locally redundant. The difference between geo-redundant and locally redundant storage is that with geo-redundant storage the data is replicated across regions, for example one copy in West India (Mumbai) and another in a US region, which the storage account manages automatically. Geo-redundancy is for data you cannot afford to lose, or data being accessed from many places; that is not our case, so we use locally redundant storage, where the data is still replicated, but within the same server space, roughly three copies in Mumbai itself. For the access tier: the account access tier is the default tier inferred by any blob without an explicitly set tier; the Hot tier is ideally for frequently accessed data and the Cool tier is for data you don't access frequently. Again, not a pricing concern for our use case. Click on Review and create, review all of the details, and create. This creates our storage account; next we have to create a container within it, so we wait for the deployment. A participant asks whether those settings, for example Hot versus Cool, can be changed later: yes. Say your organization does data analytics but not in real time, not on a minute-to-minute basis, only on a routine schedule.
Say the tool or code you have set up checks for the data every six hours; then your storage service could simply use the Cool tier, and the request and response pricing is adjusted accordingly. The deployment is now complete and we can go to the resource. All of the permissions are managed automatically just by using the same resource group, so we don't have to worry about that. Now we create our container, the place where we will store the data, similar to how we would store it in Amazon S3. With the resource open, click on Containers, then on + Container to add a new one. We can name it anything; let's call it "spark assignment" and click on Create.

Let's go back to the presentation for a moment to cover these topics again. With Databricks utilities, everything else is virtually the same as using plain Spark; the only new thing is dbutils. These are additional utilities, not exactly libraries, provided by Databricks itself rather than Microsoft, essentially a utility for operating within a cloud environment. They help us perform a variety of powerful tasks, including efficient object storage and chaining notebooks together: if you have commands spread across several notebooks that you want to chain for a particular use case, you can do that. At the bottom of the slide you can see commands like installing PyTorch or scikit-learn: you write that one command and it installs the library into your workspace. Since you would not normally open Apache Spark and type installation commands inside your Spark code (you generally write only what you have to implement for the data), you need something that helps you configure your workspace, and that is what dbutils is for. I hope you get the idea. Again, these are not supported outside notebooks; you cannot use them over PuTTY, for example, as you may have noticed if you have used Spark from a terminal. They are specifically for notebooks. dbutils is available on three platforms: Python, Scala and R. The advantage of creating Python or Scala notebooks is that you can also use the dbutils file system utility to manipulate the file system of your cluster nodes; with R you don't get the file system manipulation and can only use the other features, such as installing particular packages. R is a more limited language, not much of a development language.

Now let's integrate. Go back to the web page: our container is created, so click on the container and click on Upload. This opens my local Windows file system and I can select the file I want; on my desktop I have the file yellow.csv, a CSV file. I add it to the files and click on Upload.
It is a 1.44 MB file, so it uploads almost instantly. You can upload larger files as well, but uploading extremely large amounts of data will cost you more; that is how the pricing works, and a 1.4 MB file comes in essentially free. So now we have the data in place and the notebook in place; we need to integrate the two, so let's go back to the notebook.

A participant asks about the hierarchy: is the workspace at the top level, with the container under the workspace? Let me explain how it works. On one side you have the Azure Databricks service with the coding notebook, and on the other side you have your Blob Storage container; they are separate services (I actually have a diagram for this). Azure has a multitude of services and it is often beneficial to combine several of them for your use case. You, the user, have direct access to two things. First, we opened a separate service, the Blob Storage container, and uploaded our file, yellow.csv; up to there we are interacting with that service directly. Second, in Azure Databricks we created our coding notebook, and that is the other thing we interact with directly. Everything else is not our concern: it happens in the background. The coding notebook interacts with the Databricks service, and the Databricks service interacts with the cluster we created, according to the instructions we give in the notebook. If the instruction says "take that data from this location in Blob Storage", it is parsed and processed, probably by the driver node, which asks Blob Storage for the data; since the permissions are granted (everything is in the same resource group) there are no permission issues, Blob Storage simply grants access, and the data arrives in our cluster. The cluster then gives the response back to Databricks, and just as every executed code block in a Jupyter notebook produces some output, we get a message back confirming the data has been received and loaded into an RDD or whatever structure we created.

The participant asks where the logical workspace boundary lies: can one workspace access many different storage containers? Yes, definitely. This is just one Blob Storage container; you can have other containers on top of it as well.
You can access each of them separately from the coding notebook, as long as they are in the same resource group as our cluster. The resource group binds everything together; otherwise you would have to do a lot of configuration to get past the authentication. Data security is a prime concern for these big organizations, so it is essential that the permissions match. The participant asks whether you would then create more than one workspace, for example one workspace per client. That is not an issue; it is one way to do it. Say you are working as a data analyst rather than a data scientist: it may be that you do not have access to the data scientists' workspace, while the data scientists have access to yours. These things can be configured, and multiple workspaces can be created; a data analyst workspace would only contain notebooks related to data analytics, not to machine learning, AI or data science. So yes, the workspace is essentially for your code, a bit like an Eclipse workspace. You could also go about it differently; it depends entirely on the organization. For example, one group might contain the data scientists, data engineers and data analysts together: you create a single cluster, just as we did, and the data scientist creates their own notebook and the data analyst creates their own notebook, but they use the same cluster. That is another way to implement it. So the architecture is clear.

Now we will integrate, and this part is a little bit tricky. The first thing we have to do is generate a token so that we can access this data for a particular amount of time within the same resource group. The container name and the storage account name we don't have to worry about; they are just string variables, as I'll show you in a moment. I'll paste a template into the notebook: this is not something you have to memorize, it is something that will be provided to you, so just copy it and paste it in. It is written in Scala, and we have to replace a few values. First the container name: we refer to the storage account we created, the databricks session one. Open the portal, go to Storage accounts, go to our recently created storage account, then to Containers, where we have our "spark assignment" container. That is the container name; copy it and replace it in the template. We have now created a normal string variable, containerName, holding the name "spark assignment". The resource group also helps here in that two resource groups will not produce a name conflict: if somebody else working with me on the same organizational Azure portal is also using the container name "spark assignment", we will not conflict, because we are operating within different resource groups.
I hope that's clear. Next, the storage account name: going back to the portal for a moment, the storage account name is "databricks session storage". Go back to the notebook and replace the storage account name with that. Now the SAS token, which is where a little bit of technicality comes in. Open the databricks session storage account and go to Shared access signature, an option present in every storage account. Here we can generate a token that is valid for a limited amount of time, not forever: if I can access the data right now, I will not be able to access it after the expiry time, and we can set the start and expiry times ourselves, obviously for security reasons. We allow the resource types Service, Container and Object, and then generate our SAS token. Select the whole token, copy it, and paste it into the notebook as the SAS token value. SAS tokens are not only for Databricks: if another service needs to access data in a storage account directly through code rather than manually through the UI, it may also require a SAS token. The point is to avoid arbitrary automated scripts running against your data, which is why tokens with a validity period are involved. I had a slight doubt about the validity, but this token is valid for one day, starting on the first and expiring on the second of August, so it is fine. The rest of the template remains the same: it creates new string variables by appending these strings together, giving us the URL and the configuration key. Then we call dbutils.fs.mount, and this is where dbutils comes in, because this is not something plain Spark would do for you; Databricks steps in and helps you get your data from some other service or storage location. We provide the source as the URL and the mount point as "/mnt/staging", which is a string we create ourselves: we are assigning this string to that particular storage container, so that whenever we use the string we are actually referring to the container. You can put in any string value you want there. Then we specify the configurations, mapping the config key to the SAS token we just created. All of this is basically template code to copy, paste and fill in rather than memorize. Then press Shift+Enter. A reconstructed sketch of the whole cell follows below.
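Here is a reconstructed sketch, in Scala, of the mounting cell described above. The container name, storage account name, SAS token and mount point are placeholders standing in for the values created during the session; the wasbs URL and the fs.azure.sas.<container>.<account> configuration key follow the usual pattern for mounting Blob Storage with a SAS token.

    // Values filled in from the portal (placeholders here)
    val containerName = "sparkassignment"
    val storageAccountName = "databrickssessionstorage"
    val sasToken = "?sv=...generated-shared-access-signature..."

    // Build the wasbs URL and the SAS configuration key for this container/account pair
    val url = s"wasbs://${containerName}@${storageAccountName}.blob.core.windows.net/"
    val config = s"fs.azure.sas.${containerName}.${storageAccountName}.blob.core.windows.net"

    // Mount the container so the notebook can refer to it through /mnt/staging
    dbutils.fs.mount(
      source = url,
      mountPoint = "/mnt/staging",
      extraConfigs = Map(config -> sasToken)
    )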
When you run the cell, the code first communicates with Databricks, as shown in the diagram, and then with the cluster, where it executes, and once everything is set up we get our output message here, just as in a Jupyter notebook. A participant asks whether we just executed only those 13 lines: yes, and we are only attaching to the container where our file lives, not the whole storage account. Once that is done, the rest becomes quite simple and familiar. The command runs one Spark job, and once it finishes we have successfully assigned a particular address to our data container location; all of the details appear in the output.

Now we can start. Let me check the file name I forgot: it is yellow.csv. We create a value with val; it works like a variable in Python, except that in Python you don't write anything before your variables, whereas here you write val, and the data type is set according to the value assigned, just as in Python. We set it equal to spark.read, something you might remember, then give the format, which in this case is CSV, and then the other options: we want the headers to be there, we set inferSchema to true because we are going to run Spark SQL on it, and for strictness on malformed records we add the FAILFAST mode. Then we give the location: the only thing to notice here is the mount point address, so copy "/mnt/staging" and append "yellow.csv". Press Shift+Enter; the first attempt fails because the method is option, not options, and one of the other advantages of a notebook is that you can simply go to the cell, remove the extra "s", and run it again without rewriting anything. Now we have created our DataFrame, so let's look at it with df.show, limited to four rows. The data is there; it is quite wide for the screen, which is why it wraps, but you can see it is online taxi service trip data, with the pickup time, drop-off time, passenger count, trip distance and so on. Using this DataFrame we create a temporary view out of it, something you might be familiar with: at this point we are done with everything that is specific to Databricks, and the rest is just simple coding in Scala or Python or whatever you want to use. Let us name the temporary view "taxi_data" so we can refer to it from Spark SQL, and press Shift+Enter. A sketch of these cells follows below.
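A reconstructed sketch of the read-and-register cells described above, in Scala. The options and the view name come from the session; the exact contents of yellow.csv are, of course, whatever was uploaded.

    // Read the uploaded CSV through the mount point created earlier
    val df = spark.read
      .format("csv")
      .option("header", "true")       // first line holds the column names
      .option("inferSchema", "true")  // let Spark infer column types for SQL use
      .option("mode", "FAILFAST")     // fail immediately on malformed records
      .load("/mnt/staging/yellow.csv")

    // Peek at a few rows, then expose the DataFrame to Spark SQL as a temporary view
    df.show(4)
    df.createOrReplaceTempView("taxi_data")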
With that done (we can also click on the play button on the cell instead of pressing Shift+Enter; it takes no time either way), say we want to find the total number of trips: we just count the rows with spark.sql, writing a SELECT with COUNT of all rows FROM taxi_data, since that is what we named the temporary view, and add .show. Shift+Enter gives us the information we need: about 10,000 records. In the same way, you can write anything with spark.sql; the reason I am using Spark SQL is that you may not be familiar with the built-in Scala functions, and Spark SQL gives us a common platform to communicate in. For example, to find the total revenue generated by all of the trips, we can SELECT the SUM of the relevant column; looking at the column names, that column is total_amount.

If you want to write some data back into the storage service, there is another method for that: since we already created our mount location, we can use it to write data back as well. The way to do that is to use the write method with the overwrite mode: if something with the same file name already exists, it will be overwritten. This is usually done when you have a routine procedure to follow, say you constantly replace the data with new results on a daily basis because the statistics change every day for your users; if you don't specify overwrite, the write will throw an error because the file name already exists. It depends on the use case entirely: alternatively you could append an increasing counter to the file name through string concatenation, so the file name is always different and you don't need overwrite at all. My first attempt used an aggregate call that isn't actually a function, so after fixing that we go to Containers, then to the spark assignment container, and we have our output right there. Looking at the folder structure, it is under mnt/staging, exactly what we wanted, and it feels like Hadoop: Spark is generally used for big data cases, not for 1 MB files, so when you have big files they are divided into parts, and that is what you get when you open the output (a sketch of these last cells follows below). I think that was about it.
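For reference, a reconstructed sketch of the final queries and the write-back described above, in Scala. The column name total_amount comes from the session; the output sub-folder name under the mount point is a made-up placeholder.

    // Total number of trips in the dataset (about 10,000 in the session)
    spark.sql("SELECT COUNT(*) FROM taxi_data").show()

    // Total revenue across all trips
    val revenue = spark.sql("SELECT SUM(total_amount) AS total_revenue FROM taxi_data")
    revenue.show()

    // Write the result back to Blob Storage through the same mount point;
    // "overwrite" replaces any previous output written under the same name
    revenue.write
      .mode("overwrite")
      .csv("/mnt/staging/total_revenue_output")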
Info
Channel: Intellipaat
Views: 48,048
Rating: 4.8025751 out of 5
Keywords: azure databricks tutorial, azure databricks for beginners, azure databricks, learn azure databricks, what is azure databricks, azure databricks training, azure databricks course, microsoft azure databricks, microsoft azure databricks tutorial, azure training, azure course, intellipaat azure
Id: TScSdIJ-_Oo
Length: 56min 51sec (3411 seconds)
Published: Sun Sep 06 2020