Azure Databricks Virtual Network Integration & Firewall Rules

Captions
[Music]

Hi everyone, and welcome to the second episode, the second video about Azure Databricks. Today we will talk about Azure Databricks networking, specifically integrating Azure Databricks into an existing network. First I'm going to go through the whiteboard, where I have this diagram from the documentation, but I want to discuss first what we are trying to integrate.

The integration happens when you create a workspace. A workspace, as you know, is just a place for authoring and collaboration between the team. For any code in this workspace to run, the workspace needs to be attached to a cluster: you create a cluster, you start the cluster, and then you run your code inside it. This cluster is a cluster of virtual machines, and these virtual machines will be inside the VNet. That's what we are integrating here.

Before this feature was released, when you created a workspace, the workspace creation created a managed resource group, and the managed resource group had three resources: one is a storage account that acts as your DBFS, the second is a VNet, and the third is a network security group, or NSG. With VNet integration, these last two are not created, and the workspace is connected to a VNet that already exists in your environment.

This is critical, because we need to understand that the web interface for Databricks and the cluster management, basically the whole control plane, do not exist in your VNet. Once you create the workspace, your VNet will not have anything in it. Once you log in to your workspace and create your first cluster, that cluster will exist, and the network cards for this cluster will be attached to your VNet. But the Databricks web interface will not be inside your own VNet. If you recall, if you logged in before to the web interface for Databricks, you
will find that the first part, the subdomain, is the Azure location, the Azure region, let's say Canada Central, and the domain name is azuredatabricks.net. So it's always the region followed by azuredatabricks.net. This is the web interface, and all of the control plane is installed inside a Microsoft-managed subscription, inside a VNet controlled by Microsoft. Only the data plane, your own cluster, is deployed inside your VNet.

Now for the requirements. The documentation lists some requirements, but I'm going to go through the important ones. You need to have two subnets, and once these two subnets are allocated to a workspace, you can't use them for another workspace. So if you need two different workspaces, you need four different subnets. Does this mean that the workspace has exclusive access and you can't have any other machines inside these subnets? No, you can. These subnets can have resources other than Databricks workspaces. So it's two subnets per workspace; however, these subnets can have other machines attached to them.

Now, how is it done? As part of the installation, the workspace gets access to your subnets. Basically you are doing a delegation from you to the workspace, so the Azure Databricks workspace resource provider will manage these subnets on your behalf. If you already have an NSG attached to these subnets, the resource provider will add new rules inside this NSG. If you don't have one and you are using the portal or running your ARM template, one will be created automatically; the portal experience takes care of this, creates it, and then adds the rules. If you already have your own attached, then the rules will be added automatically for you. And for the workspace to maintain this situation, so that nothing will
happen that will put the workspace configuration at risk, the workspace creates a policy on the subnet, a virtual network intent policy, and this intent policy makes sure that no one is able to change the rules created on the NSG. That's how it's done.

Routing, peering, and everything else is allowed. So this is your VNet; this VNet can be peered to another VNet, and you can have user-defined routing. However, be careful with user-defined routing. Of the two subnets, one will have the public IPs and the other will have private IPs only. The public IPs are exposed outside your VNet, and the communication from the control plane to your cluster goes through these public IPs. So if you do routing, you need to make sure that you add exceptions for the control plane. When traffic comes in directly to the public IP and you have routing, meaning you have a firewall here for example, all the return traffic will go first to the firewall, and the firewall will return it. This is asymmetric routing, and it will not leave your cluster healthy. You can do it; however, you will find that your cluster always fails when you are provisioning a new cluster.

That's a very brief overview on the whiteboard, and now we will have the demos for the details.

In this first demo we'll see how to use the portal to create a new Databricks workspace and associate it with a VNet. First let's explore the VNet that we have created. I created a very small VNet; this is a small size that you can have for Databricks. The whole VNet is a /24, and it has two subnets. One of them is called private and it's a /25, which means it has only 123 available IPs. The second one is called public. By the way, when we look here, you see there is no delegation, there is
no security group attached to them. Look at the details: no NSG, no route table, and no delegation. This is the initial state.

Now let's go to the creation. I go to my resource group and create a Databricks workspace. I use the existing resource group, the location is Canada Central, the tier is Premium, and with this option I'm going to deploy it inside a VNet, although we know that this is only the data plane, not the control plane. I'm going to choose my Databricks test VNet; the public subnet is called public and the private subnet is called private. For the IP ranges, the public one starts at .128 and this one is .0/25. That's all we need, and we could create the workspace right now. However, let's inspect the automation options, which show us the ARM template that will be used.

In the ARM template you will see we have three resources to be created. One is an NSG. The second one is a nested template, and in it we are just giving the delegation permission to Microsoft.Databricks/workspaces. This is the service name that we are giving the delegation to, so from now on this service will be able to make changes inside my subnets. So this is the NSG, this is the delegation, and the last part is the Databricks workspace itself. OK, let's go back and I'm going to create it.

[Music]

The deployment has succeeded now, so let's go to the resource. You see here Databricks, and this is the workspace. I see the managed resource group; its name is auto-created from my workspace name. I see the virtual network here as well, so let's go to the virtual network again and look at the subnets. Now we see some changes here. I see that there is a delegation to Microsoft.Databricks/workspaces, so I'm delegating the management of the subnet to the service. This is on the subnet level only, not on the whole virtual network. And I see there is one network security group created. Although
these are two different subnets, where the private one has only the network cards with private IPs for the cluster and the public one has the public IPs, the network security group rules are the same for both of them. That's why the creation makes only one NSG, with the same rules, for both of them. Let's check these rules.

So here is the network security group, and these are the rules created by Databricks. All of them start with Microsoft.Databricks-workspaces and then something that describes the rule. Let's look first at the inbound security rules. The first one is actually a duplication of the built-in system rule that allows all communication between all the network cards inside the same VNet: any protocol, any port, from source virtual network to virtual network, allowed. This is not new; it is already built into any virtual network. The new ones are these two. This is the traffic coming from the control plane to the Databricks clusters to control them.

If you try to delete this one, for example, you'll see that it fails, and the failure is because of the network intent policy that was created to make sure that no one can tamper with these rules to the point that Databricks clusters stop functioning. If you want to see this intent policy, go to the resource group. I'm going to group by type and show the hidden types, which are hidden by default, and scroll down here, and here you go: Microsoft.Network/networkIntentPolicies. These are the intent policies created for my Databricks workspace. If you go here you can see their details. Intent policies are not documented publicly, and no customer can use an intent policy for themselves. They are used by some services to make sure that the network configuration is not changed, and from
my experience I know this is used in Databricks and in Azure SQL Managed Instance as well. It's used by the service itself to make sure that no one can change these rules. When you delete the workspace, this intent policy is deleted, and then you can delete the NSG, for example, or remove any rules from this NSG if you would like.

That's the first demo. In the second demo I'm going to show you where to see a cluster when we create one. I'm going to go back to my Databricks workspace and to the managed resource group. In the managed resource group in this case we have only a storage account. If you refer back to the previous video, where we created a Databricks workspace without VNet integration, we used to see two more resources here, the VNet and the NSG. Because we are using our own VNet, the creation process does not create these two resources. We don't see any other resources, and we don't see any clusters yet. We will see the cluster once we get back to Databricks and log in to Databricks itself.

This is the Databricks web interface, and as you can see it's always the region, then azuredatabricks.net. That means it's not inside your VNet, and you cannot control access to it other than by Azure AD Conditional Access. I see customers who are trying to limit access to Databricks to their own offices only, so no data scientist or data analyst can access Databricks from outside the office. This can be done by using Azure AD Conditional Access. We can discuss this in a later video, but for now we see the Databricks interface and the clusters here. I'm going to create a quick cluster just to demonstrate this, and we'll see what is created, the network of this cluster, and which network cards are connected to which subnet. Refreshing
here, here you go. So, create a cluster and give it a name. I'm going to accept everything: minimum workers are two, and I have a driver as well, so we'll have three machines initially created, one as a driver and two as workers. Then I create this cluster.

I paused the video for roughly three or four minutes until the cluster was created. Now that everything is created and ready for me, I will get back to the Azure portal. This is the Databricks blade, and I will go to the managed resource group. In the managed resource group we now see more resources. Let's group them by type and look at them carefully. The first part is the disks, which are attached to the machines. We have network interfaces, and we'll come to them in a moment. Public IPs: I have three public IPs, so it seems there is a public IP per machine; every single machine has its own public IP. The storage account is the storage account that was created before. And we have three machines, as expected: two workers and one driver. All the machine names, as you can see, are randomly generated digits and characters, and Databricks has the metadata for these so that it can communicate with them.

For the network interfaces, you will see the name of the machine followed by private NIC or public NIC. All the private NICs are attached to the private subnet. If I go to my virtual network, I see all the IPs attached to this virtual network. All of the devices are network interfaces, because these are typical virtual machines, and you will see we have three attached to the private subnet and three attached to the public subnet. The three on the private subnet are mainly used for the interconnectivity traffic within the cluster itself, and the public interfaces have a private IP and also a public IP. So when you go to one of these, this is a network interface, and I go to the IP configurations, I will
see here there is a private IP and there is also a public IP. This public IP is used for the communication between the control plane and the data plane, your cluster. This is important, because if you are using user-defined routing to send any traffic leaving your VNet through a firewall, the traffic will come in through this public IP, and when it goes back out it will go out through the firewall. That means asymmetric routing, and this will make the cluster fail. So if you are doing routing, which we will see in a coming demo, we need to make sure that we add an exception for the traffic coming from or going to the control plane.

In this demo I'm going to show how to use ARM templates to deploy an Azure Databricks workspace with VNet integration. We will see that this subnet already has network adapters attached to it, and I will show the changes that happen while deploying. First let's examine our template. The template does not have anything other than one resource, which is the workspace itself. When I go to this resource, I can see that the type of the resource is a workspace, and in the parameters I'm passing the virtual network ID and the subnet names for both the public subnet and the private subnet. That's it. I'm not adding any NSG, I'm not adding anything else, and the delegation is not happening here either, so I have to prepare my network before I start the deployment.

Let me show you the parameter file. In the parameters, the workspace will be called workspace with VNet 2, the managed resource group is dbricks with VNet 2, and these are the two names for the subnets. Now let's see our VNet. This is the VNet, and we are deploying to two subnets: the front end, or public, is called dbricks 2 front, and the private is called dbricks 2 back. As a preparation, I already clicked on the properties of these subnets and I created
the delegation to the Azure Databricks workspace service. This is mandatory; I have to do this beforehand. Here is the same thing on the other one. The NSG is needed as well, because my template does not have an NSG. If my template does not have an NSG and I try to run it without an NSG already attached to my subnets, my template deployment will fail. So I prepared this. Let me show you the NSG: it's empty, I didn't add anything. These are the default rules, as you can see, with priority 65,000 and up. I didn't add anything other than these.

I want to show you as well, in the overview of this virtual network, that I already have a machine with a network card attached to the dbricks 2 back-end subnet. That means this subnet already has machines attached to it; however, this will not prevent my deployment from succeeding.

Back to my ARM template. I already prepared the deployment in PowerShell, and I'm executing it right now.

[Music]

Let's go back to the portal, and in the deployments I should see a new deployment. No, I'm sorry, there's one parameter missing, which is the custom virtual network ID. I will get this one from here.

[Music]

This is my resource ID. Here we go, the deployment started. Back in my resource group I should be able to see my deployment running right now. And the deployment succeeded. See here, this is my workspace created.

Let's go back to my virtual network and look at the subnets. These are the two subnets, and there are no changes here. In the overview I don't see anything new. As I told you, nothing is deployed inside my VNet until I create my first cluster, as we saw in the previous demo. What I want to show you here is the changes that happened in the security group. I'm going to take this one for example, and here we go, these are the changes: priorities 100, 101, and 102 are all created by the Databricks deployment. This is the end of this demo, and in the next one we'll be using Azure Firewall and see how we can do user-defined routing with the subnets that have the Databricks deployment.

All right, so this is the last demo, and in this demo I show how to use Databricks with a VNet together with Azure Firewall. I deployed this Databricks workspace using the virtual network playground VNet. Let's go to the subnets: the private one is dbricks BK and the public one is dbricks front end. Let's go to the private one first. I have a route table on it called private RT. In this route table I'm routing everything that goes outside my VNet to a next hop, which is this IP. This IP is the private IP of my firewall; I deployed an Azure Firewall here. So every time my cluster initiates any traffic to outside the VNet, it has to go through this firewall first, and the firewall will allow or deny the traffic.

The second one is the public subnet. As we saw in the previous demo, the public subnet has the network cards with the public IPs. That means we have traffic coming in from the control plane, and if we routed everything going out to the firewall, the traffic would come in one way, through the public IPs, and go out another way, through the firewall. Because Azure Firewall is a stateful firewall, it will drop that traffic. So we cannot use the same route table. I created another route table, and let me show you the second one. I'm adding the same route, which is that everything going outside has to go to the private IP of the firewall. However, before it I created two other routes, one for the control plane and
one for the web app. Where do I get these IPs? That's already documented in the Azure Databricks documentation; you will see the article on user-defined route settings, and I have a link to it in my article as well. In it we have the control plane NAT IP and the webapp IP, and for these two there are IPs defined for each region where Databricks is available. In my case I'm using Canada Central, so these are the IPs that I'm using here in the route details. Anything coming to or going from my cluster, or my subnet specifically, to these IPs will go to the Internet, meaning it will not be routed to the firewall but will go directly to the destination, which is these two destinations. This is very important; otherwise your cluster will fail when you create it. So that's the first part: everything from this public subnet will be routed to the firewall except these two IPs.

Now the second part. Let's go to the firewall itself. This is my firewall and these are my rules. NAT rules: I'm not going to use NAT, because Databricks clusters do not work with the firewall NAT. This is destination NAT in Azure Firewall, so I'm not going to use it. However, I will use network rules and application rules. Basically I followed the documentation on user-defined routes. So we have, for example, the metastore IP. This is the metastore that holds the metadata; here, for example, is the metastore for Canada Central. It's a MySQL database, and this is its address. So I need to have these rules allowed in my firewall. Actually I'm using application rules: if I click on my rules here, I have the metastore, and I'm allowing the traffic for MySQL. The rest are well documented: the artifact blob storage primary, the artifact blob storage secondary, the log blob storage, the Event Hub endpoint. All these are well
documented. However, in my testing I found some URLs or domains requested by Azure Databricks that are not fully documented. That's what I mention in my article as well. Let me show you my Azure Firewall monitoring. I send all the diagnostic logs for Azure Firewall into my Log Analytics workspace. This is my Log Analytics, and I look at the last four hours, for example. I had some issues when I was testing one of the quickstarts. I found that this quickstart, which is one of the quickstarts in Databricks, uses sample data in a folder or mount point called databricks-datasets. When you query data from it, you are actually getting the data copied from an Amazon AWS S3 bucket into your cluster. So I needed to allow traffic going to STS on Amazon AWS and to these two specific buckets; otherwise I had denies. Now I don't have denies anymore. Let me check the last four hours: my denied traffic is going down, as you can see, because I allowed this.

If I go back over the last 48 hours and all my testing, I had some issues before with the Snapcraft API, and I had some issues with Cloudflare. All of these are added, so in my complete list here I have the packages from snapcraft.io, I have requests going to Terracotta, I have requests going to Cloudflare, and Ubuntu updates as well. When we are creating the cluster, I found that the cluster requests packages from Ubuntu updates, so I made sure that I added the Ubuntu update domains, because the base image for Databricks clusters is Ubuntu-based. For the samples, I added the two fully qualified names. I'm adding all of these to my documentation, and I will make sure to update my Git repo with an ARM template that includes them as well. With this I managed to have my quickstart completed, and I could get the data and load it into my data frames and so on.

Regarding the network rules, I only added two
things. The first, which again I found wasn't documented, concerns the Network Time Protocol. I found lots of traffic was denied on port 123, so I allowed this port using the UDP protocol. This is basically used by the cluster to adjust the clock of its operating systems. I also added service tags for different Azure services. For example, if in the future you are using Event Hubs as a source for your Databricks cluster and you're reading data from it, then you need to allow the traffic to Event Hubs. I added SQL, and also some region-specific services: Event Hubs inside Canada Central, Event Hubs inside Canada East. You can be region-specific with the service tags. Azure Firewall understands the service tags; if you are using a different firewall, then you have to define the public IPs for the different Azure services yourself. I'm doing the same thing for SQL, so for any SQL instance inside the Canada Central or Canada East region, I'm allowing the traffic to go to them.

Since the traffic will mainly originate from the cluster and go to SQL, I just need it in the network rules, and there is no need for any inbound traffic coming from the SQL server to Databricks. If you have a case where the originator of your traffic is a SQL server that is somehow pulling data from your Databricks, then the traffic has to come in through the public IPs and the return traffic shouldn't go to the firewall. In that case you need to add the public IPs of the Azure SQL service as an exception in your route table.

With that I'm concluding my video. I will follow up with the article I'm going to write, and I will add the article link in the video description to list everything. I will make sure to add all the
routing exceptions and the test results that I found, along with the services and the firewall rules, to it. Thank you.
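
Two points from the captions, the 123 usable addresses in a /25 subnet and the route-table exceptions on the public subnet, can be sketched in Python. This is a minimal illustration under stated assumptions, not the speaker's configuration: the firewall address 10.0.2.4 and the two 203.0.113.x addresses are placeholders from a documentation range (the real control plane NAT and webapp IPs are region-specific and come from the Azure Databricks documentation), and the helper names are hypothetical. It assumes that the most specific matching route wins and that Azure reserves five addresses in every subnet.

```python
import ipaddress

# Assumption: Azure reserves 5 addresses in each subnet
# (network, default gateway, two for DNS, broadcast).
AZURE_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses left for NICs in an Azure subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET


# Hypothetical route table for the public subnet. All addresses are placeholders.
PUBLIC_SUBNET_ROUTES = [
    ("0.0.0.0/0", "VirtualAppliance 10.0.2.4"),  # default route to the firewall
    ("203.0.113.10/32", "Internet"),             # placeholder control plane NAT IP
    ("203.0.113.20/32", "Internet"),             # placeholder webapp IP
]


def next_hop(destination: str, routes) -> str:
    """Pick the next hop using longest prefix match over the given routes."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        raise ValueError(f"no route to {destination}")
    # The route with the longest prefix is the most specific one.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


if __name__ == "__main__":
    print(usable_ips("10.0.0.0/25"))
    print(next_hop("203.0.113.10", PUBLIC_SUBNET_ROUTES))
    print(next_hop("198.51.100.7", PUBLIC_SUBNET_ROUTES))
```

Replies to the control plane addresses match the /32 routes and leave directly, so they return on the same path they arrived on, while every other destination falls through to the default route and is sent to the firewall.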
Info
Channel: Mohamed Sharaf
Views: 3,552
Keywords: Azure, Databricks, Azure Databricks, Azure Firewall, Virtual Network, VNET
Id: U7Iw6g1_Rfg
Length: 35min 49sec (2149 seconds)
Published: Sat Dec 28 2019