Azure Data Factory Self-hosted Integration Runtime Tutorial | Connect to private on-premises network

Captions
Hey guys, this is Adam, and today I want to talk about how you can connect your Data Factory pipelines to on-premises data solutions, or those hidden within virtual private networks, using something called the self-hosted integration runtime. Stay tuned.

So let's jump into the self-hosted integration runtime. If you remember my diagrams, an integration runtime is the thing that runs your pipelines within Data Factory: it is the compute infrastructure Data Factory uses to run data flows, data movement, and activity dispatching (calling external services), and if you're using SSIS, it is also the place where SSIS packages are executed. What, in this case, is the self-hosted integration runtime? Let me explain by showing you an example.

If you have a Data Factory, you already have an integration runtime provided by the cloud. Whenever you execute any pipeline, Data Factory automatically provides you with the Auto-Resolve integration runtime. It can connect to any kind of service within Azure itself without any issues, and if your resources are protected with virtual network rules in their firewall settings, it can still connect to them, as long as you are using a managed identity to authenticate and you have enabled access for trusted Microsoft services. If you haven't, then Data Factory will not be able to connect to those resources using the integration runtime it gives you. And if you want to connect to on-premises environments using that same integration runtime, the case is the same: because this is a public service, it cannot connect to your on-premises private networks.

To solve both of those scenarios, you can do something else: you can install your own integration runtime on a virtual machine or on premises, hence a self-hosted integration runtime. You host your own integration runtime wherever you want. In the first case, you install it on an Azure virtual machine within the same virtual network that those resources allow in their firewall settings, which lets Data Factory connect to the firewall-protected services. For the on-premises environment you do pretty much the same: install your own integration runtime within your on-premises environment to connect to your on-premises data sources. If you have a VPN set up, that integration runtime can also live in the cloud and still connect, as long as the virtual networks are configured to allow it. The very last scenario for the self-hosted integration runtime is installing your own drivers to connect to sources that are not normally supported, like SAP. To summarize, you use a self-hosted integration runtime whenever you are connecting to private virtual networks, to private networks of on-premises environments, or in bring-your-own-driver scenarios.

One of the key differences for self-hosted runtimes is that while they can move data between private network sources and dispatch activities (call external services) within private networks, they cannot be used to run data flows, so if you're using mapping data flows that's definitely not a scenario for you, and they cannot run SSIS packages at this moment. Those are the key differences to take into consideration when deciding whether to host your own integration runtime.

So, time for a demo. Today's demo is quite big, but it's an end-to-end scenario for integrating on-premises environments. I created a script that creates two virtual machines: one will be our SQL VM, which will be imitating our on-premises data source, and one more virtual machine within the same network, where we are going to install our integration runtime. By default the script installs SQL Server on the first virtual machine and gives it a private IP only, so you cannot connect to this SQL VM from outside of this network. Next, we are going to install the integration runtime ourselves on the second virtual machine, and we are going to make sure that Data Factory can connect through this integration runtime to the SQL Server over the private IP, pull the data in, and put it in a storage account.
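For orientation, here is a minimal sketch, in Azure CLI run from PowerShell, of roughly what such an environment-create script builds. The resource names are invented here, and the real script from the video also installs and configures SQL Server on the SQL VM, which this sketch skips:

```powershell
# Minimal sketch (not the video's actual script): one resource group,
# one VNet, and two VMs. All names below are made up.
az group create --name rg-adf-shir-demo --location northeurope

az network vnet create --resource-group rg-adf-shir-demo `
    --name vnet-demo --address-prefix 10.0.0.0/16 `
    --subnet-name default --subnet-prefix 10.0.0.0/24

# SQL VM: an empty --public-ip-address value gives it a private IP only,
# so it is unreachable from outside the VNet. (The real script also
# installs SQL Server on this machine.)
az vm create --resource-group rg-adf-shir-demo --name vm-sql `
    --image Win2019Datacenter --vnet-name vnet-demo --subnet default `
    --admin-username demoadmin --public-ip-address '""'

# IR VM: keeps its public IP so we can RDP in and install the runtime.
az vm create --resource-group rg-adf-shir-demo --name vm-ir `
    --image Win2019Datacenter --vnet-name vnet-demo --subnet default `
    --admin-username demoadmin
```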
With that said, we can go into the demo. But before we start creating stuff in the portal, I want to show you the scripts that I prepared for you. There are a couple of scripts. The first is an environment-create script, a CLI script that creates all those resources I was talking about. This is the first thing you need: copy it, go to the Cloud Shell in Azure, pull it up, and once the terminal connects, simply paste in the script. After about seven to ten minutes the script should complete, at which point you should get information about the two virtual machines: their private and public IPs, and a username and password for each of them.

Let's establish a connection to the VM-IR, which is the one for the integration runtime. Copy its public IP and go to Remote Desktop Connection Manager to create a new remote desktop connection. Paste in the IP, call it VM-IR just to be sure which VM we are connecting to, copy the username into the logon credentials (removing the domain), copy the generated password and paste it in, hit add, and simply connect to the server. Accept the certificate and let it connect. Once everything opens up, you can close all the windows and you're ready to start working.

The first order of business on the VM-IR, the server where we are going to install the integration runtime, is to test connectivity to the target server you will be connecting to. Since in my case I will be connecting to the VM-SQL over its private IP, I need to test connectivity there, and the way I like to do it is using PowerShell. Inside of PowerShell, simply type Test-NetConnection, paste in the IP, and type the port, which for SQL Server is 1433. Press Enter and check the results. As you see, we got a successful response of true, in which case we can continue, since this server is able to reach our SQL server.
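In other words, the check looks like this (with a made-up private IP):

```powershell
# Substitute the private IP of your own VM-SQL; 1433 is the default
# SQL Server port.
Test-NetConnection 10.0.0.4 -Port 1433

# A healthy result ends with:
#   TcpTestSucceeded : True
```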
The second thing is optional; I do it whenever I'm working with SQL servers. I prepared a script for you which you can use to install either SQL Server Management Studio or Azure Data Studio, so that we can connect to our database. Once the installation completes, you will find Azure Data Studio on this machine, and we are going to use it to connect to the database. What we need right now is the server name, so again copy the private IP of our VM-SQL and paste it in here, select SQL login (this is what is currently set up for the demo), paste in the user, paste in the password, and hit connect. This will connect, but the server is empty by default: once you successfully connect and expand the server, you will see there is no database there yet. What is good is that at this point you are also pretty much confirming that the connectivity works just fine.

So let's open a new query and initialize the database with some test data that we can use for the demo. Go back to the scripts that I provided, find the one called initialize-database, simply copy and paste it here, and hit run. Once the script completes, you can refresh the databases, find your new demo database with a table called cars, select the data from it, and see the top thousand rows of our sample data.
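If you would rather seed the database from PowerShell than from Azure Data Studio, a stand-in sketch along these lines would do. The schema and rows below are invented for illustration, not the video's actual sample data, and it requires the SqlServer module for Invoke-Sqlcmd:

```powershell
# Stand-in for the provided initialize-database script.
# Requires: Install-Module SqlServer
$server = '10.0.0.4'   # private IP of VM-SQL (assumed)
$cred   = @{ Username = 'demoadmin'; Password = '<generated password>' }

# Create the demo database, then a small cars table with sample rows.
Invoke-Sqlcmd -ServerInstance $server @cred -Query 'CREATE DATABASE demo;'

Invoke-Sqlcmd -ServerInstance $server @cred -Database demo -Query @"
CREATE TABLE dbo.cars (
    id    INT IDENTITY PRIMARY KEY,
    make  NVARCHAR(50),
    model NVARCHAR(50)
);
INSERT INTO dbo.cars (make, model)
VALUES (N'Ford', N'Focus'), (N'Toyota', N'Corolla');
"@
```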
This is the end of the setup for the demo: you now have a clean SQL Server, the connectivity works, and you are ready to start setting up the integration runtime. At this point you can simply close all those windows and prepare for the Data Factory installation. Let's go back to the portal, go back to resource groups, open the resource group the script created for the Data Factory self-hosted IR demo, and review what is currently inside. As you see, you have the two virtual machines, one of them being the SQL Server virtual machine, and a storage account where we are going to put our data after extraction. For demo purposes, on the storage account I am going to quickly create a single container called demo; this is where I will output the data later on, but for now let's just create it and leave it.

At this point we don't have a Data Factory, so very quickly let's create one: hit create, call it irdemo, select the resource group, and select North Europe, because this is the region I created all those resources in, then hit create. Once the resource is created, after a couple of seconds, open Author & Monitor so the Data Factory editor comes up. Change the zoom a little bit and go to the Author tab, and here, inside of Connections, you will find a section called Integration runtimes. By default you always have the Auto-Resolve integration runtime, which is the one provided by Azure, but you can also hit New and create your own. Select the first option and hit continue; you don't want either of the Azure-hosted options, you actually want Self-Hosted. Hit continue, give it a name (I'm going to call it onprem), and hit create. At this point you pretty much have it in the portal, but you also need to install it on the machine itself.

To do so you need two things; as you see, there are two steps. First, download and install the integration runtime. When you click on it, a new website opens for the Microsoft download; just copy that URL over to the virtual machine, and on the virtual machine paste that URL in. Use the recommended settings in Internet Explorer, paste in the URL, and press Enter. You don't actually have to add the website to trusted sites, simply hit close; the download page opens, where you need to hit download. It will still prompt you to add the site, so hit close again, and if you scroll down you will see the actual installation links. Always grab the latest version: hit download, add the download site to trusted sites to be able to actually initialize the download, hit the download link again, and hit save. Once the download has finished, simply hit run (at this point you can close the browser if you want), hit next, accept the terms, hit next, hit next, and install. Once it finishes, it will open the integration runtime window, and it says you need to register.

What we have done so far is create the integration runtime in Data Factory and install the service locally, so now we need to register the local service so they connect together. To do that you need to paste in the key. You can grab the key in the portal; you can use either Key1 or Key2, whichever you want. Copy it from the portal, paste it here, and hit register. Once it registers, hit finish, and after ten to fifteen seconds you should get a response that your self-hosted integration runtime node has been successfully registered in Data Factory.
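As a side note, the same Key1/Key2 values can be pulled with Az PowerShell instead of the portal. A small sketch, assuming the invented resource group name from earlier plus the demo's factory and runtime names:

```powershell
# Fetch the self-hosted IR authentication keys without the portal.
# Requires the Az.DataFactory module; all names assume this demo setup.
Get-AzDataFactoryV2IntegrationRuntimeKey `
    -ResourceGroupName 'rg-adf-shir-demo' `
    -DataFactoryName 'irdemo' `
    -Name 'onprem'
```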
At this point you can hit Launch Configuration Manager, although you don't have to. When you open it, you will see information about your self-hosted integration runtime being connected. Sometimes you get an issue here about a runtime upgrade; it is not always required, so definitely check it here (as you see, it disappeared after a couple of seconds). You can go back to the portal, where, after closing this window, you should be able to see your new onprem self-hosted integration runtime. When you go into it, notice a couple of things. First of all, there is a Nodes tab: this lists the server you registered and are currently running the service on, and you can have multiple nodes within a single integration runtime definition. You can also select auto-update, which is selected by default; remember that this will automatically update the integration runtime from time to time, and you can pick the time at which it will be updated. Lastly, sharing is a great option, because you can share a single integration runtime installation across multiple data factories to save a lot of cost and management. And that's it, that's how you set up the integration runtime; you can close this window and start using it.

So let's do a demo and a test where we are going to use that SQL Server. Create a new linked service: type SQL to find it on the list (I saw it there at the bottom) and select SQL Server, since that is what we will be connecting to. This will be our onprem SQL, written with underscores. Now we need to change the integration runtime, and as you see, this is the important part: we are not using the Auto-Resolve integration runtime, the one provided by Azure, we are using our own. For the connection we need to provide the server name, and as you remember, for the server name I need an IP, so let's go back to the portal and open the Cloud Shell. If you by any chance closed the Cloud Shell, you are not going to see those credentials I showed you, but the script actually saves them inside of a file, which is a perfect example of why the script writes its output to the Cloud Shell drive. If you type ls clouddrive, you will find the JSON file for the Azure Data Factory self-hosted integration runtime demo, and you simply need to check the contents of that file to get the output from the previous script. If you ever lose the credentials for your virtual machines, this is where you can find them. So copy the private IPs, go to the Data Factory, and paste the SQL VM's private IP into the server name. You also need a database name (in our case that was the demo database), a username, and the password, so let's grab the password for our SQL VM and paste it here. Now comes the critical part: if everything went fine, by hitting test connection you should see a success message, meaning the connectivity works and your credentials are working as well. Hit create, and at this point you have a new linked service connected to your on-premises SQL through your integration runtime.

Let's add one more linked service, to our storage account, so we can actually output the files. I am going to call it blob; here I can use my Auto-Resolve integration runtime, and in this case select, from my Azure subscription, the storage account that we created previously as part of the script. Again, test the connection here as well. Now we have connectivity both to our blob storage and to the SQL Server on premises, so we can create a very small pipeline in which we are going to copy data. Copy data from where? From a new data source, from our on-premises SQL. So type SQL, use SQL Server, use the linked service for our onprem server, and call the dataset cars table (again with underscores, cars_table). Choose the table name dbo.cars; if the dropdown actually lists dbo.cars, that means it already connected to the right database and sees the table inside, so select it and hit OK. Then do the sink mapping: add a new dataset, this time for the blob storage. Save it as a CSV file, select the blob storage linked service, call it cars CSV, and under browse, put it in the demo container. We want first row as header, so we don't lose any information; hit OK. You can either publish here, or debug to get the same result. After a couple of seconds the pipeline finished; in this case it took 12 seconds. So go back to the portal, close this, go back to your resource group, go to the storage account, open the demo container, and here you see dbo.cars (we forgot to change the extension, but it doesn't matter). In essence, this is the extract from your on-premises server, made through the integration runtime that you created yourself.

Let's finish by talking about a couple of important considerations when using an integration runtime. First of all, sharing is important: you can share your integration runtime across multiple data sources and multiple data factories to save a lot of costs. Additionally, you should always try to keep it close to the data source, but I wouldn't recommend installing it on the same machine, just for maintenance purposes; and if you have a Power BI gateway installed, don't install the integration runtime on the same machine either. Lastly, the integration runtime is currently a Windows-only service, so you cannot install it on a Linux machine, and if the machine has FIPS-compliant encryption enabled, it might cause some jobs to fail, so either disable that or pick another machine. Those are the most important considerations when installing the Data Factory self-hosted integration runtime.

With a self-hosted integration runtime, nothing is keeping you from integrating on-premises data sources with the cloud and moving data one way or another. But remember to do some sort of capacity planning, because you need to understand how many of those virtual machines hosting the integration runtimes you need, and what sizes are required for them to run effectively. But for today, that's it. Hit that thumbs up, leave a comment, and subscribe if you want to see more, and see you next time!

[Music]
Info
Channel: Adam Marczak - Azure for Everyone
Views: 97,004
Keywords: Azure, Data Factory, IR, Integration Runtime, Self-hosted, Self-hosted Integration Runtime, On-premises, Onpremise, VPN, Virtual, Private, Network
Id: weiHOeje-QA
Length: 20min 12sec (1212 seconds)
Published: Thu Jan 16 2020