Azure Data Factory | Azure Data Factory Tutorial For Beginners | Introduction to Azure Data Factory

Captions
Hello everyone, hope you are all having an amazing day. Today we will be learning about Azure Data Factory. This session is conducted by a cloud expert with more than 15 years of industry experience. Before we begin, make sure to subscribe to our channel and hit the bell icon so that you never miss an update from us.

Data generated by the many applications and products around us is increasing exponentially, and because it comes from so many sources it is very difficult to manage. Using Azure Data Factory one can analyze, transform, publish, and visualize that data, and professionals with Azure skills are in huge demand, so buckle up and get started.

Here is the agenda for the day. First we begin with an introduction to Azure; after that we explain what Azure Data Factory is; then we learn about the flow process of Data Factory; then we move on to integration runtimes in Azure Data Factory; after that we see how to create a new data factory; then we introduce data lakes, covering what a data lake is and why we need it; and finally we cover a hands-on demo on creating pipelines using Data Factory to give you a better understanding of the subject.

What is Azure Data Factory? We are living in a world of data: all the different applications that are running generate millions and millions of records. Azure Data Factory, or ADF as we lovingly call it, is a cloud-based data integration service. It works on data: it orchestrates and automates the movement and transformation of data. ADF is a code-free ETL/ELT solution that is completely cloud-based; you don't need any hardware, licenses, or setup, and without writing a single line of code you can perform data movement as well as orchestration.

As data comes from a number of different products, we need a powerful tool to analyze and store it, and Azure Data Factory helps us here: storing data with the help of Azure Data Lake Store (a Hadoop-compatible storage, with Azure Data Lake Store being Microsoft's implementation of it); analyzing and transforming the data with the help of pipelines (a pipeline is a component in Data Factory that groups related activities to perform a task; we'll see that in detail); publishing the organized data; and visualizing the data with applications like Spark or Hadoop. Data Factory has support for Spark baked in as an out-of-the-box solution.

Now, how does a Data Factory pipeline flow? Looking at the architecture diagram of a Data Factory pipeline: first and foremost we have the input dataset. An input dataset is nothing but the data we have in our data store that needs to be processed; it is the source data we capture, and we pass it through the pipeline. Then we have the pipeline itself, the building block of Data Factory: it performs an operation on the data that transforms it,
which can be anything from a plain data movement to some kind of transformation. It can happen that you get data from a source and want to push it to a destination without any transformation at all: for example, you pick up files from an FTP location and push them into a SQL database as-is. You can do that. On top of that, if you do want transformation, you can perform it right from Data Factory using the power of Apache Spark and then push the data from source to destination.

Then we have the output dataset. This contains the data after it has been transformed in the pipeline, though I won't say it is always structured; you can work with semi-structured output too, such as JSON, XML, and other semi-structured datasets, and you then push it out for consumption. Where do you push the data? You can store it in multiple locations: a Data Lake Store, a Blob storage, or a SQL database.

To establish the connection with these sinks we use something called linked services. What is a linked service? Linked services store the information that is essential for connecting to external data sources: the username, password, server name, database name; the connectivity information is stored in your linked service. There can be different types of linked services, for Data Lake Store, Blob storage, SQL, and it is not limited to these three; there are many more linked services we can access in Azure Data Factory. A minimal sketch of defining such a linked service through the Python SDK appears at the end of this passage.

Then we have the gateway, which is an important concept. The gateway, also called the data gateway or the integration runtime, connects our on-premises data to the cloud; we install it on an on-premises system so that it can connect to the Azure cloud. There can be scenarios where you get your data from an on-premises source or want to push data to an on-premises destination. Either way, since Azure Data Factory is a cloud-based service, you have to set up connectivity to that on-premises source or destination. What we do is provide a networking handshake: using the data gateway, or integration runtime, we establish connectivity between on-premises and cloud.

Finally we have the cloud, where our data can be analyzed and visualized with a number of analytics tools such as Spark, R, Hadoop, and so on. This is the end of the journey: we have analyzed and transformed our data and finally we want to consume it, and that is also possible using Data Factory.
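To make the linked-service idea concrete, here is a minimal sketch (my own illustration, not from the video) of how an Azure SQL Database linked service could be registered with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory name, and connection string are placeholder assumptions.

```python
# Hypothetical names: subscription, resource group, factory, server, and database are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    AzureSqlDatabaseLinkedService,
    SecureString,
)

subscription_id = "<subscription-id>"
resource_group = "workshop"        # resource group used later in the demo
factory_name = "adf-project"       # placeholder factory name

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# The connection string holds the server, database, user, and password --
# exactly the connectivity information the video says a linked service stores.
conn_str = SecureString(
    value="Server=tcp:youtube.database.windows.net,1433;"
          "Database=sqldb;User ID=intellipaat;Password=<password>;"
)

sql_linked_service = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(connection_string=conn_str)
)

client.linked_services.create_or_update(
    resource_group, factory_name, "AzureSqlSource", sql_linked_service
)
```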
So why should we use Azure Data Factory? A few reasons are listed here, and there can be many more.

It improves productivity with a shorter time to market. Data Factory is a tool with no programming language of its own, so the learning curve is very gentle; you can grasp the concepts quickly and the time to market reduces drastically. It reduces overhead costs because it is a one-stop shop where you can perform your data movement as well as your transformation, so you don't need a lot of products to get involved, and if you look at the pricing of Azure Data Factory it is very cheap.

It transfers data using pre-built connectors: if I'm not wrong, there is currently support for around 70 connectors out of the box, so you don't need to create custom connectors (you can, of course, but most of the connectors we use are pre-built). It integrates data cost-effectively, because the pipelines and the process are so user-friendly that we can easily save cost. It gives continuous integration and delivery (CI/CD): we can easily connect to a Git repository and set up a CI/CD pipeline with ADF. And it works the way we want: we can add our own custom logic in Data Factory.

It lets us develop simple and comprehensive ETL and ELT processes without coding or maintenance. As I said, ADF is not just ETL (extract, transform, load); on top of that we can perform ELT as well (extract, load, then transform), and it requires no coding or maintenance at all. It ingests, moves, prepares, transforms, and processes our data in a few clicks and completes our data modeling within an accessible visual environment: Data Factory comes with its own canvas where you can quickly design your transformation logic and overall program flow. The managed Apache Spark service takes care of code generation and maintenance; Data Factory supports Apache Spark natively, and because it is a managed environment you don't have to bother about the infrastructure, setup, configuration, or maintenance of the Spark environment; it is completely managed by Microsoft. You can create your own processing logic without writing code and still use the power of Spark right from Data Factory.

It reduces overhead cost when migrating a SQL Server database to the cloud: it preserves our ETL processes and reduces operational cost and complexity, quite similar to what we had in SSIS. If you want to migrate on-premises data to Azure, Data Factory is currently the de facto choice and does it with a breeze, but it is not limited to moving on-premises data to the cloud. It can also re-host on-premises SSIS packages in the cloud with minimal effort using the Azure-SSIS integration runtime, and ETL in Azure Data Factory provides familiar SSIS-like tooling. People coming from an SSIS background will be happy to know that Azure Data Factory is very easy to learn; most things will feel familiar and you won't need to spend much time relearning. And if you are still not convinced that you want to refactor your SSIS packages into ADF pipelines, you still have support to run your existing SSIS (DTSX) packages from ADF.

It transfers data using pre-built connectors, giving access to an ever-expanding portfolio of 90-plus connectors, including Azure data services, on-premises data sources,
Amazon S3 and Redshift, and Google BigQuery, at no additional cost. I said more than 70 earlier, but it is actually more than 90 and it is increasing all the time; once we get to the hands-on you will see how many pre-built connectors are available right out of the box: Amazon S3 buckets, Google BigQuery, Apache Impala, Azure storage, and many more.

It integrates data cost-effectively: it integrates our data using a serverless tool with no infrastructure to manage. As I said, this is a code-free, cloud-based ETL solution, a complete PaaS offering managed by Azure. For Data Factory we pay only for what we use, and we can scale its elastic capabilities as our data grows. Since it is cloud-based and backed by Azure it has a pay-as-you-go model: you pay only for the quantity you use, and you can scale up and down whenever you need. It transforms data with speed and scalability using the Apache Spark engine in Azure Databricks, so again you get Spark without worrying about the Spark setup, and you can run big data workloads from your data factory.

It works the way we want: Data Factory provides a single hybrid data integration service for all skill levels. We can use the visual interface, or write our own code in Python, .NET, or ARM (Azure Resource Manager) templates to build pipelines; we can add any processing service into the managed data pipelines or insert custom code as a processing step in any pipeline. So if you still want to use code, you have support for that, and if that is not enough you can create custom components and still use them in your pipelines.

It gets continuous integration and delivery (CI/CD). It continuously monitors and manages pipeline performance, alongside your applications, from a single console with Azure Monitor; since this is part of the Azure suite, you can easily integrate it with Azure Monitor and visualize what is going on inside your Data Factory pipelines. It integrates with your DevOps process using the built-in support for pipeline monitoring: you just connect it to an Azure Repos Git or a regular Git repository and you can create simple and quick CI/CD pipelines for continuous integration and continuous deployment. If you prefer a less programmatic approach, you can use the built-in visual monitoring tools and alerts; Azure Data Factory comes with its own native monitoring, so if you do not want to use Azure Monitor you can use the monitoring tool available inside ADF.

Okay, now we are going to see what an integration runtime is in Azure Data Factory and what types of IR we have. The integration runtime, or IR, is the compute infrastructure used by Data Factory to provide data integration capabilities across different network environments. The IR is essentially the backbone, with the infrastructure managed by Microsoft; using an IR you can run data flows, perform data movement, dispatch activities, or execute SSIS packages.
Let's look at the terminology. In Azure Data Factory an activity defines the action to be performed; whenever you hear the term activity, it means the action you want to perform in your Data Factory pipeline. A linked service defines a target data store or a compute service; it is where you define your connectivity and all the related information. The integration runtime, or IR, acts as a bridge between an activity and a linked service; this is the managed infrastructure.

In Azure Data Factory we have three types of integration runtime: the Azure (AutoResolve) IR, the self-hosted IR, and the Azure-SSIS IR. Based on your requirement you can choose any of them (a minimal sketch of registering a self-hosted IR with the SDK appears at the end of this passage). The pictorial representation shows how the different types of integration runtimes can be used in combination to offer rich data integration capabilities and network support. Typically, when source and destination are both cloud-based we go for the AutoResolve (Azure) integration runtime; if something requires an on-premises component we go for a self-hosted integration runtime; and the Azure-SSIS runtime is exclusively for running SSIS packages. You can use a self-hosted integration runtime to connect to cloud-based sources and sinks as well, but that is not typical: if source and destination are both in the cloud, use an Azure IR; if an on-premises component is involved, use a self-hosted IR. Why does the diagram place this inside an Azure virtual network? Because when something is on-premises we should create an Azure virtual network with all the network connectivity set up between the on-premises system and the cloud, and the integration runtime connects through that network. And as you can see, using the Azure integration runtime we can run Apache Spark programs as well.

Now let's talk in a little more detail about IR locations. The data factory location is where the metadata of the data factory is stored and where the triggering of the pipeline is initiated. Since Azure is a cloud-based platform, whenever we create a resource we have to provide its location, and a data factory is also a resource, so when you create one you provide its location. Suppose I create my data factory in Central India (since I'm based in India): whenever I trigger it, the metadata of the data factory is stored in Central India and the pipeline starts from that location. It is simply the physical data center where Microsoft creates your resource. Meanwhile, a data factory can access data stores and compute services in other Azure regions to move data between data stores or process data using compute services. Don't think that if I create my data factory in Central India I cannot access data stored in any other location; that is absolutely not the case. If your data factory is in Central India and your data is in Japan, you can still access that data. This behavior is realized through the globally available IR, which ensures data compliance, efficiency, and reduced network egress costs.
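As an aside, here is the promised sketch (my own illustration, not from the video) of registering a self-hosted integration runtime through the azure-mgmt-datafactory Python SDK; the resource names are placeholders, and the on-premises machine still needs the self-hosted IR software installed and registered with the key this call returns.

```python
# Placeholder names; the on-premises machine still needs the self-hosted IR
# installer, registered with the key printed at the end.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df, ir_name = "workshop", "adf-project", "OnPremIR"

ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(
        description="Bridges on-premises sources and sinks to the cloud"
    )
)
client.integration_runtimes.create_or_update(rg, df, ir_name, ir)

# Fetch the authentication key to paste into the self-hosted IR installer on-premises.
keys = client.integration_runtimes.list_auth_keys(rg, df, ir_name)
print(keys.auth_key1)
```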
How, then, does an ADF created in Central India access data stored in Japan? That is the job of the globally available integration runtime: it takes care of connecting to your data and moving it across regions while maintaining the GDPR and compliance policies, along with efficiency and reduced cost. The IR location defines the location of its back-end compute and, essentially, the location where data movement, activity dispatching, and SSIS package executions are performed. Whenever we create an integration runtime, by default it is created in the same region as the data factory, but you can change that; it is not a requirement that the integration runtime be in the same region as the data factory. What matters is that, based on the region of your integration runtime, your data movement and related operations will initiate from that location.

Next we will discuss Azure Data Lake. Azure Data Lake has two flavors within it: the general-purpose storage account and ADLS Gen2. ADLS Gen2 is quite a buzzword in the market, and it is really a resource, a service offering, provided by Microsoft Azure. So what is the purpose of the storage account? When I say "data lake", just consider it to be either a storage account or ADLS Gen2; while creating one I will show you both flavors.

What can a data lake account do? A data lake can store any kind of data, any kind of blob: a CSV file, a Parquet file, an ORC file, an MPEG file, any file you can think of can be saved in the data lake. And the good thing is that even without any special instruction from you, when you save, say, 1 GB or 10 GB of data, it internally splits the data into multiple chunks and saves it in a distributed manner; when you read the data back from the lake it is a distributed read as well, which is why you see better performance for both reading and writing. As for size, there is basically no restriction on the amount of data; of course, the more you store, the more you pay, and if you store less you pay less. Also be mindful that the cost of a storage account is not just the storage cost: it also includes a read cost and a write cost. Say I have stored 10 GB of data once; I am not paying only for that 10 GB of storage. If a lot of customers keep reading the data from my storage account,
that read is also attributed to the storage account cost, and when I write information to the storage account I incur cost at that time too. So there are three components to storage account cost: the storage cost, the read cost, and the write cost. I'll run through the theory in about five minutes and then move on to the portal.

The slide says: the main objective of building a data lake is to offer an unrefined view of data to the data scientists; with the increase in data volume, data quality, and metadata, the quality of analysis also increases; data lakes offer business agility; machine learning and artificial intelligence can be used to make profitable predictions; and a data lake gives a 360-degree view of customers and makes analysis more robust. Those are the textbook definitions of a data lake, so let me put it in simpler language. Today, if you are using a SQL database for storing data and running reports, you can actually do the same from a lake. Instead of keeping data in a staging format, which I generally call raw format (what you call "staging" in the SQL world we call a "raw" file in the lake world), the customer gives me a file, I store it as-is in my storage account, call it a raw file, and start processing the raw file from there.

This brings us to the lake architecture; you may also have heard of the lakehouse architecture, but here I will explain the lake architecture. When data comes in from an external system, the upstream, which is where I get the data, I store it in my lake in what I call the raw zone. Then I do some data quality checks and store the result in the gold layer; I call it gold because only when data quality succeeds do I move data from raw to gold. What is a data quality check? Say I want to count the number of records before moving into the gold layer: if the count is not in the same ballpark as the last raw file, I stop the processing and do not move the raw data to gold. In this way I add extra safety checks before anything even reaches gold. Then I move the data to the publish layer. What is publish? Publish is what your downstreams can refer to. Suppose some teams do dimensional modeling in a SQL database; in the publish layer I can keep the equivalent as files, say CSV or Parquet files, for the dimension and fact tables, and that becomes the source output for my downstream applications. If I am maintaining a project there will be some upstream and some downstream: when I receive data from the upstream, my duty is to make sure data accuracy and integrity are maintained, and once that is done I move the data to the publish layer, which holds the facts and dimensions. My downstream can be my reporting application, to give a simplified answer, or some other project that looks up to
my data. Once the data is processed and published, you can share it with others. Think of raw as staging: just as staging is a schema in SQL, here it is a folder, and under the raw folder I can place multiple files, so raw is relatable to the staging layer. From staging you might move to an intermediate layer where I do data quality checks, which is what I call gold here, and then I move gold to publish: say I want to perform slowly changing dimensions (type 1, type 2, and so on), formulate dimension and fact tables, and do the transformations; you do the same thing in SQL when you move data into the main tables, and that is what I call the publish layer. This is the lake architecture, and if you have not heard of it before, it comes up very often in data engineering interviews; people will ask you what the lake architecture is, and you can explain these same fundamentals.

If I am managing a particular project, I will have an upstream, the project that gives me data, and a downstream, the subscribers who look up to my project's data. I receive data from the upstream and provide data to my downstream (let's leave aside who pushes and who pulls; my focus is just to explain the data lake architecture). To correlate it with SQL once more: raw is nothing but staging; whatever format the upstream provides, I keep it in the same format, so if I receive a CSV file I may have a staging table where I store the data without any transformation, and in the lake I call that raw. What is gold? As part of transformation you may already be doing this today, but you have to invest more time in data quality checks: when I move data from raw to gold I have to ensure I am doing data quality checks. For example, suppose a particular column must be non-nullable, meaning it should not contain any null entries; I run a data quality check on that column, and similarly there might be a hundred such rules, and only if all of them are satisfied do I move the data from raw to gold. You can adopt this approach in any technology, SQL included, but with the lake architecture it is a standard practice, and people will definitely ask what data quality checkpoints you manage. Once that is done, publish is as good as the main tables in SQL: where you would maintain dimension and fact tables in SQL, I perform the transformations from the gold folder and publish the results as dimension files and fact files. My downstream refers only to the publish layer; it does not look into gold, and I don't give anyone access to raw or gold. I receive the data, do the data massaging up to publish, and then hand it to the downstream. That is the entire flow I have just explained.
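To make the raw-to-gold promotion concrete, here is a small illustrative sketch (my own, not from the video) of the two quality rules just mentioned, a row-count comparison against the previous raw file and a non-null check, using pandas; the file paths, column names, and threshold are assumptions.

```python
# Illustrative only: paths, column names, and the 10% threshold are assumptions.
import pandas as pd

RAW_PATH = "lake/raw/customers_2024_01.csv"
PREV_RAW_PATH = "lake/raw/customers_2023_12.csv"
GOLD_PATH = "lake/gold/customers.parquet"

raw = pd.read_csv(RAW_PATH)
prev = pd.read_csv(PREV_RAW_PATH)

# Rule 1: row count should be in the same ballpark as the previous raw file.
count_ok = abs(len(raw) - len(prev)) <= 0.1 * len(prev)

# Rule 2: a mandatory column must contain no nulls.
non_null_ok = raw["customer_id"].notna().all()

if count_ok and non_null_ok:
    # Only when every rule passes does the data move from raw to gold.
    raw.to_parquet(GOLD_PATH, index=False)
else:
    raise ValueError("Data quality checks failed; raw data not promoted to gold")
```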
The diagram is just a visual representation of what I have described. You can do batch ingestion as well as real-time ingestion; in between, you store the data in the storage account and apply data management services. Data quality checks, which I mentioned, are one of them, and MDM, master data management, is another, where you keep master files (what I call dimension files). Workflow management is the movement from raw to gold to publish. What is the processing tier? I will use Azure Databricks to process all of this, so my processing tier corresponds to the gold and publish layers, where the data is actually processed and moved. You get in-memory operation and massively parallel processing: because I am using the capabilities of the storage account, the reads and writes are already parallel, and with Databricks or Data Flow I can do all of the processing in memory. What do I mean by in-memory processing? Hadoop used to do disk-based processing, which incurred a lot of I/O cost: it saved intermediate results to the hard disk, and when a subsequent step needed the data it read it back from disk into memory. With Databricks or Data Flow, the intermediate results stay in memory (memory being nothing but RAM). The distillation tier, once all the processing is done, you can again think of as publish, saved into HDFS-style storage, which here is the storage account. From there your downstream can pull data using SQL or MapReduce-style operations, or you may have reporting requirements and do insight-based reporting on it. That is the lake architecture; the terminology I have used, raw, gold, and publish, is current in the market, and the same thing is shown on the slide.

Now let's see how we create pipelines using Data Factory. Creating a pipeline is essentially one of the steps in building your ETL solution. Say we have a SQL database and we want to extract some data from it: we extract it, process it if something has to be processed, and then store it wherever we want. As you see on the screen, we have a SQL database, we extract data from it, process it, and store it in the data lake. These are just sample data stores; it can be anything: Blob storage to SQL, SQL to Blob storage, Blob to Data Lake, or Data Lake to Blob. Your data source can be anything; there is no specific restriction on what the source or the destination should be.
These are the steps for creating a pipeline with Data Factory (a minimal SDK sketch of these steps, applied to this demo, appears a little further below). First, create the linked service for the source; in my case I have taken SQL to Data Lake as the example, so that is a linked service for the SQL Server database. Then create the linked service for the target data store, which in our case is Azure Data Lake Store. Then create the dataset for the data extraction; this is done in the Author & Monitor experience, which you can open once the data factory has been created, and there you have all the options for creating linked services and datasets. Then create a dataset for saving the data, then create a pipeline and add a copy activity (or any other activity). Finally, and this is a main advantage of Data Factory, you can schedule the pipeline just by adding a trigger: whatever transformation you have, you can schedule it weekly, monthly, daily, or however you want.

Now let me show you the project we are going to build today. We will identify the videos in a selected YouTube channel that are getting the maximum traffic. Here is the problem statement: say a company wants to promote its advertisements, and it wants to place them on the channel that is getting the most traffic. How does it know which channel that is? The company needs a developer, an Azure Data Factory specialist, who automates the list of channels getting the most traffic at a given time. He can put a trigger on it for every five minutes, so as the traffic changes, the list gets refreshed every five minutes, and based on that list the company can place its advertisements. That is the project we are going to do. In our project we consider two datasets, where dataset one is the first interval. Let me show you the datasets: they contain video ID, video title, views, likes, dislikes, comments, and duration in minutes.
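Putting those steps together for this video-statistics scenario, here is a minimal sketch (my own illustration, not the video's code) of defining the source and sink datasets and a copy pipeline with the azure-mgmt-datafactory Python SDK; the table name, container, file name, and the views filter are assumptions, and the two linked services are assumed to exist already.

```python
# Illustrative sketch; names are placeholders and the linked services
# ("AzureSqlSource", "DataLakeTarget") are assumed to exist already.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, AzureSqlTableDataset, DelimitedTextDataset,
    AzureBlobFSLocation, LinkedServiceReference,
    PipelineResource, CopyActivity, AzureSqlSource, DelimitedTextSink,
    DatasetReference,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "workshop", "adf-project"

sql_ls = LinkedServiceReference(type="LinkedServiceReference", reference_name="AzureSqlSource")
lake_ls = LinkedServiceReference(type="LinkedServiceReference", reference_name="DataLakeTarget")

# Source dataset: the video-statistics table in Azure SQL.
client.datasets.create_or_update(rg, df, "VideoStatsSql", DatasetResource(
    properties=AzureSqlTableDataset(linked_service_name=sql_ls, table_name="dbo.worksheet")))

# Sink dataset: a delimited text file in the ADLS Gen2 "output" container.
client.datasets.create_or_update(rg, df, "VideoStatsCsv", DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=lake_ls,
        location=AzureBlobFSLocation(file_system="output", file_name="final.txt"),
        column_delimiter=",")))

# Pipeline with a single copy activity, filtering high-traffic videos on the way.
copy = CopyActivity(
    name="CopyHighTrafficVideos",
    inputs=[DatasetReference(type="DatasetReference", reference_name="VideoStatsSql")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="VideoStatsCsv")],
    source=AzureSqlSource(
        sql_reader_query="SELECT video_id, video_title, views "
                         "FROM dbo.worksheet WHERE views >= 70"),
    sink=DelimitedTextSink())

client.pipelines.create_or_update(rg, df, "YoutubeChannelList",
                                  PipelineResource(activities=[copy]))
```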
In real time you will get the data in different formats, but for your understanding we are building this as a sample project, and these are some of the basic fields we get from the channels. Without wasting any time, let's import this into our SQL database; for that, let me move to the Azure portal.

First of all, let me create a SQL server. If you already have a resource group that's fine, or you can create a new one; I'll create a resource group called workshop, give my server name as youtube, set the admin login as intellipaat, fill in the rest, and create the server. The deployment takes some time, so meanwhile let me open SQL Server Management Studio. You can also connect to this SQL Server instance using the Azure built-in query editor, or use SSMS to connect to the SQL database. The deployment has completed, so let's go to the resource. Azure SQL Database is a managed cloud database provided as part of Microsoft Azure, so let's go ahead and create the database: I'll name it sqldb. I don't want an elastic pool here. You can also create an elastic pool or a managed instance; an elastic pool is nothing but a collection of databases with a shared set of resources managed by a SQL database server, and the benefit of using one in Azure is that a single database can be moved in and out of the pool, which gives flexibility. We don't need that, so we'll create a normal SQL database, especially since cost varies for every database we create. I'll create the SQL database; you can see the estimated cost per month shown here. In Azure the deployment sometimes takes a while, so while it deploys let me connect the SQL server to SSMS on my local machine: go to SQL servers, find the youtube source server I just created, and check its settings. This is my server
name; using this server name we can connect from anywhere. I want to connect from SSMS, so I'll connect to the database engine, paste the server name, and use SQL Server authentication; the login credentials are the ones created while creating the SQL server, which in my case is intellipaat. I also need to sign in with my Azure account to register this client IP address: our client IP does not have access to the server, so we sign in and create a firewall rule to enable it. Let me log in to my Azure account and add my client IP. (As an aside, if you want to import or export an Azure SQL database without allowing Azure services to access the server, you can create an Azure virtual machine, connect to it, install SqlPackage, create a firewall rule to allow that VM access to the database, and export the database using SqlPackage. Here, though, we have simply created a server on Azure and are importing our data into the Azure SQL database.)

Let me expand the database; this is the database I created on my Azure SQL server. Now let me open the SQL Server Import and Export Wizard to bring in my data. My source data is Excel, so I select Excel (2016) as the source and give the source path. The destination is my Azure SQL server, so I choose the OLE DB driver for SQL Server, enter the server name and credentials, log in, and choose the database I created on the server. What I am doing here is copying the data from all of the tables to the server. Let me edit the mappings: video ID should be of integer type, not float, because all the values are whole numbers, and views should also be an integer; views cannot be decimals. In fact I want everything as integer instead of float, so I'll change those. You can also alter these data types later with a query, or manually in this way. Click Next; everything looks fine, and the import runs: yes, the process completed successfully. Let me query this data; it's taking a little time, so meanwhile let's create our Data Factory and the target data store.
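For readers who prefer code over the Import and Export Wizard, here is a rough, hypothetical alternative using pyodbc to create the table with integer columns and load the CSV rows into Azure SQL; the server, database, credentials, file name, and column list are placeholders.

```python
# Hypothetical alternative to the Import/Export Wizard; names are placeholders.
import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=youtube.database.windows.net;DATABASE=sqldb;"
    "UID=intellipaat;PWD=<password>;Encrypt=yes;"
)
cur = conn.cursor()

# Integer columns, matching the mapping edits made in the wizard.
cur.execute("""
    IF OBJECT_ID('dbo.worksheet') IS NULL
    CREATE TABLE dbo.worksheet (
        video_id INT, video_title NVARCHAR(200),
        views INT, likes INT, dislikes INT,
        comments INT, duration_minutes INT
    )
""")

with open("video_stats.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    cur.executemany(
        "INSERT INTO dbo.worksheet VALUES (?, ?, ?, ?, ?, ?, ?)",
        list(reader),
    )

conn.commit()
conn.close()
```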
For my target data store I'll create a data lake. How do we create a data lake in Azure? There is no separate menu item for creating a data lake or blob storage; you go to Storage accounts. Let me add a storage account. Azure Data Lake includes the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size or shape, and that is what we are creating here. I choose the pay-as-you-go subscription and the workshop resource group we already created, and give the storage account a name for the target data lake; that name is already taken, so let me pick another. We are using Data Lake Storage Gen2, which is the latest version; the replication mode is read-access geo-redundant storage, which is fine, and the access tier is hot. Let me create it.

You might have a question in mind: what is the difference between Azure Data Lake Gen1 and Gen2? Gen1 is file-system storage in which your data is distributed in blocks in a hierarchical file system; Gen2 combines that file-system storage, for performance and security, with object storage, for scalability. (Azure Data Lake Analytics, which worked with Gen1, is no longer offered because it has been retired, and the newer features have gone into Gen2.) Let me choose the workshop resource group, which I missed earlier, and name the account target. To be clear, this option is Gen2, this is Gen1, and this is plain Blob storage; those are the three flavors you can create under storage. All right, the storage account is created; let me also quickly check the database.
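The same storage account can be created programmatically. Here is a minimal sketch (an assumption on my part, not shown in the video) using the azure-mgmt-storage SDK, where enabling the hierarchical namespace is what turns a StorageV2 account into ADLS Gen2; the names and region are placeholders.

```python
# Placeholder names and region; the hierarchical namespace flag makes it ADLS Gen2.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku, Kind

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    "workshop",                # resource group from the demo
    "targetdatalake",          # hypothetical storage account name
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_RAGRS"),   # read-access geo-redundant storage
        kind=Kind.STORAGE_V2,
        location="centralindia",
        access_tier="Hot",
        is_hns_enabled=True,              # hierarchical namespace => Data Lake Gen2
    ),
)
account = poller.result()
print(account.primary_endpoints.dfs)      # the ADLS Gen2 (dfs) endpoint
```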
Now let me create the Data Factory; this is where we create the job that automates the whole process. Data Factory provides a single hybrid data integration service for all skill levels: you can use the visual interface, or write your own code to build pipelines, or build the pipelines manually; you can put your choice of processing services into the managed data pipelines or insert custom code as a processing step in any pipeline. Let me create one data factory first, name it something like adfproject, and choose workshop as the resource group. I don't want to enable Git for now, so I choose No, and since the name is taken I'll use a slightly different one. (Just a quick info: test your knowledge of Azure by answering this question. What are the different cloud deployment models? a) private cloud, b) public cloud, c) hybrid cloud, d) all of the above. Comment your answer in the comment section below and subscribe to Intellipaat to know the right answer. Now let's continue with the session.)

What is this Enable Git option? To provide a better authoring experience, Azure Data Factory allows you to configure a Git repository, with either Azure Repos or GitHub. It is a version control system that allows easier change tracking and collaboration, but I don't want to enable it for this, so let me just create the data factory. It got deployed; let me go to the resource. As I said while explaining Data Factory, whenever you want to create linked services or pipelines you go to Author & Monitor. In the Manage tab you can create linked services; you can also create the whole process in one go with the copy tool, but this tab gives you dedicated pages to create linked services and integration runtimes, do the Git configuration, and define triggers.

First let me create the linked service for the source database, which is Azure SQL, and check the server name. (By the way, if you are looking for end-to-end training in Azure Data Factory, Intellipaat provides an Azure Data Factory certification training covering DP-200 and DP-201; you can check the details in the description box below. Now let's continue the session.) I pick the sqldb database, enter the password, and test the connection; or let me do it from the copy tool instead. My task name will be "youtube channel list". Here you can run it once or regularly on a schedule, hourly, every minute, whatever you want; right now I'm just running it once. For authentication, let's do it with a service principal ID. I go to App registrations and create a new app, which I'll call azure sql. Under Certificates & secrets I create a new client secret, which generates my principal key; I copy this key and keep it in Notepad for safekeeping. From the app's overview page I get the principal ID, which is the application ID, and copy that as well. So the service principal ID and the principal key are noted. We also have to set up firewall access on the Azure SQL side, so let me add a role assignment with the Owner role; let me confirm the app registration name first, since we created it for Azure SQL, then go to the SQL server and, under Access control, grant the access to this service principal.
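The service principal ID and key collected here are exactly the kind of credentials a Data Lake (ADLS Gen2) linked service can authenticate with. Here is a hedged sketch (not from the video) using the azure-mgmt-datafactory SDK's AzureBlobFSLinkedService model; the tenant, IDs, secret, and account URL are placeholders.

```python
# Placeholder IDs and URLs; reuses the management-client pattern from earlier sketches.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobFSLinkedService, SecureString,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

lake_ls = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        url="https://targetdatalake.dfs.core.windows.net",  # ADLS Gen2 endpoint
        service_principal_id="<application-id>",            # from App registrations
        service_principal_key=SecureString(value="<client-secret>"),
        tenant="<tenant-id>",
    )
)

client.linked_services.create_or_update(
    "workshop", "adf-project", "DataLakeTarget", lake_ls
)
```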
Once we are done with this, let me test the connection and see what error we get: it says to check the linked service configuration and the SQL Server firewall settings. We are done with the principal ID and key, so let's check the firewall and the database once more; this is the source database, so I need to set the access for it as well, and once that is set the test is successful, so let's create it. The linked service for the Azure SQL database is created successfully.

Now for the next step. While loading the source data into the target we also want to apply a query: in our dataset we are looking at views and likes to decide where the traffic is, so to find the channels with the most traffic I will query here. You can see the schema and even preview the data. The query will be something like: select video_id, video_title, and views (let's keep just video ID, video title, and views) from the table, which here is called worksheet, and add a filter on views (say, views of at least 70) to narrow it down. All right, we got the data.

Now let's choose the destination data store. I will create a linked service for the data lake store I created, which is the target dl storage account: choose Azure Data Lake Storage Gen2, click Continue, pick my pay-as-you-go Azure subscription, choose the storage account name (the target data lake), test the connection, and create the linked service. Click Next. For the folder path, let me go to Storage Explorer (preview) and, under Containers, create a blob container called output. (Again, the difference between Gen1 and Gen2: Gen1 is file-system storage where data is distributed in blocks in a hierarchical file system, while Gen2 combines file-system storage, for performance and security, with object storage, for scalability.) I choose the output container, set the file name to final, and leave maximum concurrent connections and block size blank. Click Next. My file will be in text format, which is what I want, with a comma as the column delimiter (you can change it), and the row delimiter is auto-detected as \n, the end of the line. Click Next and run it. You can also monitor this run; it is still in progress and taking some time, so let's wait. Note that we have applied a manual trigger here.
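While waiting, the same monitoring can be done programmatically; here is a minimal sketch (an illustration, not the video's code) that queries recent pipeline runs with the azure-mgmt-datafactory SDK. The factory and resource group names are placeholders.

```python
# Placeholder names; lists pipeline runs from the last day and prints their status.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

now = datetime.now(timezone.utc)
runs = client.pipeline_runs.query_by_factory(
    "workshop",
    "adf-project",
    RunFilterParameters(last_updated_after=now - timedelta(days=1),
                        last_updated_before=now),
)

for run in runs.value:
    # Status moves from InProgress to Succeeded (or Failed) as the copy finishes.
    print(run.pipeline_name, run.run_id, run.status)
```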
You can also schedule it minute-wise or hourly, in whatever way you want, as I said earlier. Let me open the run; it is still in progress, so let's wait for it to succeed. Meanwhile, about triggers: Data Factory supports three types of trigger (a minimal sketch of defining a schedule trigger appears at the end of this passage). A schedule trigger invokes a pipeline on a wall-clock schedule; a tumbling window trigger operates on a periodic interval while also retaining state; and an event-based trigger responds to an event. Those are the three trigger types supported in Azure Data Factory. I think the run might be completed; let me refresh and see.

Since we have already created these linked services, we can pick them directly here: my source is the Azure SQL database, I click Next, and I need to apply that query again. Let me also show you one more thing: before running it, let me validate the query here. This is the output we get, so the query is working; we gave it some hundred channels and we are getting back the handful of channels that have the most traffic according to the constraint I set on the dataset. Click Next, choose the Data Lake storage as the destination data store (the linked service we already defined), choose the output container and the file name, and test the connection; it got validated, so let's finish this. Now let's move to Storage accounts and check whether the output file got created: this is my target storage, it should be under the output container, so let me open it in Storage Explorer preview mode. It's still loading... yes, it is there; the run was successful.
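Here is the promised sketch of a schedule trigger, my own illustration rather than the video's code, using the azure-mgmt-datafactory SDK; the recurrence, pipeline name, and factory names are placeholders.

```python
# Placeholder names; fires the demo pipeline every five minutes once started.
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "workshop", "adf-project"

trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Minute",
            interval=5,                          # every five minutes
            start_time=datetime.now(timezone.utc),
            time_zone="UTC",
        ),
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference",
                reference_name="YoutubeChannelList",
            )
        )],
    )
)

client.triggers.create_or_update(rg, df, "EveryFiveMinutes", trigger)
# Triggers are created stopped; starting them is a separate long-running call.
client.triggers.begin_start(rg, df, "EveryFiveMinutes").result()
```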
I will show you a flavor of V1 within V2 itself later. Then click Next. This page is about Git integration; we are going to do the GitHub integration later, so I will check "Configure Git later". When we do set it up, I'll choose GitHub as the repository type and provide the account name, repository name, branch, and so on. For networking I choose a public endpoint rather than a private endpoint, because I want to reach this data factory instance over the public Azure network. A private endpoint makes sense if you want to run the data factory only from a particular virtual machine inside a virtual network: traffic then does not go over the internet, but you have to get into that VM to connect and execute anything. Our focus is to use the public endpoint. Next, if you want to encrypt your data factory with a customer-managed key you can do so by providing the key vault URL; I don't want to do that, and you can keep it unchecked — as it says here, data is encrypted with a Microsoft-managed key by default, so you only provide your own key if you specifically need customer-managed encryption. I have explained the significance of tags before, but once more: you already classify a resource by directory, subscription, resource group, and the resource itself, but if you want to classify it along another dimension you can create tags. They are key-value pairs — for example, I can tag this with environment = development, and if I do the same for five more resources, then when the bill comes I can see the cost for the set of resources tagged environment = development. It is just an extra way of grouping your resources. Then go to Review + create; if everything is good the validation succeeds, and you click Create. You will see in the notification icon that the deployment has started, and within thirty seconds or a minute it should be created — the deployment is in progress, and now it has succeeded. I will go to the resource, and you can see a few options here. The main one is Author & Monitor, but first let's look around. The Overview blade shows what you have already created: the name of the data factory, the resource group name, the subscription name and ID, and so on.
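For completeness, the same create step (name, region, tags) can also be done with the Python SDK; a minimal sketch, assuming placeholder resource names:

```python
# Sketch: create a V2 data factory with a tag, mirroring the portal steps above.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

factory = Factory(
    location="eastus",
    tags={"environment": "development"},  # same key/value classification as in the portal
)
df = client.factories.create_or_update("demo-rg", "demobits-adf-127", factory)
print(df.provisioning_state)  # "Succeeded" once the deployment completes
```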
If I want to see what changes have been made to this data factory, I can look at the Activity log. Since we have just created it there are no events to display yet, but as we keep using the data factory we will see more entries there. Access control (IAM) is where you grant access to a user, a service principal, or a group. Coming to role assignments: let's say I have a user named Venky and I want to give Venky the Reader permission on this data factory. I click Add role assignment, choose the Reader role, search for the user Venky, click on him, and click Save; you can see the role being added for Venky on this data factory, and now it is added. If Venky logs into the Azure portal he should be able to see this data factory. If you missed creating any tags during deployment, you can still add them here: let's say this development resource is used by my payments team, so I add a tag team_name = payments and apply it, and it becomes part of the tag set. Next, Diagnose and solve problems: if you are hitting some recurring issue with the data factory, this is the first place to check. It contains a lot of community-reported problem statements — users post problems and the community posts solutions alongside them — so if your problem matches one listed here you will likely find a fix. If it is not there, you can always raise a new support request: when you are really stuck, or you believe there is a bug, raise a support ticket and a support engineer will contact you and take it to closure; if they cannot resolve it themselves they will route it to the Data Factory product team, who respond within defined timelines. But the first stop is Diagnose and solve problems. Now let me go to Properties, because I want to show you one property in particular. There are use cases where you need to run data factory pipelines from a programming language — say you build a web application with a small piece of C# code, and when a user clicks a button on that portal you want a particular pipeline to execute. At that point certain properties are required, and one of them is the fully qualified resource ID, which identifies exactly which resource you are pointing to. It is nothing but the resource ID: subscription, subscription ID, resource group, resource group name, the Microsoft.DataFactory service, and which data factory I am using — demobits-adf-127 in my case — giving you the entire hierarchy. Some SDKs will ask you for this resource ID, and you can extract it from the Properties blade. This applies to any resource: if I go to Azure SQL I will see another resource ID, if I go to Cosmos DB another one; you just go to Properties to extract it.
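If it helps, the fully qualified resource ID shown on the Properties blade is just the standard ARM hierarchy, so you can also compose it yourself; a tiny sketch with placeholder values:

```python
# The fully qualified resource ID is subscription -> resource group -> provider -> factory.
subscription_id = "<subscription-id>"
resource_group = "demo-rg"
factory_name = "demobits-adf-127"

resource_id = (
    f"/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}"
    f"/providers/Microsoft.DataFactory/factories/{factory_name}"
)
print(resource_id)
```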
Just a quick info, guys — test your knowledge of Azure by answering this question: which of the following web applications can be deployed with Azure? A: ASP.NET, B: PHP, C: WCF, D: all of the above. Comment your answer in the comment section below and subscribe to Intellipaat to know the right answer. Now let's continue with the session. Locks I explained, but I will repeat: there are two types of locks, read-only and delete. If I want to make this data factory read-only, I add a read-only lock and it becomes read-only — you will not be able to modify any of its properties. A delete lock means you will not be able to delete the data factory. Let me add one: I simply name it "do not delete", click OK, and it is set — I don't even have to save, it is applied automatically. Now I will not be able to delete this data factory. Let me try: I type the name, confirm, and you see the deletion failed, because there is a lock. This is to avoid accidental deletes, and it is always better to use these locks than not. One more scenario: some deployments are "full" deployments, meaning the resource is dropped and recreated. If you don't want that to happen to your data factory, keep a delete lock on it so the deployment cannot drop the resource. Next, Quick start: I mentioned there might be cases where you need to drive this data factory from a .NET application — for that you can use the Azure SDK for .NET; if you want to run it through PowerShell, there is documentation for that; there is also the Azure SDK for Python and a REST API if you want to hit a particular URL to control the data factory. Just go through the documentation for whichever option you need. Then there are Alerts, Metrics, Diagnostic settings, and Logs, which I will cover together with one use case: one of the metrics is called "Failed pipeline runs". This metric, along with all the others, flows into the logs on a day-to-day basis, and using alerts and diagnostic settings I can detect that a pipeline has failed and notify the respective team or person. I will make use of that later on.
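As a rough sketch of that use case — not something configured in the session — you could also query for failed pipeline runs yourself with the Python SDK and notify a team from your own code; the model and method names below come from azure-mgmt-datafactory and the resource names are placeholders.

```python
# Sketch: list pipeline runs that failed in the last 24 hours.
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters, RunQueryFilter

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
    filters=[RunQueryFilter(operand="Status", operator="Equals", values=["Failed"])],
)
runs = client.pipeline_runs.query_by_factory("demo-rg", "demobits-adf-127", filters)
for run in runs.value:
    print(run.pipeline_name, run.run_id, run.status, run.message)
```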
Now I will go back to the Overview section and click Author & Monitor, which opens the Data Factory workspace. There are four tabs within the workspace: the Data Factory overview, Author, Monitor, and Manage. Before walking through them, I will give you a problem statement — we will understand these things by solving a problem rather than by touring the menus. This is the storage account I have, and under the blob containers I have one container called "source" and another called "target". Target has nothing in it at this point; if I click on source, I have a netflix_titles file. I will solve a very simple problem statement: transfer this netflix_titles file into the target location. Can anyone tell me what the first step is? I explained the theory, so you should be able to answer. Yes — first we need to connect to the source using a linked service. So I click Create linked service (or New), and once you do, observe the list of sources Microsoft supports: there are a lot of them, and as Microsoft adds more over time you will simply see them appear in this list. For our problem we want to connect to a storage account, so I type "storage" and pick Azure Blob Storage — you also see the Data Lake Storage Gen1 and Gen2 options, but since this is a normal storage account I am going with Blob Storage. Naming is very important, and I follow a particular nomenclature you are welcome to adopt: I will call it ls_storage_demobits — "ls" for linked service, "storage" for the type, and "demobits" for the account I am connecting to. Remember I was talking about integration runtimes: by default you get one called AutoResolveIntegrationRuntime, which is an Azure integration runtime — that is what you use to connect to any PaaS (platform as a service) instance. I select it and scroll down to the authentication methods: account key, SAS URI, service principal, managed identity, and so on. I was explaining about access keys earlier, so I will use the account key method. From the subscription dropdown I can list the subscriptions and storage accounts, I point it to the demobits account, it automatically loads the key, and I do a test connection before actually creating the linked service. Click Create, and the linked service is created. What should the second step be? A dataset — again with a proper naming convention: ds_storage_source_netflix. Then I select the linked service, ls_storage_demobits. Now you understand why I am so focused on naming: when I pick it from the dropdown, the name alone tells me which linked service I am pointing to, so it is easy to comprehend. Also, when you create a dataset of storage-account type, only the storage-based linked services show up, so you cannot mix them up. Because my linked service already points to the storage account, I can browse to the file: click the little folder icon, browse, and you can see the container is "source", there is no intermediate directory (so nothing goes in that box), and the file name is the netflix_titles CSV. I know the first row is a header, so I tick that option — if your file has no header, leave it unchecked, otherwise the first row would be treated as a data record. Click OK.
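The same two objects — the linked service and the source dataset — look roughly like this when created through the Python SDK. This is only a sketch based on the SDK quickstart, so it uses the AzureStorageLinkedService and AzureBlobDataset model types (the UI's delimited-text dataset maps to different model names); the connection string and resource names are placeholders.

```python
# Sketch: ls_storage_demobits (account-key auth) and ds_storage_source_netflix.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    AzureStorageLinkedService,
    DatasetResource,
    LinkedServiceReference,
    LinkedServiceResource,
    SecureString,
    TextFormat,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "demo-rg", "demobits-adf-127"

# Linked service: connect to the storage account with its account key.
conn = SecureString(
    value="DefaultEndpointsProtocol=https;AccountName=demobits;AccountKey=<key>;EndpointSuffix=core.windows.net"
)
ls = LinkedServiceResource(properties=AzureStorageLinkedService(connection_string=conn))
client.linked_services.create_or_update(rg, df, "ls_storage_demobits", ls)

# Source dataset: the CSV in the "source" container, first row treated as a header.
ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            reference_name="ls_storage_demobits", type="LinkedServiceReference"
        ),
        folder_path="source",
        file_name="netflix_titles.csv",
        format=TextFormat(first_row_as_header=True),
    )
)
client.datasets.create_or_update(rg, df, "ds_storage_source_netflix", ds)
```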
What is my next step? For the target I will just point to the container — I could provide a file name, but I don't want to: I will only specify the target folder, and whatever file name comes from the source will be written with the same name in the target. So I click New dataset, pick the storage account again, Continue, Delimited text, Continue, and I will name it ds_storage_target_folder. Why "target"? Because my container is named target — don't read it as classifying the dataset itself as a target; I mentioned earlier that you don't have to classify datasets as source or target, so keep that in mind. I leave the file blank, use the same linked service, tick "first row as header", browse, select the target container, and click OK — target is filled in automatically, with no directory and no file. My second dataset is created. Before moving to the next step I will click Publish all. You are developing this data factory in a web browser, and if I refresh the browser by mistake, unpublished changes are lost — you would not see these datasets any more. So from time to time you need to click Publish all; publishing pushes the relevant objects — datasets, pipelines, linked services — to the data factory service. It has published successfully, and now if I refresh the browser everything stays intact: you can still see the two datasets. To demonstrate, let me clone one of the existing datasets (cloning is the faster option), not publish, and force-refresh the browser — it warns that "information you have entered may not be saved, leave page?" — and after the refresh the cloned dataset is gone. I just wanted to show the importance of Publish: you are developing everything in a browser, so it makes sense to publish. Next, as someone rightly said, I will create a pipeline and move the data. New pipeline, and following the naming convention — pl for pipeline, just as ds is for dataset and ls for linked services — I will call it pl_storage_copy_file, since I am copying a file within the storage account. On the left there are a lot of activities — Copy data, Data flow, and many more — and because this is new to you it can be hard to browse to the exact one, so use the search box: I want a simple copy from source to target, so I type "copy" and drag the Copy data activity onto the canvas. Let's rename the activity by changing the name field. There are a few more options here — timeout, retry, and retry interval — remind me to come back to those once I execute this pipeline; I will cover all three in one go.
Let me select the source — the source is a dataset, so I pick ds_storage_source_netflix; because the naming is clear, I immediately know which dataset to click. Then I go to Sink and select the appropriate sink dataset, ds_storage_target_folder. I don't have to configure Mapping: mapping matters when you want to map one column to another, but here I am simply copying a file, so I'll talk about mapping later; right now we are focused on transferring one file from one location to another. I'll publish before executing the pipeline, and then I'll show you how to run it. While it is publishing you can watch the notification icon to check whether it published, and it has published successfully. For development purposes there is an option called Debug. Before I execute, note the state of my storage account: the target container has nothing, and the source has netflix_titles, which is the file I want to move. I click Debug, and an Output pane appears showing this copy data activity in the Queued state. Queued means the run has not started yet — it is trying to acquire compute from the integration runtime. Once it acquires the resources it moves to In progress, and from there it goes to either Succeeded or Failed. If I had made a mistake it would obviously fail, but because everything was correct I got Succeeded. Since it succeeded, I can click the details icon to see what happened: one file was read, one file was written, 3.33 KB of data was written, and it also shows the throughput — it's a very small file, so about 1.1 KB per second. Now I go to my storage account: earlier the target was empty, I refresh, and there is netflix_titles with a timestamp of 8:53 Indian time — and it is 8:53 right now, which makes sense. To quickly look at what is in the file, you cannot browse it from this blade: scroll down, go to Containers, open the target container, click the file, and go to Edit to see a preview. The better option is to download the file and open the CSV in Excel, where you can see it properly with all the headers. So that is a simple copy of a file from the source container to the target container.
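Put together, the pipeline, the run, and the Queued → In progress → Succeeded/Failed monitoring loop look roughly like this in the Python SDK — again a sketch in the style of the SDK quickstart (blob source/sink model types, placeholder names), not the exact JSON the UI produces.

```python
# Sketch: define pl_storage_copy_file, run it, and poll until a terminal state.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "demo-rg", "demobits-adf-127"

copy = CopyActivity(
    name="Copy Netflix File",
    inputs=[DatasetReference(reference_name="ds_storage_source_netflix", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="ds_storage_target_folder", type="DatasetReference")],
    source=BlobSource(),
    sink=BlobSink(),
)
client.pipelines.create_or_update(rg, df, "pl_storage_copy_file", PipelineResource(activities=[copy]))

# Equivalent of pressing Debug/Trigger: start a run and watch its status.
run = client.pipelines.create_run(rg, df, "pl_storage_copy_file", parameters={})
status = "Queued"
while status in ("Queued", "InProgress"):
    time.sleep(10)
    pipeline_run = client.pipeline_runs.get(rg, df, run.run_id)
    status = pipeline_run.status
print(status, pipeline_run.message)
```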
So to recap: the linked service part is clear — if we want to connect to a source or destination system, we create a linked service. On integration runtimes: I did not create one, because by default we get one named AutoResolveIntegrationRuntime, and its type is Azure integration runtime. If you want to connect to PaaS systems — a storage account, Azure SQL, and so on — you can use this auto-resolve runtime. But I mentioned that if you want to connect to an on-premises ecosystem, you use a self-hosted integration runtime. How? Click New, choose Self-hosted, Continue, and give it a name — I'll just call it "selfhosted" — and I'll walk you through the steps so you can try them offline. Click Create. Hey guys, if you're looking for end-to-end training in Azure Data Factory, Intellipaat provides an Azure Data Factory certification training covering DP-200 and DP-201; you can check out the details in the description box below. Now let's continue the session. Okay, it is created, but not finished yet, because we are trying to reach an on-premises system. Say you have a SQL Server sitting in a particular on-premises virtual machine: you download and install the integration runtime on that VM (let's go with option two, the manual process rather than the express setup). When you open the installed application it asks for a key, and you provide either key1 or key2 from this page. Two or three minutes after you download, install, and provide the key, the status here turns to Running, which means my data factory can talk to that virtual machine — I can read its file system, read the SQL Server, and if there is an Oracle instance on the same VM I can read that too. That is the purpose of the self-hosted integration runtime. Back to the Azure integration runtime: the auto-resolve runtime is nothing but an Azure integration runtime, but if I create one explicitly I get some options, such as choosing the region where everything runs. If my customer base is in the US, I keep everything in East US, and I can also pin the integration runtime to East US, which guarantees that when my data moves, it moves within East US. I have not yet talked about data flows and the remaining runtime options; I'll cover those later.
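For reference, registering a self-hosted integration runtime and fetching key1/key2 can also be scripted; a hedged sketch with the azure-mgmt-datafactory SDK, where the runtime name is a placeholder and list_auth_keys is the call I would expect for the keys shown on that page:

```python
# Sketch: register a self-hosted IR and print the key to paste into the installer on the VM.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "demo-rg", "demobits-adf-127"

ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="Connects to the on-premises SQL Server VM")
)
client.integration_runtimes.create_or_update(rg, df, "ir-selfhosted", ir)

keys = client.integration_runtimes.list_auth_keys(rg, df, "ir-selfhosted")
print(keys.auth_key1)  # provide this (or auth_key2) during the integration runtime install
```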
Before giving you the exercise, let me explain the sign-up process, because you may not have created a subscription yet. Go to portal.azure.com — I will paste this in the chat window — and do a sign-up. You can use a personal email account: Yahoo, Outlook, Live.com, Gmail, any personal account. During sign-up you will be asked to choose a subscription, and you will land on a page showing three subscription types: free trial, pay-as-you-go, and Azure for Students. Don't bother with Azure for Students unless you are a university student at a university that has a tie-up with Microsoft — it gives some free products for twelve months, but it is only for university students. That leaves the free trial and pay-as-you-go. Because you are in the learning phase, go with the free trial: it gives you 200 dollars of free credit over a 30-day period — in India that is roughly 13,000 INR, in the US 200 dollars. If I exhaust the 200 dollars in ten days, the subscription turns disabled; equally, if 30 days pass without exhausting the credit, the subscription also turns disabled. When you select this option you will be asked for credit card details, and you can safely provide them. Based on your location, one rupee (in India) or one dollar (in the US) will be deducted and then refunded; it is only to check that your card is legitimate, and once the card is validated you can complete the sign-up and the money comes back. Read the terms and conditions, but you don't have to worry from the credit card perspective. At some point you will have to make a decision, though, because you are investing time in this Azure subscription: what happens once the free trial is over? There are two options. One, start another free trial — but you cannot reuse the same account; you would need another personal account and another credit card. Two, the better option, move to pay-as-you-go: your credit card is already linked, so you switch and the billing actually starts. It is not an easy decision, because with a card linked you may be skeptical that some amount will show up on every bill. So whenever I cover a resource, make it a habit to understand its costing model — once you do this sign-up, I will also show you where to find the costing — and once you are comfortable with the costs you can confidently move to pay-as-you-go. While you are on the free trial, focus on understanding the cost side; then the move is easy. A second question that may be on your mind: will my subscription automatically move from free trial to pay-as-you-go? No — unless you explicitly instruct it, it will not move automatically, so once it expires you don't have to worry from the credit card perspective, and if you are really concerned you can remove your credit card after the free trial. That is the sign-up process. Now, back to the problem statement: I was transferring the netflix_titles file from source to destination. What if I want to upload a different file and move it using the same pipeline I created? For that you need to understand a few more options. Let's do one thing — I will upload another file; give me one second while I pick it and upload it.
I have uploaded a new file, customer.txt, and now I'll go back to the data factory. The pipeline we built is a static copy; now I will make it a little more dynamic. I will create a new pipeline parameter called FileName. Whenever a pipeline has a parameter and you click Debug, the first thing it asks is for a value for that parameter — without passing it you cannot execute the pipeline. That is exactly the point: when building the pipeline I don't know what the file name will be at execution time. To repeat the problem statement: earlier I copied a file from source to target in a very static manner — I chose a particular file, netflix_titles, and moved that specific file. This time I am building a dynamic copy, which means the file to move is decided at run time and then transferred from source to destination. So I create this parameter so that at run time I pass a file name and that file gets moved from source to target — that is the intention. What I did was clone the existing pipeline and rename the clone to pl_storage_dynamic_copy, and then I created the FileName parameter on it. Next, the copy data activity has a source and a sink, and the current source dataset is dedicated entirely to the netflix file, so I will not use that dataset. I go to datasets, take the one pointing at the netflix file, clone it, and name the clone as the dynamic-file dataset — just so I can interpret it that way. I remove the hard-coded file name, and then there is a Parameters tab on the dataset: I create a new dataset parameter and name it DataSetFileName. Back on the Connection tab, in the file name box, instead of netflix_titles I pass this parameter: click Add dynamic content, search for DataSetFileName, click it, and the expression is formulated automatically. So what happens when I run pl_storage_dynamic_copy? The pipeline already has its own parameter, and when I run it I will pass a value, say customer.txt. That pipeline parameter has to cascade into the dataset: the pipeline parameter goes into the dataset parameter, and the dataset parameter is what is referenced as the file name. So I go back to the copy data activity, and under Source I change the dataset to the dynamic-file dataset. Because that dataset has its own parameter, I am prompted to provide a value for DataSetFileName right here — now you can see the link. I created a dataset parameter, and because I am using a dataset that has a parameter, I am asked to supply it, and what I supply is the pipeline parameter: click Add dynamic content, type FileName, click it, and the expression is evaluated — it is the pipeline parameter FileName. Click Finish.
At run time, if I pass a file name of customer.txt, that name goes into the DataSetFileName parameter, which is referenced in the file property of the dataset via Add dynamic content. The sink is still the same — it just points to the target folder, which still holds good. I click Publish all to save the changes before running. Publishing succeeded, so I click Debug, and now, because the pipeline has its own parameter, it asks me for the file name. I type customer.txt, click OK, and in the output the copy goes from Queued to In progress to Succeeded. Back in the storage account, this file exists in my source container, and if I go to the target I can now see the customer file there, with the current timestamp. So I have made it more dynamic in nature. Let's pass another file: I run it again with netflix_titles — note the current time, because the timestamp in the target will change. Click OK, refresh: Queued, In progress, Succeeded; and if I go back to the target and refresh, the time on the netflix file has changed to 9:33. I'll repeat what I did once more — I hope it is clear, but to summarize: my goal is to decide the file name dynamically at run time. To do that I created a pipeline parameter called FileName. I then had to pass that pipeline parameter into my dataset, so on the dataset I first created a dataset parameter, DataSetFileName, and placed it in the file property — container, intermediate directory, and then the file name, which is the parameter. At run time the pipeline passes its value into the dataset's file name property. In the copy data activity's Source, because I am using that parameterized dataset, it prompts me for DataSetFileName, which is nothing but the pipeline's FileName parameter, and the sink points to the target folder. That is why it succeeded, and it is now fully dynamic compared to the first pipeline — I can pass any file name I want. I will unmute you to take questions, or you can post them in the chat window. One question: by default, will it replace the existing file at the destination? Yes, it will overwrite it.
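Since the dynamic pipeline is driven entirely by its FileName parameter, running it from code is just a matter of passing a value. A small sketch — the expression strings in the comments are the ones built with Add dynamic content, and the parameter names follow this demo's naming:

```python
# Sketch: trigger pl_storage_dynamic_copy with a file name chosen at run time.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Dataset file-name property:       @dataset().DataSetFileName
# Value cascaded from the pipeline: @pipeline().parameters.FileName

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
run = client.pipelines.create_run(
    "demo-rg",
    "demobits-adf-127",
    "pl_storage_dynamic_copy",
    parameters={"FileName": "customer.txt"},  # decided at run time, not design time
)
print(run.run_id)
```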
So what I will do is make a copy of this pipeline — clone it — and rename the clone to something like copy-if-not-exists. I'll lay out the whole flow first: I will check whether the file is already present in the target; only if it is not present will I copy it, and if it is already there I will not copy it. That is the logic. Just a quick info, guys — test your knowledge of Azure by answering this question: what are the different cloud deployment models? A: private cloud, B: public cloud, C: hybrid cloud, D: all of the above. Comment your answer in the comment section below and subscribe to Intellipaat to know the right answer. Now let's continue with the session. First I need a way to check whether the file is present in the target. At this point we have a dataset pointing to the target folder, but I need one pointing to the target file, so I clone the folder dataset, name it as the target-file dataset, create a dataset parameter on it like I did earlier, go to Connection, use that parameter in the file name, and click Finish. Now I go to the new pipeline, and there is an activity called Get Metadata — as we solve more use cases you will keep meeting new activities. I drag in Get Metadata, name it "Check File Existence", and for its dataset I use the target-file dataset I just created. Because that dataset has its own parameter, I pass the pipeline's FileName parameter into it, so the pipeline parameter cascades into the dataset. Scrolling down, there is one more important property, the Field list: click New, open the dropdown, and look at the items. I will use the "exists" property, because I want to check whether the file is present or not; in other use cases you might iterate a list of items using "childItems", but here "exists" is what I need. Let me also show one more debugging feature of Data Factory: if I want to run only this Get Metadata activity, and not the copy data activity, I can set a breakpoint on it — the same idea as a debug breakpoint in SSIS — so the debug run goes only up to this point. I click Debug and pass a random file name that obviously doesn't exist, click OK, and keep refreshing the output. Each activity run has an input and an output; open the output and the first property is "exists": false, meaning the file name I provided doesn't exist in the target location. Now I debug once more and pass customer.txt, which is present in the target, and the output shows "exists": true because the file is in the system. But checking the metadata alone isn't enough: I have to apply a condition before copying. Right now the flow would copy regardless of whether exists is true or false, and I can't have that. So I disconnect the copy, add an If Condition activity, connect the Get Metadata activity to it, and cut and paste the copy activity inside the If Condition's "true" activities. Now I need to use the Get Metadata output inside the condition.
I need to use that output property to decide whether to execute the true or the false branch. Click the If Condition activity — I'll name it "If File Exists" — then go to Activities, and there is an expression box. Click the expression, choose Add dynamic content, and I am going to use the output of the other activity here: scroll down or just type "check", because my activity is named Check File Existence, and click it — that formulates the expression up to .output. Click Finish, and now let's complete it: look at the debug output again — under output there is a property called "exists" whose value was true — so I go back to the expression and simply append .exists after output. So I am using the output of the Check File Existence activity, and of its output properties I am using "exists". Because it already gives me a boolean true or false, I can use it directly — I don't need an equality comparison, since the expression itself evaluates to true or false. Click Finish. Now, can anyone spot the extra thing I did here? Why did I wrap the expression in not()? Because when the file is present, exists is true, and not(true) is false, so I do not copy; and when the file is not present, exists is false, and not(false) is true, which is when the copy data activity inside the true branch executes. I'll publish before executing the pipeline, and while it is still publishing I can already click Debug. I pass customer.txt — the customer file is present in the target location, so the If Condition should evaluate to false and the copy should not run. Let's see: done — Check File Existence returns exists = true, not(true) is false, so the If Condition was evaluated but the file was not copied. Now let's exercise the other branch: I delete customer.txt from the target and run the same pipeline again with the same parameter, customer.txt. This time a third activity appears — the copy file activity — because exists evaluated to false, not(false) is true, and the copy executed successfully. If I go to the target location and refresh, customer.txt is back. That is another use case within the storage account: you can play with this expression and, based on the outcome, decide how you want to orchestrate your data movement. That is what I wanted to cover on the storage account.
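Expressed as SDK models instead of drag-and-drop, the copy-if-not-exists pattern looks roughly like this. It is only a sketch: the model names come from azure-mgmt-datafactory, the dataset and activity names follow the demo, and the key piece is the @not(activity(...).output.exists) expression.

```python
# Sketch: Get Metadata (exists) -> If Condition -> copy only when the file is absent.
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    Expression,
    GetMetadataActivity,
    IfConditionActivity,
    ParameterSpecification,
    PipelineResource,
)

# Pass the pipeline parameter down into the parameterized datasets.
file_param = {"DataSetFileName": {"value": "@pipeline().parameters.FileName", "type": "Expression"}}

check = GetMetadataActivity(
    name="Check File Existence",
    dataset=DatasetReference(
        reference_name="ds_storage_target_file", type="DatasetReference", parameters=file_param
    ),
    field_list=["exists"],  # other options include childItems, size, lastModified
)

copy_if_absent = CopyActivity(
    name="Copy File",
    inputs=[DatasetReference(reference_name="ds_storage_source_dynamic_file",
                             type="DatasetReference", parameters=file_param)],
    outputs=[DatasetReference(reference_name="ds_storage_target_folder", type="DatasetReference")],
    source=BlobSource(),
    sink=BlobSink(),
)

guard = IfConditionActivity(
    name="If File Exists",
    depends_on=[ActivityDependency(activity="Check File Existence", dependency_conditions=["Succeeded"])],
    # exists == true  -> not(...) is false -> skip the copy
    # exists == false -> not(...) is true  -> run the copy
    expression=Expression(value="@not(activity('Check File Existence').output.exists)"),
    if_true_activities=[copy_if_absent],
)

pipeline = PipelineResource(
    activities=[check, guard],
    parameters={"FileName": ParameterSpecification(type="String")},
)
```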
One more important thing I haven't covered yet, which I promised to explain: if I go to the General tab of any activity — open any pipeline and pick the copy data activity, or any other activity — there are three properties: Timeout, Retry, and Retry interval. Here is what they mean. Say I know that a particular copy should not take more than five minutes, but the default timeout is seven days; if on an odd day the activity gets stuck for no reason, letting it hang for seven days is not good. So if I know it takes at most five minutes, and I want to be a little pessimistic, I set the timeout to ten minutes. The format is days.hours:minutes:seconds, so ten minutes is 0.00:10:00, and if you wanted six days you would write 6.00:00:00. Then there is Retry: it is currently zero, and if I set it to one, then when the copy data activity fails for some reason it will retry once more — after waiting for the retry interval, which here is 30 seconds. That is timeout, retry, and retry interval. Now I will solve a simple problem statement of copying data from a source database to a target database — we are going one level up in Data Factory — and later we will turn it into an incremental load. Let me switch over to my infrastructure: this is my SQL Server account, and under databases I have two, SourceDB and TargetDB. Assume SourceDB is maintained by the customer, and I retrieve data from it and load it into my TargetDB — that is what we are doing here. For simulation purposes I have simply kept SourceDB and TargetDB side by side on the same SQL server. I have full control over the target database, but on the source database I can only read — it is maintained by the customer, so I cannot create anything there. Let me show a few things in SourceDB and TargetDB before stating the problem. Data is being published into the source on a day-to-day basis by the customer, and I have to retrieve it incrementally into my target database. I click on SourceDB (and I'll open a second instance as well) and go to the Query editor. If you cannot connect from the Query editor, it means you have to add your client IP: click "Add client IP" and your IP gets added, and at the same time set "Allow Azure services and resources to access this server" to Yes — by default it may be No, and if you leave it that way, external services like Data Factory, Cosmos DB, Databricks, or Stream Analytics will not be able to talk to this SQL server. Click Save. The reason I added my IP is that my machine's public IP needs to be whitelisted so SQL knows the connection is coming from a trusted source. Back in SourceDB I open the Query editor, type my password — no error this time — click OK, and I am connected to the system.
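Circling back for a second to the Timeout, Retry, and Retry interval settings from the General tab: in code they map to the activity policy. A minimal sketch, assuming the ActivityPolicy model from azure-mgmt-datafactory; then we'll carry on with the SQL setup.

```python
# Sketch: fail after 10 minutes, retry once, wait 30 seconds before the retry.
from azure.mgmt.datafactory.models import ActivityPolicy

policy = ActivityPolicy(
    timeout="0.00:10:00",          # days.hours:minutes:seconds; "6.00:00:00" would be six days
    retry=1,                       # retry the activity once if it fails
    retry_interval_in_seconds=30,  # wait 30 seconds before that retry
)
# ...pass policy=policy when constructing the Copy (or any other) activity.
```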
Now I will do the same thing for TargetDB — connect through its Query editor — since I want to show the entire scenario: I type the password and connect. So, what is the customer providing? In SourceDB there is a table called orders; if I query it I can see a number of orders — 16 in total so far, with the customer having published data up to 2020-01-11. On the target side, if I query the orders table in TargetDB there are zero records — nothing has been loaded yet. So what are we solving? We are going to push data from the source database, maintained by the customer, and load it into the target. It would be very easy to drop in a copy data activity with orders as the source and orders as the destination and move the entire table every time, but that is not the goal — we want to do it incrementally. For that I have created another table called watermark. This table helps me maintain timestamps: when loading the data for the first time, I seed it with the table name, orders, and the earliest date in the system, 1900-01-01. Because the orders table has an insert timestamp column (a datetime), every order is guaranteed to have a date greater than 1900-01-01, so the first run copies everything. That is my initial configuration: table name orders, watermark value 1900-01-01, meaning the first load is a full copy. Now suppose I load the data from source to destination — what is the maximum timestamp in the source right now? It is 2020-01-11.
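The session doesn't show the exact DDL of that watermark table, so here is a hedged sketch of what it and its seed row might look like, with assumed column names (TableName, WatermarkValue) based on how they are used later; pyodbc is used purely for illustration.

```python
# Sketch: create and seed the watermark bookkeeping table in TargetDB.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=TargetDB;UID=<user>;PWD=<password>"
)
cur = conn.cursor()

cur.execute(
    """
    CREATE TABLE dbo.Watermark (
        TableName      NVARCHAR(128) NOT NULL,
        WatermarkValue DATETIME      NOT NULL
    );
    """
)
# Seed with the earliest possible date so the very first run copies everything.
cur.execute("INSERT INTO dbo.Watermark VALUES ('orders', '1900-01-01');")
conn.commit()
```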
Once I have loaded the data from source to destination successfully, the next thing I need to do is update the watermark table with that maximum timestamp. Say it is updated to 2020-01-11: if I run the same pipeline again, it must start not from 1900 but from 2020-01-11. Now you see it — subsequent runs fetch only the incremental data set, not the full data. That is the purpose of maintaining checkpoints in the watermark table. Let me explain it with a diagram as well. In Data Factory I create a pipeline, which I will call copy-orders-incrementally, and before building it I'll sketch the design: a lookup activity (I haven't covered lookups yet — I will), then a copy data activity, then a stored procedure activity. Before copying the data I need two pieces of information. First, the old watermark — the value sitting in the watermark table. If something is there, it means data has been loaded up to that point in time, so the next run has to start from there; that is why I call it the old watermark. Second, the new watermark — the maximum timestamp currently present in the source: the max of InsertDateTime from the source orders table, which at this moment is 2020-01-11. So I look up the old watermark and the new watermark, and the copy data activity's source query becomes: select * from orders where InsertDateTime is greater than the old watermark and less than or equal to the new watermark. Once the copy succeeds, I execute a stored procedure that updates the watermark table with the new value obtained from the lookup. So the first two things I need are two Lookup activities — one for the old watermark, one for the new watermark. I drag them in; scrolling down their General tabs, you now understand what to put for timeout, retry, and retry interval, so I am not modifying those. Go to Settings. To read the old watermark from the watermark table I obviously need a dataset, and before a dataset I need linked services, so I click Manage, then Linked services, then New. I am creating a SQL-based linked service — Azure SQL Database — click Continue, and name it ls_sql_source_db. This one is maintained by the customer, not me, so I will only have read access on it.
That is also why you cannot think about putting your watermark table in the source database — you only have read access there, so it lives in the target. Now let me fetch the server and the database; it asks for authentication, and I know my SQL authentication username and password, so I fill those in — this is an Azure SQL linked service, so I'm doing it that way. When you connect to an on-premises SQL Server, the prerequisite is likewise to have SQL authentication, because at the end of the day you are going to supply a username and password. I test the connection to be sure it will succeed — it does — and create it. Similarly I create another linked service pointing to the target database: again Azure SQL Database, Continue, and naming matters, so ls_sql_target_db; then select the subscription, server name, database name, SQL authentication username and password, test the connection, and create. Next I create a dataset: again Azure SQL Database, Continue, name it ds_sql_source_db_orders, select the linked service, choose the orders table, and click OK — that is one part done. Now back to the pipeline to wire things up. The old watermark lookup, as you rightly said, points to the watermark table: in its Settings I select the watermark-table dataset (you can find it with the filter) and then specify a query. What will the query be? I go to the target database and try it: select from the watermark table where table name = 'orders' — that returns what I need, so I copy it and paste it into the lookup's query box. A quick Preview confirms it: I get table name "orders" and the watermark value of 1900-01-01. For the new watermark lookup I point to the source orders dataset and again use a query to extract the maximum value: select max(InsertDateTime) from orders. I test it against the source database first and get a result; I also add an alias, because there is no column alias yet — I'll call it NewWatermarkValue. Once I've cross-checked it, I copy the query, go back to Data Factory, and paste it into the lookup, with a quick preview to confirm it is correct. Now if I run this, both values show up in the output, so I publish and click Debug. Both lookup activities succeed: if I open the old watermark output, under output, firstRow, I can see the table name and the watermark value; if I open the new watermark output, under firstRow I see NewWatermarkValue. So I now have a date range in hand — the from-date is the old watermark and the to-date is the new watermark. The next step is the copy data activity.
I'll configure it to copy incrementally. I name this activity Copy Orders and connect both lookups to it, which means the copy runs only when both succeed — and I wire them in parallel, because there is no point running the two lookups sequentially when they can execute at the same time. In the copy data activity I go to Source, select the source dataset ds_sql_source_db_orders, which we already created, and then formulate the query. Let's build it in the source database first: select * from orders where InsertDateTime is greater than one boundary and less than or equal to the other, with hard-coded dates for now. I should get all 16 entries — let's see: I got 15, so I adjust the upper bound to the 12th, and now I get all 16. I copy that query and paste it into the copy activity, but instead of the hard-coded dates I have to splice in the old watermark and the new watermark. I remove the hard-coded values, place the cursor between the single quotes, and click Add dynamic content. Before selecting anything, type the at sign and curly braces — @{ } — place the cursor inside them, and then click the old watermark lookup's value. You can see it builds activity(...).output.firstRow — and if you remember, the first row had a property called WatermarkValue, so that is what gets referenced. Similarly, for the other boundary I place @{ } again, put the cursor inside, and pick the new watermark lookup, whose output under firstRow has NewWatermarkValue. Just to recap what I've done: output.firstRow.WatermarkValue from the old lookup and output.firstRow.NewWatermarkValue from the new one — I am not memorizing anything, I am simply using what the editor shows. The source query is now formulated. Next, the sink: I need to select the target table, and we have not yet created a sink dataset, so click New, select Azure SQL Database, pick the linked service for TargetDB, choose the orders table, and click OK. Now go to Mapping and import the schemas — it asks for values for the query parameters (the old and new watermark lookups), so I give random values just for the import, click OK, and the schemas are imported. Scroll down: there is one more activity, the Stored Procedure activity, which simply executes a stored procedure. Our intention is to update the watermark value, so I add it, go to its Settings, and choose the linked service — for a stored procedure activity you don't need a dataset, just a linked service. I know one of the stored procedures present in the target database sets the watermark, so I select it.
What I will do now is show you this particular stored procedure, sp set watermark; it is a very simple stored procedure, let me expand it. Hey guys, if you're looking for end-to-end training in Azure Data Factory, Intellipaat provides an Azure Data Factory certification training for DP-200 and DP-201, and you can check out those details in the description box below. Now let's continue this session. You can see here create procedure with the procedure name; it takes two parameters, which are the table name and the new time. Once they are provided, I simply run an update: update the watermark table, set the watermark value equal to the incoming new time parameter where the table name equals the incoming table name. It is a simple stored procedure.

Now I go back and click on import to import the parameters of this SP, and you see both parameters came in. What is my new time? The new time is going to be the new watermark value, because the old watermark was already loaded earlier, and since the copy data activity has now copied the orders from old to new, I have to update the watermark value with the new one so that next time I execute, the starting point is whatever is stored in the watermark table. So I add dynamic content, say new watermark, click on it, then firstRow and the new watermark value, and click on finish. At the same time, the table name property is also present in the old watermark output; if you see here, it is the table name property, so instead of hard coding it I will pull it directly from that output. I go back, select the value, add dynamic content and filter for the old watermark lookup, firstRow and the table name, then click on finish. My pipeline is built; I will publish and run the pipeline.

Just to be sure: at this point in time we have no orders in the target, and the watermark table has a value of 1900. Now I run the pipeline by clicking on debug. Just a quick info, guys: test your knowledge of Azure by answering this question. Which of the following web applications can be deployed with Azure? A) ASP.NET, B) PHP, C) WCF, D) all of the above. Comment your answer in the comment section below and subscribe to Intellipaat to know the right answer. Now let's continue with the session. The pipeline is executing; if you see this queued status, what it is doing behind the scenes is trying to acquire resources from the integration runtime, that's what it is busy with. It has succeeded, the copy data has also succeeded, and the watermark update is also succeeding. Now let's see: the old watermark is 1900, which is expected. What is my new watermark? The maximum value of orders from the source database, which is the 11th of January 2020. We had around 16 records in total, so if I click on this details button I can see that rows read was 16 and rows written is also 16.
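The stored procedure shown in the video is essentially the following (a minimal sketch; the exact procedure, table and parameter names such as sp_set_watermark, @TableName and @NewTime are assumed from the narration):

    CREATE PROCEDURE sp_set_watermark
        @TableName VARCHAR(50),
        @NewTime   DATETIME
    AS
    BEGIN
        -- move the watermark forward for the given table
        UPDATE watermark
        SET    WatermarkValue = @NewTime
        WHERE  TableName = @TableName;
    END

    -- parameter values supplied in the stored procedure activity (assumed activity names):
    -- @NewTime   -> @{activity('New Watermark').output.firstRow.NewWatermarkValue}
    -- @TableName -> @{activity('Old Watermark').output.firstRow.TableName}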
It also gives you a picture of how much data was read and how much data was written. Then, what am I updating? If you see the input parameters, I am passing the new time value, which is 2020-01-11, and the table name orders, so the query becomes: update the watermark table, set the watermark value equal to this new time where the table name equals orders. That is what is happening in the stored procedure. Now if I query the watermark table, the new watermark value is 2020-01-11, it has been updated, and I should see 16 records in total in the target, and I am able to see 16.

Now, the incremental problem statement. Yes, for the first time I was able to load the data set fully from source to destination. But what if the customer inserts a new entry? So I will do an insert into orders, following the same column sequence, with order 17 and a later insert date time. I run this, one row is inserted, and if I query the orders table in the source I can see 17 records, the latest record being this new one. Now, without changing anything, I execute the pipeline again; you should be able to see that only one record is copied, and that will prove the incremental data problem statement.

Let's see: the old watermark is the old value, 2020-01-11, which was there in the watermark table, and the new watermark should be the 12th, right? Yes, it is 2020-01-12. And in the copy data activity, yes, one row was read and one row was written, which proves that it is solving the incremental problem statement here, and then it updates the new time to 12 January 2020. Now if I query the orders table in the target I should be able to see 16, sorry, 17 records in total, and I am able to see that.
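To reproduce the incremental test at the end, the statements look roughly like this (the orders column list and the sample values are assumptions, since the exact schema is never shown in the captions):

    -- insert one new row into the source so that only it falls after the old watermark
    INSERT INTO orders (OrderId, Quantity, ProductId, CustomerId, InsertDateTime)
    VALUES (17, 15, 'P4', 'C5', '2020-01-12');

    -- after re-running the pipeline, verify on the target that it caught up
    SELECT COUNT(*) FROM orders;                 -- expect 17 rows
    SELECT WatermarkValue FROM watermark
    WHERE  TableName = 'orders';                 -- expect 2020-01-12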
Info
Channel: Intellipaat
Views: 17,911
Rating: 4.8863635 out of 5
Keywords: Azure Data Factory, Azure Data Factory Tutorial For Beginners, introduction to azure data factory, azure data factory tutorial, azure data factory overview, azure data factory training, microsoft azure data factory course, azure data factory certification, data factory, azure data factory pipeline, azure training, data factory azure, data factory for beginners, what is azure data factory, azure data factory course, Intellipaat
Id: iSpVfWo4CUI
Length: 144min 55sec (8695 seconds)
Published: Wed Jun 09 2021