Azure Data Lake Tutorial | Azure Data Lake Training | Azure Data Lake Architecture | Intellipaat

Captions
Azure Data Lake Storage is the first thing we will cover, and it is very important that you understand it, because in most business scenarios the data is first landed in ADLS Gen2 and the read and write operations happen from there. Even if you have on-premises batch files, we move those batch files in as blobs, or more usually into ADLS Gen2, and from there we start writing the functions, whether we want to put the data into a SQL Data Warehouse or into Cosmos DB. Whether you use Data Factory or you use Databricks, both do the same thing. Even if you are reading data from a Kafka stream, you will first write it as a Delta table, which again sits on Azure Data Lake Storage. So it is important that you understand this concept.

So what is Azure Data Lake Storage? It is a highly scalable, distributed, parallel file system in the cloud, specially designed to work with multiple analytics frameworks. ADLS can take in all kinds of data: on-premises databases, which are structured; data coming from the web, or video and images, which are completely unstructured; data from NoSQL databases, which is semi-structured; or data from sensors, all without the need to bring everything onto a common platform first. You can store an image in ADLS Gen2, you can store batch files in ADLS Gen2, you can land IoT data in ADLS Gen2. So you see the difference: you are not required to convert NoSQL data into a SQL source just so that it is easy for you to query or load. Any kind of data can be stored in the Azure Data Lake Store without any kind of conversion up front.

On top of that storage we can run Azure Data Lake Analytics, HDInsight, R and Spark; we can run machine learning experiments and develop machine learning algorithms on this data. So there are two parts to Azure Data Lake: one is storage and one is analytics. For analytics we use HDInsight or Azure Data Lake Analytics, and for storage there is Data Lake Storage. Keep in mind that Azure Data Factory and Azure Databricks are tools that help you read and write data into it.

As I was saying, Azure Data Lake will take any kind of data; it supports structured, semi-structured and unstructured data, and the best part is you don't need to bring anything onto a common platform. Next is size: there is no limit on file size and no limit on an account; you can have multiple files of different sizes or a single very large file.

Next, how does it work? For this fast access it uses the WebHDFS interface, and it supports parallel reads and writes, which is why analytical workloads on it run very fast. If a file can be split, it is divided into 2 GB chunks called extents, and these extents are replicated; the replication is for high availability and reliability. Vertices come into the picture when we use U-SQL; they are created on these extents, and I'll explain in detail what vertices and extents are when we actually create something with U-SQL.
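Since the section above describes ADLS Gen2 as the landing zone for batch files, here is a minimal sketch of that landing step using the azure-storage-file-datalake SDK. This is illustrative only; the account name, key, container and file names are hypothetical placeholders, not values from the session.

```python
# A minimal sketch (not from the video) of landing a local batch file in ADLS Gen2
# using the azure-storage-file-datalake SDK. Account, key, container and path are
# hypothetical placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_NAME = "iptrainingadls"        # hypothetical storage account
ACCOUNT_KEY = "<storage-account-key>"  # never hard-code real keys

service = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net",
    credential=ACCOUNT_KEY,
)

# A "container" in the portal is a file system at the SDK level.
fs = service.get_file_system_client("test")

# Create a folder and upload the CSV into it.
directory = fs.get_directory_client("sales")
directory.create_directory()
file_client = directory.get_file_client("sales_records.csv")

with open("sales_records.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```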
So what is WebHDFS? Basically it is the web interface to the Hadoop file system. It uses HTTP REST APIs to get data or to load data; it will use HTTP GET, POST and APPEND operations to read, write or update the data. If you want to understand it fully, it is a very big topic. I can give you a small overview like I did today, but it will take some time to really understand it, so I will make a note and get the details for you. It is a complete system of its own: you have the Hadoop Distributed File System, and WebHDFS is how you access it through a web application or web interface, so it is very fast and very easy to work with. That is the file system ADLS is based on.

The next thing: ADLS Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. They took the important features of Blob storage and used them in Azure Data Lake Storage; there are differences, but it is similar to a large extent. It is the result of converging the capabilities of two existing services, Azure Blob storage and Azure Data Lake Storage Gen1: the features from Data Lake Storage Gen1 are combined with the low-cost tiered storage, high availability and disaster recovery of Azure Blob storage. There are certain differences: ADLS is more optimized for analytics, while Blob is for bulk storage. Operations on ADLS are more expensive compared to Blob, but since ADLS is based on WebHDFS it is fully usable for analytics and big data work, whereas Blob does not support HDFS. By understanding these features you will be able to understand why we need to move to a data lake.

The main objective of building a data lake is to offer a unified view of data to data scientists. Just imagine: you can analyze a picture and get details from it, at the same time get details from a related video, and compare them with an on-premises file system. Think how easy that makes life for someone who wants to access data across these environments. With the increase in data volume and metadata, the quantity of analysis also increases, and the data lake offers agility: as we saw, it is very agile, you can take in anything new very easily compared to other storage. Machine learning and artificial intelligence can be used here to make profitable predictions; Azure also gives you Machine Learning Studio, and the experiments you create there you can easily run on your ADLS data. It also gives a 360-degree view of your customer and makes analysis more robust.

Now we need to understand the data lake architecture as well. In this picture there is something called sources, showing real-time ingestion, micro-batch ingestion and batch ingestion; then we have an ingestion tier, the HDFS storage, the insights tier, and the actions. Let's understand what each tier is. The ingestion tier, the first tier on the left side, depicts the data sources.
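Going back to the WebHDFS point above, the REST calls it refers to look roughly like this. This sketch targets the WebHDFS-compatible endpoint that ADLS Gen1 exposed; the account name, bearer token and file paths are hypothetical placeholders, and the Gen2 service has its own (different) REST surface.

```python
# Rough sketch of WebHDFS-style REST calls against the ADLS Gen1 WebHDFS-compatible
# endpoint. Account, OAuth token and paths are hypothetical; illustrative only.
import requests

ACCOUNT = "mydatalakegen1"                      # hypothetical ADLS Gen1 account
TOKEN = "<azure-ad-bearer-token>"               # obtained separately via Azure AD
BASE = f"https://{ACCOUNT}.azuredatalakestore.net/webhdfs/v1"
headers = {"Authorization": f"Bearer {TOKEN}"}

# HTTP GET -> open (read) a file
resp = requests.get(f"{BASE}/landing/sales.csv", params={"op": "OPEN"}, headers=headers)
print(resp.status_code)

# HTTP PUT -> create a file; HTTP POST -> append to it (standard WebHDFS operations)
requests.put(f"{BASE}/landing/new.csv", params={"op": "CREATE"},
             headers=headers, data=b"Region,Country\n")
requests.post(f"{BASE}/landing/new.csv", params={"op": "APPEND"},
              headers=headers, data=b"Asia,India\n")
```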
The data can be loaded into the data lake in batches or in real time. What does that mean? You can bring in batch-processed data, or real-time streaming can also be done, and even that data can be stored in your ADLS. When I say ADLS I mean ADLS Gen2, because now by default you get the option of ADLS Gen2 only and most people will be using that; only already-developed systems are still on ADLS Gen1.

Next is the insights tier. The tiers on the right represent the research side, where insights from the system are used. SQL queries, NoSQL queries or even Excel can be used for data analysis. So you get the data, you do all the operations, and from there you can use any of these: you can query through SQL, you can load it into a NoSQL database, you can do MapReduce, or you can simply use a query interface; the decision depends completely on your business requirement.

HDFS is the cost-effective solution for both structured and unstructured data; it is the landing zone for all data that is at rest in the system. In the HDFS storage all your data is brought in and stored. There is no transactional data there; everything present is at rest, and this data can also be encrypted.

Next we see the distillation layer, which is the fourth layer. What it does is take the data from the storage tier and convert it into structured data for easier analysis. Suppose unstructured data is coming in alongside your structured data; in that case it is organized in such a way that it becomes structured for downstream use.

Student: So we will have some control, and certain things will be done built-in? Just a last question: when you say convert, you said it would be done internally, so on what basis does it convert?

It depends on what kind of data you are working with. If you are already working on structured data it will not convert anything. And converting into structured data does not mean it converts something from a NoSQL database completely into a SQL database; it won't do that. Instead it puts the data in a structured way so that it is easily accessible, rather than leaving it with very little or no structure at all. If you are taking in a picture or a video, it will fetch the necessary information and keep it ready; you don't need everything from a picture, and there are tools to read the picture or the video and do the analysis. So it just puts things in a structured way; it does not convert from NoSQL to a SQL database.

Student: So basically Blob will store the unstructured data, and this Data Lake store will also take unstructured data but convert it into a structured format using some analysis tool, and then store it in that structured format?

Okay, so Blob is your binary one, Binary Large OBjects, that is Blob. It is used for storing large files; it is not based on WebHDFS, it is simple storage. It does not have any features of HDFS; as I showed you in that comparison, it is not HDFS-compatible, so you cannot do your analytics work directly on it.
You cannot directly connect Blob to HDInsight, you cannot directly connect it to MapReduce, you cannot do MapReduce on it. ADLS Gen2 is also storage, but it is a completely different kind of storage: it supports blobs as well as your other data, it is based on WebHDFS, it is fully compatible with the Hadoop file system, you can do all kinds of analytical work on it, and it is much, much faster when you compare it to Blob. Blob is basically used just to store data, the kind of data which you will not access that frequently; it is more of a cold storage. ADLS is not like that: here you define whether it is a hot or a cold tier, and based on that you can get the data and work on it.

Student: So you are saying ADLS storage is suitable for analytics purposes, right?

Yes.

Student: But in the architecture diagram you are saying the data is getting analyzed here itself and then getting stored?

See, "analyzed" in the sense that when we store it, certain things are defined on it. Suppose you run an HTTP GET to fetch some data; the storage has to structure the data in a way that the GET operation works. These things happen internally. When we create an ADLS you will get more details today.

So this topic is almost done. Last is the unified operations tier: it governs system management and monitoring, so that is where you get the monitoring details.

Now that we are clear on ADLS Gen2, let's do the working sessions. There were two exercises: one was to create the Cosmos DB account, which we did, and in that we also saw how to change the regions; then I showed you how to create a container, and inside a container there are items, and these items contain your documents. We created a Cosmos DB with the SQL API. That was a simple exercise. Since all of you are very interested, we'll start by creating your ADLS itself, then we'll create an Azure Data Factory, and I'll try to show you a Copy Data activity where we load data from ADLS into Cosmos DB.

So let us go; you can all log in to your systems. One thing I have done, just to show you how we can get data from many different sources, is that I went to Amazon AWS and created an S3 storage. In S3 there is something called buckets; a bucket is like a storage account, and in one bucket I have uploaded a CSV. I just downloaded this file from the internet; it contains some random sales data, and I can show you the link from where I got it. You can go ahead and create a free account in AWS if you want to try the S3 storage; I'll wait while you create the free account, and in the meantime I'll go and create our Data Factory.

Student: Can we open it in Microsoft Edge?

Sure, let me show this to you. The reason I want to show it is that when we get to the Cosmos DB part, which I'll do in this session, I want you all to get an idea of what exactly it is so that you can implement it, because you all have the environment there.
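As a small aside, pulling the sample CSV out of the S3 bucket can also be done programmatically before landing it in the lake. This is a hedged sketch with boto3; the bucket name and object key are hypothetical, not the trainer's actual bucket.

```python
# A small sketch (assumed names, not the trainer's actual bucket) of pulling the
# sample CSV out of an S3 bucket with boto3 so it can then be landed in ADLS Gen2.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="<aws-access-key>",
    aws_secret_access_key="<aws-secret-key>",
)

# Bucket and object key are hypothetical placeholders.
s3.download_file("ip-training-bucket", "sales_records.csv", "sales_records.csv")

# From here the file can be uploaded to ADLS Gen2 exactly as in the earlier
# DataLakeServiceClient sketch, or copied directly by a Data Factory pipeline
# that uses an Amazon S3 linked service as the source.
```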
Once you start understanding ADF, you will also understand why we are taking a CSV file in Data Lake storage.

Student: There is an option to upload also, right? So from there we can move it into Cosmos DB?

Yes, I'll show you now. You have a lot of capability in Data Factory; this is just one small activity called Copy Data. I have only one resource group, so I am selecting it here. Now, the location: try to choose a location near to you, because the farther the location you choose, the more you will be charged to create storage or to read and write data in that region. One more thing: location also matters for data residency. You know there is certain data which some countries will not allow you to move out of the country; you can access it only in that region. So you can keep a location and restrict data to it. Suppose you have created a resource group and you restrict it to East US; then your data does not go out of the East US region, it is stored in East US, and you can even restrict access to people in the US.

For now I am not enabling Git. We can enable Git here, but once we create the Data Factory I will show you what options we have to configure the repository. Your Data Factory gives you various kinds of activities. Activities are where we do certain kinds of transformation. Copy Data is one such activity where you define two separate sources and copy the data from one to the other; very limited transformation is allowed in a Copy Data. Then there is something called a Data Flow: it gives you a GUI to do all kinds of ETL transformations, such as filters and derived columns. What it does is spin up a cluster and run the code in Spark, but that happens in the backend; the user does not need to code in Scala, he can do it through the GUI. A pipeline is where we combine these different activities into a workflow: you want activity A to run, once it finishes you want activity B to run, and if it fails you want activity C to run. When these kinds of workflows are created, that is called a pipeline. Next we have "create pipeline from template": Azure has some predefined templates, for example for CDC, or for loading data from database A to database B; you can take those templates, configure them according to your needs, and go ahead with them. Next is "configure SSIS integration runtime": we all know SSIS is a Microsoft product, and through this you can lift and shift your SSIS code; you don't need to change anything, you just build your packages, deploy them in ADF and run them in the cloud.

I asked you not to connect to GitHub earlier because this "set up code repository" option will give you that choice. There are two kinds of repository we can use: one is GitHub and the other is Azure DevOps Git, which is the repository provided by Azure. You can put your code in either of the two.
The preferable one is GitHub.

So this is the starting page. If you want to go directly to an activity you can create it from here: if I click Copy Data here, it will work, and if I click Create Data Flow it takes me to a page where I can create a data flow. Suppose I don't want to go via this page; I'll just discard everything, discard changes. The next thing is the authoring canvas. I was telling you about pipelines, so let's create a pipeline. Whenever we create a pipeline, you see all these different activities which you can use in Azure; it is not only ETL transformation, you can do all of these things, and multiple things together as well. Say you want a Copy Data and then a Data Flow activity; you can even call notebooks written in Databricks and run them from your Data Factory. All of this can be done.

Now we'll learn how to create a Copy Data. Let me go here and discard. We click Copy Data, and we have to give a name to this Copy Data pipeline, so let's call it CopyDataTraining. If you want, you can give a task description, and we can also schedule the task; I am just showing you the options for now. In the Move and Transform section we have the same Copy Data option: the same thing opens here, we can drag and drop it onto the canvas and configure the source and the read and write settings, whatever we want; the same wizard opens over here as well. Let me close these and show you something: you see, everything is written as JSON code, and as we add the different details this code gets updated. When you open GitHub you will see the code at this level; when we were saying it is JSON-compatible, this is what we meant, all the Data Factory definitions are written this way.
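To make the "everything is JSON behind the UI" remark concrete, this is roughly the shape of the definition the Copy Data wizard produces. The property names follow the general ADF pipeline schema, but the pipeline, dataset and activity names here are hypothetical and not the exact ones the wizard would emit.

```python
# Rough illustration of the kind of JSON definition the Copy Data wizard generates
# behind the UI. Property names follow the general ADF pipeline schema; the
# pipeline, dataset and activity names are hypothetical.
import json

pipeline = {
    "name": "CopyDataTraining",
    "properties": {
        "activities": [
            {
                "name": "CopyFromAdlsToCosmos",
                "type": "Copy",
                "inputs": [{"referenceName": "AdlsDelimitedTextDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "CosmosSqlApiDataset",
                             "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "CosmosDbSqlApiSink", "writeBehavior": "insert"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```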
For now we will create it from the wizard itself. Copy Data: as I was saying about scheduling, it provides us two kinds of schedules; for scheduling we need to create a trigger, and that can be a schedule trigger or a tumbling window. We'll go into detail on this later; for now we'll just run it once, which means this pipeline will trigger only once. Then you go to Next.

Now here we set up the connection. Do you want to see from how many different sources we can actually get data into Azure Data Factory? There are this many; you can connect to any of these sources. You have your Azure connectors, where you can connect to a Blob storage, or to a Cosmos DB with the MongoDB API or the SQL API, to Azure Data Explorer, to Data Lake Gen1 or Gen2, or to MariaDB, MySQL, PostgreSQL, any of these. We just have to give certain parameters and a linked service gets created; that is how connectivity is set up to the respective databases.

I'll create an ADLS connection, Azure Data Lake Storage Gen2, and continue. If you see, I have actually not created an ADLS account yet, so it is not there: it shows my subscription but there are no accounts. So the first thing I have to do is go over here and create a storage account. Let us create one with the hierarchical namespace. I'm just giving any name, something like "iptraining"; let me keep it this way. I'll go to Next, then Advanced, and here I enable the hierarchical namespace, then Next, Review and Create; validation passed, and let us create it. So I created a new storage account.

Student: Just check where you clicked that option of hierarchical namespace, that is Gen2, yes?

Thank you, let me check. Yes, this is it. And then for storage we have to go to the containers, right? Yes, we go to the containers. We create a container and give it a name; a container is just like a folder, consider it that way.

Student: No, my question is: when you create a storage account with the Gen2 feature enabled and then you go into the containers and create storage there, that will be considered ADLS Gen2, right?

That is within the storage account, so it should be Gen2, I believe, because it is created within the storage account which is Gen2; the container is just like an internal folder within it. Inside there is a folder called a container, which I have named "test"; it is actually a folder, you can add more folders inside it, and you just go to Upload and browse for the file. I will show the file.

Student: My question is how to create Gen2; will ADLS Gen2 specifically appear or not?

Okay, just give me some time. What I told you is that there was supposed to be an update by which, by default, it takes Gen2, so it may not be displayed separately; give me five minutes to check on that. Meanwhile you can download the file and upload it over here.

Student: So you are not going with Gen2, you are going with Blob?

Consider this as Blob for now; I will just confirm whether this is a Blob itself or it is Gen2. I'll also show you how to create the linked service, so meanwhile I'll look it up. You can go ahead and create a new connection: search for Data Lake Gen2, continue, pick from subscription, and you see both of them appear in the storage account name list. You just select the storage account, and if you do a test connection it should connect.

As for the latest behavior: yes, you remember we selected StorageV2 somewhere; when we select that, by default it is Azure Data Lake Storage Gen2. That is what I am reading right now in the updates, but I will go and cross-check and see if I can show you how a previous version used to differentiate it. As of now, wherever I check in the updates, it says that by default it creates a Gen2 account, which has the capabilities of both your Blob storage and your Data Lake storage. See, if we were creating only a Blob storage you would never get that access tier option at that point, because plain Blob creation does not offer the hot access tier choice.

Student: So after checking: when we select Gen2 and then create storage and containers, it will by default be created as Gen2, right?

Yes, that's what I'm trying to say. Now, this is just the name, "AzureDataLakeStorage1"; if you see here, I did not change the name, and it is not related to the storage account, you can give any name you want. And someone was asking if you can change the configuration here: you see, the name cannot be changed; you need to create a new connection if you want to use a new name.
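The same "StorageV2 plus hierarchical namespace equals ADLS Gen2" idea can be expressed programmatically. This is a hedged sketch with azure-mgmt-storage; the subscription, resource group and account names are hypothetical placeholders.

```python
# Sketch of creating a Gen2 account programmatically: a StorageV2 account with the
# hierarchical namespace enabled. Subscription, resource group and account name
# are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    resource_group_name="intellipaat-training-rg",
    account_name="iptrainingadls",
    parameters=StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",                 # general-purpose v2 account
        location="eastus",
        is_hns_enabled=True,              # hierarchical namespace -> ADLS Gen2
    ),
)
account = poller.result()
print(account.name, account.is_hns_enabled)
```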
What we are doing here is basically a connection setup, like in ODBC, where through ODBC drivers you set up a connection and connect to certain servers; similarly, this setup lets the Data Factory connect to your Azure Data Lake Storage. I went ahead and uploaded those records as well, so when you go Next you can choose the file or folder: you see the "test" container, and you choose the file.

Now, do you want to treat this as a binary copy? If it is a binary file and you tick this, you can directly copy compressed files of these configurations: if you have a gzip file you don't need to explicitly unzip it, and deflate or ZipDeflate compression types are also supported. Mine is not binary, so I am not selecting it.

Next, copying recursively: if this folder had 10 files and I wanted all 10 copied, I could tick the "recursively" option; currently there is a single file, so I am not selecting it. Then there are concurrent connections, how many sessions you want. You can define this explicitly, but since we are using the Azure integration runtime it will internally calculate a value and try to match the closest thing to what you define, so the number you give is not necessarily what you get; it goes with the optimal configuration.

Student: I have a question. Recursively, is that like if you want it to run on a daily basis?

No, no: "process all files in the input folder and its subfolders recursively". If I have 10 files here and one more folder with another 10 files, and I want all the files to be copied, I use this option; you don't need a separate Copy Data for separate files.

Student: But then they need to be in the same format?

Yes, in that case all the files have to be in CSV or whatever format you define. I will show you next: since I have not selected binary, I get something like this. First it tries to detect the format on its own, and it has detected that it is a text format, comma-separated, this is the row delimiter, and whether the first row is a header it has also worked out on its own; I have not set anything. The escape character it has found, the quote character it has found, and the encoding as well. Remember, when you are getting an on-premises file, encoding sometimes plays a very important role, because there may be characters which are supported only in some encodings. It takes the default, and there is no compression.

Now let me open the file and show you how it looks. This is your sales records file. Basically it is comma-separated, you have Region, Country and so on, and the first line is the header. All of this it detected on its own; I did not need to define anything. Suppose you do want to define it: you have all these formats in which data can be read, such as Text, Avro, JSON, ORC and Parquet. Yesterday someone was asking me whether a particular file is readable or not; all these different kinds of files are readable here.
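What the wizard's format detection does corresponds roughly to sniffing the delimiter and header yourself. This is a small sketch with Python's standard csv module; the file name is a placeholder for the sales CSV used in the demo.

```python
# Roughly what the wizard's format detection does: sniff the delimiter and header
# of a delimited file, then read it. The file name is a placeholder.
import csv

with open("sales_records.csv", newline="", encoding="utf-8") as f:
    sample = f.read(4096)
    dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")  # column delimiter
    has_header = csv.Sniffer().has_header(sample)              # first row as header?
    f.seek(0)

    reader = csv.reader(f, dialect)
    rows = list(reader)

# Default column names when there is no header, similar to the property_0,
# property_1 style naming the wizard applies.
header = rows[0] if has_header else [f"property_{i}" for i in range(len(rows[0]))]
print("Delimiter:", repr(dialect.delimiter))
print("Header:", header)
print("First data row:", rows[1] if has_header else rows[0])
```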
A CSV exported from Excel is a comma-separated text file, so that is how this one has been taken. Now, your column delimiter can also be changed: you have colon, semicolon, pipe and tab, and there is also a "no delimiter" option. Initially, see, this "Start of Heading" option was not there; that is basically the \x01 (start of heading) character, which you mostly get in files created on UNIX systems. Now it shows up, and the "no delimiter" option has also appeared; these options were not there even three or four months back.

Can you see now? Since I have changed the format to "no delimiter", look at how the data is read: everything is read as a single record. Again, suppose I put a semicolon: since we know there is no semicolon in the file, it again puts everything into a single record per row, because you still have a row delimiter. So just remember, if your format and the definition are correct, you should be able to see your data properly. If you click the Edit option you can give your own column delimiter, and similarly we have the row delimiter as well.

Now suppose you want to skip lines, say the first line or the first 10 lines, because you don't want to read them for some reason; you can define that here. Or take the "first row as header" option: if I uncheck it, see what happens. Every time I change something the preview changes; the first record now shows Region, Country, Item and so on as data, and the columns have automatically been renamed property 0, 1, 2. So if you do not have a header, it automatically defines names like that. Similarly, there are advanced options where you provide the escape character, quote character and so on; the defaults have come up. If I click Next, you see the Azure Data Lake source dataset got created.

Now we come to the destination. Do I want to create it here itself? No; what I want is to load it into a SQL-API Cosmos DB, which I'll create now, because I deleted everything yesterday. We go over here to Cosmos DB, select the same resource group, and give the account name as cosmosdbtraining. Here I am selecting Core (SQL); you have all the other API options as well. I am not enabling the geo-redundancy or the multi-region writes. Let me set the region; is East Asia available? Yes, East Asia. Next is networking, which I am keeping as "all networks", and we create it. So my Cosmos DB is being created.

I want to say one more thing: features keep getting updated across all the resources, they keep adding new ones, and since it is a UI, things may look a little different over time. Currently I am working on Databricks and Power BI, and I last worked on this about six months back, so you can see how much has changed in these six months. Whenever they add a preview feature or a new feature, they always make it backward compatible, so if you created something six months back and they have changed some things, after an update you might find that some new features have been added.
As I was saying, in the Data Lake dataset that particular delimiter option was not there six months back, but now I can see that it is. These kinds of changes keep happening; they may keep adding things, or while something is in preview they may add or remove features.

So I have created the Cosmos DB account, I have created a container, and I have created an item in it. If I go back to the Data Factory, I can pick the subscription, which loads the account name and the database name. Suppose you want to enter it manually, say someone else has created a Cosmos DB and you want to connect to that; then there is something called the Cosmos DB account URI. If you go to your Cosmos DB overview you see this URI; just copy it and paste it over here. Next, we authenticate through your Cosmos DB access keys. Where do we find these? Under Keys; you can copy either the primary or the secondary key.

Student: Is that something like a service account? Normally in an RDBMS, say when we are deploying to production, we'll have a service account instead of an individual user. Is it something like that?

You can configure that, but this is not that. Here, what happens is that I have to actually share the primary key with you, and in case I change the key or rotate it, which is considered good practice from a security standpoint, I have to explicitly share the new key with you again. So this is one way of doing it. The thing you mentioned is a service principal. For that, you see something here called Azure Key Vault. I give the database name over here; just a second, I created "training" as the database already. So there is something called Azure Key Vault: if you search, there is a Key Vault service provided, and what you can do is go and create something like the username and password you mentioned; those are called secrets. You create a secret, give it a value, that stays constant, and then you always connect through the Key Vault. That is the best way to connect, because the security there is of the highest standard. When you productionize any kind of job you need to go via these things; you cannot just hard-code the way I gave the URL here, or pick it up directly from the subscription level. We should not do it this way in production. You go and create a Key Vault; even the connection string I can put in a Key Vault. Key Vault itself deserves a session of its own, so when we have that session I can show you how we create a Key Vault, how we create different kinds of secrets, and how we upload a certificate if one is in the picture. So it is something like what you said: it is not a user ID and a password which you have to type in every time; an automatic setup is done, and whenever the job runs it connects to the Key Vault and retrieves the information from there.

So I do a test connection over here, and I go ahead and create it. When I click Next, this is my destination and that is my source. I can just skip mapping for all the tables.
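As a minimal sketch of the Key Vault idea described above, this is how the Cosmos DB key could be fetched as a secret at run time instead of being hard-coded or shared by hand. The vault URL and secret name are hypothetical placeholders.

```python
# Minimal sketch of the Key Vault idea: fetch the Cosmos DB key as a secret at
# run time instead of hard-coding or sharing it. Vault URL and secret name are
# hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()          # e.g. a service principal or managed identity
client = SecretClient(
    vault_url="https://ip-training-kv.vault.azure.net",
    credential=credential,
)

cosmos_key = client.get_secret("cosmosdb-primary-key").value
# cosmos_key can now be passed to the Cosmos client without ever appearing in code
# or in the Data Factory linked service definition.
```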
Or else, if I open this, you can see the mapping there; let me skip it for the first time and click Next. The write behavior is insert; there are two kinds of behavior here, insert and upsert, and there is no update. All these things we can define, the write batch timeout, the batch size, the concurrency, but for now we will just take the defaults and not set any of them.

If you want, you can set fault tolerance: "abort activity on first incompatible row" is one option, or you can skip the rows which are not compatible, or you can skip and log them; if you choose skip and log, you have to additionally specify where it will log, so you have to create a storage account and give those details. If you want to do staging, say you want to load into a staging table first, those options are here as well. Then you can enable compression, and there is the data integration unit: you are charged as the number of DIUs multiplied by the copy duration multiplied by 0.25 per DIU-hour; that is how you are charged in Data Factory. For now I have kept it as Auto. I could go ahead and set it myself, but remember, since you are running on the Azure integration runtime, even if you give some number it calculates the most effective number of units on its own and assigns that, trying to keep it as close as possible to the number you defined. I will just remove the staging.

So you see here: your task name is the copy data name, your source is the Data Lake, your dataset name is this, your destination is this, and these are the timeout settings; all the details are here. Now it is deploying: it created the datasets, it created the pipeline, and it is running the pipeline. If we go to the Monitor, it shows how the run is going. You see it succeeded; if I click over here, the data read is 1245 KB, the data written is 49 KB, and it has read and written 100 rows each; basically this is your log. The number of peak connections used is one, the throughput is shown here, and you have all the log details: it used one parallel copy and four DIUs.

Now let's go to our Cosmos DB and do a refresh. What do we see here? You see key-value pairs, and here you see the data: this is one record which I have selected, this is the second, and in this way there are 100 records. So you have actually taken structured data and, just with the Copy Data activity, loaded it into a NoSQL database.

Now let me show you some more interesting things: what all got created when we built this Copy Data from the front end. Let me refresh. One pipeline got created, which has an activity called Copy Data. When I click here you see the parameters of the pipeline; we can define parameters here and use variables here, and I'll show you that when we have some other requirements where parameters are useful. Then we go ahead and click the Copy Data activity: everything we defined, the source, the sink, the mapping, can be redefined over here, and if I open the mapping I can add dynamic content as well.
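The same check we did in the portal, refreshing Cosmos DB and counting the copied documents, can be sketched with the azure-cosmos SDK. The endpoint, key and container name are placeholders; the database name "training" follows what was mentioned in the demo.

```python
# Sketch of checking the copied documents with the azure-cosmos SDK, similar to
# refreshing the Data Explorer in the portal. Endpoint, key and container name are
# placeholders; the database name follows the demo.
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://cosmosdbtraining.documents.azure.com:443/",
    credential="<primary-key>",        # better: fetch this from Key Vault as above
)

database = client.get_database_client("training")
container = database.get_container_client("items")   # hypothetical container name

# Count the documents the copy activity wrote.
count = list(container.query_items(
    query="SELECT VALUE COUNT(1) FROM c",
    enable_cross_partition_query=True,
))[0]
print("Documents in container:", count)

# Peek at one document; each CSV row became a JSON document with key-value pairs.
for doc in container.query_items(query="SELECT TOP 1 * FROM c",
                                 enable_cross_partition_query=True):
    print(doc)
```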
All of this I will show you where it becomes relevant; let me get those use cases and I'll do it for you. So here you see all the things we defined, and we can redefine them here. Then, two datasets were created: one is your source dataset and one is your target dataset. If you click one and check the connection, you see the linked service we created earlier, and the file path is shown here. If you do a Preview Data you can actually see the data, and whatever we defined while creating the dataset is all here; so remember, if you want to change your dataset you can just come here and edit it. In your destination dataset you will see the details of the collection, and since there is data in it I can do a Preview Data and see the complete data. You can also see the schema details; let me do an Import Schema. Since there is data, it has imported everything, and by default it treats every column as a string; it has understood that it is Region, Country and so on.

We haven't created any data flows. If you go here you will see the connections I have set up; let me delete this one as I don't need it right now. So we have two connections, Azure Data Lake Storage and Cosmos DB. Everything is editable, you can even point a connection at a new subscription, but the name cannot be changed. Triggers we have not created, since it is a one-time run; you remember, while creating the copy activity we kept it as "run once". Suppose we want to add a trigger to our activity; I'll show you that when we have a slightly longer session and create two or three activities in a pipeline.

Next, it provides us with the Monitor. I showed you: it shows the pipeline name, when it started, how long it ran, whether it was triggered by a trigger or manually, and the status. If it fails you get logs, a small note where your error is shown with the reason for the failure, and it will mostly be Java-style output; there won't be a plain-English message saying "this has happened", so you need to debug by searching through that. This is the rerun option: if I want to rerun the pipeline I can click here. Right now I have just one activity, but if there were multiple activities I could select an activity and rerun from that particular activity, and I would also get the option "rerun from failed activities"; a new run starts and it triggers it. You can also monitor your trigger runs. And you see there is one integration runtime here: if I want to create a self-hosted one I can create it here, the Azure one is here, and if I have to move an SSIS package I have to create a separate SSIS integration runtime for that.

Student: So that this pipeline is automated, so we don't have to run it manually, what do we do?

In that case you have to create a trigger. I can create one trigger and associate it with the pipeline.

Student: Either you can do it at the beginning, because in the beginning there were two options, a single-time run or scheduled, or you can schedule it now? So you have both options?

Yes. When you create a trigger you have these three options.
A schedule trigger occurs every minute, or whatever recurrence you set. A tumbling window is a recurring window with options like delay and retry; it runs that many times and gives you all those kinds of settings. Apart from that there is one more type called an event trigger: suppose you want to trigger the pipeline whenever a blob arrives, you just create an event trigger, select the storage account, give the container name and the blob path, and it asks which event you want, whether the trigger fires whenever a blob is created, so the pipeline automatically loads it, or whenever a blob is deleted, or some other event. So there are three kinds of triggers. You create a trigger and then add it to that particular Copy Data pipeline; you see here there is an "Add trigger" option. If I go and edit over here I can add the trigger; for that I first have to go and create a trigger, or from here I can create a new trigger and add it. One more thing: once you create a Copy Data there is no Play button. If you press Debug it runs in debug mode; if you actually want to run it, go to Add Trigger and press "Trigger now".
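For reference, this is roughly the shape of the JSON definition a schedule trigger like the one just described gets stored as. The names, times and recurrence values are hypothetical, and the tumbling-window and event triggers have their own type properties.

```python
# Rough shape of a schedule-trigger definition of the kind Add Trigger creates,
# expressed as the JSON Data Factory stores. Names, times and recurrence are
# hypothetical placeholders.
import json

trigger = {
    "name": "DailyCopyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",          # Minute / Hour / Day / Week / Month
                "interval": 1,
                "startTime": "2020-02-01T00:00:00Z",
                "timeZone": "UTC",
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyDataTraining",
                    "type": "PipelineReference",
                }
            }
        ],
    },
}

print(json.dumps(trigger, indent=2))
```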
So we have successfully completed the Copy Data activity today. In this whole process, in Data Factory, you have moved data from ADLS Gen2 into Cosmos DB, and remember, the data was structured. ADLS Gen2 supports everything, so it is not necessary that only unstructured data will come in; here it was structured data and we loaded it into a NoSQL DB.

Student: So from ADLS the data is already in a structured format, and then we copy it to Cosmos DB; but in the case of Blob it might not convert into a structured form?

See, the conversion has not happened at the ADLS level; it happened when we ran the Copy Data activity. It happens internally; there is a huge set of code which runs at the backend, it is not visible to us, and we only see the data movement. Here you cannot see that, but when you do these things in Databricks you can configure them, because you are the one writing each and every line; there you actually write notebooks, you actually write the code. So you understand the difference now.

Student: You mean to say that even if we copy from Blob to Cosmos DB, the data will be in a structured format?

Here it was structured, and in Cosmos DB you have kept the SQL API, so it shows you this JSON format and stores it in that key-value format, as you saw.

Student: Suppose the blob is in an unstructured format; will it internally convert it into a structured one?

Yes, when we run the pipeline that will be done. Then, about this Copy Data activity: when I click over here, you see this run is in East Asia, and you see this arrow mark; here it converts and runs. They have built that engine and given it to us; what code is inside and how it is configured is not something we can change, and the configurations we get are only these. As for what kind of transformation is done: as I told you at the very beginning, this is basically for when you just want to move data with very limited transformations. Mapping can be done, but if you want to go and do real transformations, you have to create a data flow.

So let me create a new data flow. You add a source, and I'll show you what options are present: you have Join, Conditional Split, Exists, Union, Lookup. You can do a join; a conditional split takes your existing dataset and splits it into two based on a condition, so it acts as a filter where you can create two separate datasets; a derived column lets you derive a new column based on a certain condition; and if you want an aggregate, you can do that too.

Student (Shwetha): Can this be done from the pipeline too?

Yes. We create a data flow here, then you go to the pipeline, add a Data Flow activity, and run that activity from there. So this data flow will do something; suppose it takes two sources, joins them, filters some records and writes to a destination, which is again a blob storage file. So while I am copying a file into Cosmos DB, I can put a source in between, apply some transformation and then do it; in the pipeline I basically call the Data Flow activity, which internally calls that entire flow. When we create a data flow it actually spins up a cluster; you see options like script, allow schema drift and validate schema, and that kind of definition is written, while in the background it runs as Scala code. But we need not configure that; they let us do certain things in the front end and we work there. Data flow is an extensive topic in itself; I'll look for a use case and show it to you, but it is what you use for your detailed transformations.
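To make the data flow idea concrete: the transcript notes that a mapping data flow spins up a Spark cluster and generates code behind the GUI. This is a rough PySpark sketch of the same source, join, filter, derived column and sink pattern; the file paths and column names are hypothetical, and this is not the code ADF itself generates.

```python
# Rough PySpark equivalent of the data flow transformations mentioned above
# (source, join, filter/conditional split, derived column, sink). Paths and
# column names are hypothetical; a real mapping data flow generates and runs
# comparable Spark code for you behind the GUI.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataflow-sketch").getOrCreate()

base = "abfss://test@iptrainingadls.dfs.core.windows.net"
sales = spark.read.option("header", True).csv(f"{base}/sales_records.csv")
regions = spark.read.option("header", True).csv(f"{base}/regions.csv")

transformed = (
    sales.join(regions, on="Region", how="inner")          # join transformation
         .filter(F.col("Country") == "India")              # conditional split / filter
         .withColumn("TotalWithTax",                       # derived column
                     F.col("TotalRevenue").cast("double") * 1.18)
)

# Sink: write the result back to the lake (or it could be loaded into Cosmos DB).
transformed.write.mode("overwrite").parquet(f"{base}/curated/sales_india")
```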
Student: Is it that the data flow is more visible to the user?

It is not about being more visible to the user. See, suppose copying your data is your entire requirement; then there are two options, and your business has to approve which one to go with. Sometimes they choose Databricks over Data Factory, sometimes they choose Data Factory over Databricks. If you just have a copy activity, why would you go and explicitly create clusters, define all those things and write code? Just come over here, define the source and the target, and go ahead and do it.

Someone was asking me how we connect to on-premises. Here there is something called the self-hosted integration runtime, which can perform data movement and dispatch activities to external compute; this means you can set it up on your own system. But most companies will not allow you to set up a direct connection this way, because it is risky. So either you go this way, or there are other ways to get the connection set up to on-premises. As I was saying, I'll share one option: in the storage account, in Storage Explorer, you will see something called a file share. Here it is not showing, but I can go ahead and create a file share, and it clearly mentions serverless SMB file shares. If your organization supports SMB, that is the Server Message Block protocol version 3.0 or above, you can just create this file share, it allows mounting on your on-premises systems, and any file placed there will automatically come over here. Since this is an ADLS Gen2 account, it shows up in your storage account as a file share. You can also define tables, and queues are there for reading your message queues.

Student: Suppose I have scheduled a few pipelines as midnight jobs; is there a way to check the logs?

You have your Monitor over here, with all the details. Since I have run one pipeline it shows that; I can select the date range, say the last 24 hours, the latest runs, whether I want to include reruns, and so on. You have a dashboard which shows how many pipelines ran, how many activities ran, and whether there was a trigger or not. Then there is something called alerts: you can add an alert rule. You set the severity and the criteria, for example when any activity is cancelled or when any activity has failed, and you define that here. Then you configure the notification: you create an action group, add a notification, and choose what kind you want, whether an email, an SMS, an Azure app push notification or a voice call, and you have to pay for those as well. If you have more questions, please bring them to today's session, because I need you all to understand that part too.

Student: That integration runtime was there, right; did we create it manually?

No, I did not. By default the AutoResolveIntegrationRuntime, that is the Azure runtime, gets created. I'll show you something: suppose we want to create a new connection, a new linked service, say to a Blob storage, and continue; by default that runtime is selected, and when we click Create it gets created with it.

Student: What is best, is it good to use the auto-resolve integration runtime, or to create our own?

See, whenever you create something on your own, you become completely responsible for it; you have to set up each and every thing yourself. Here we didn't do any setup; it came by default, it did everything, and it does the best it should for that amount of data. Suppose you have small data and you use a very heavy configuration; you will unnecessarily pay more. So it is good to use the Azure one.

Student: And it says "running"; Microsoft manages whether it is running? Is there a case where it could be down?

99% of the time it won't be; if it is, it is their responsibility to fix it.

Student: We did our activities here, like ADF, Data Lake and different accounts. Once we have completed our examples, I need to delete all these resources, the resource groups and accounts. How do we do that in one go?

You want to delete everything? You just go to your particular resource group, and this is the reason why we created everything in one resource group: select everything, delete it, and once everything is deleted you can delete the resource group. Or else you can directly go and delete the resource group, and automatically everything should get deleted, unless there is some other dependency.
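A sketch of that "delete everything in one go" answer with the azure-mgmt-resource SDK: removing the whole resource group deletes every resource inside it, such as the Data Factory, the storage account and the Cosmos DB account, unless something else depends on them. The subscription and group names are hypothetical placeholders.

```python
# Sketch of deleting the whole resource group, which removes every resource in it
# (Data Factory, storage account, Cosmos DB) unless another dependency blocks it.
# Subscription and group names are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.resource_groups.begin_delete("intellipaat-training-rg")
poller.wait()   # blocks until the group and everything in it is gone
print("Resource group deleted")
```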
If you saw, I did a Publish; this means all my changes are saved. So if I make more changes and want to revert, it will go back to that saved state. I hope you got an idea of Data Factory; this is just a very small piece of functionality, and there are many, many more things we can do here.

Student: When we publish, where is it saved? Does it have its own internal repository, not GitHub?

Not GitHub; we have not set up the repository right now, so it is saving internally.

Student: So if I want to go back to, say, the first version, I cannot do that?

No, you cannot. For that you need to maintain versions, a proper versioning setup, and then you can do it; I have not set up the code repository right now.

Student: What are Data Factory's limitations? Are there restrictions beyond which we are not allowed to go?

Yes; for example, Copy Data has the restriction that only limited mapping is allowed, very simple mapping between different kinds of data. Initially they did not even provide date formatting: the problem we faced was that our dates were separated by slashes but we had to load them in a format with dashes, and Copy Data did not support that, though I heard they have updated that recently. Even where something like that is missing, they have given you Data Flow, so you can do it there. Then there are things like Azure Functions, a very powerful tool; you can write functions to do a lot of things, and that is a huge course on its own. You have Databricks activities here, and Data Lake Analytics: if you have a cluster created you can use U-SQL on it. There are general activities as well, like Lookup and Stored Procedure, or a Web activity if you want to call a web application. And the iteration activities are very helpful: suppose in the pipeline itself I want to set a condition, for each file I read I want to do some runs; I can use ForEach, Switch, Until, all these constructs, and then I can directly run my machine learning pipelines or executions from here.

Student: So finally, ADF will be used just for the data load from one place to another; is that the only thing?

It is not only loading from one place to another; using all these things you can do a very large number of transformations end to end. You get the data in any format, you transform it. You see what I have done here: I have actually done a Copy Data; if I create a data flow I can join something; I can copy structured data into Cosmos DB, and again I can take unstructured data and load it into Cosmos DB, the same database. Finally there will be a database, or some place where you load it, and from there you get the data for reporting, for your analysis, or for developing machine learning code.

One more thing: there is something called Azure Cloud Shell, where we have Bash and PowerShell.
One more thing, since it just came up: there is something called Azure Cloud Shell, where you have bash and PowerShell. Everything we did through the GUI also has commands you can run in bash and PowerShell; usually we use the UI because it is much simpler, but suppose you want to do something in a loop, then you can just put a command together and run it. I was also showing you that on the Data Factory page you can see the release notes; for example, on Jan 27th there is a release note where they added a few things, such as evaluating Data Flow expressions in inline strings in ADF, which now lets you interpret expressions in inline strings easily. These things were not there initially, so they keep adding features every now and then; you may be working on it today and see new features six months later if you are not constantly on the same thing.

Okay, now we'll move on to the theoretical part, so please take notes. I think all of you now have a good idea of what a data lake is; these are the aspects we have to understand with respect to your data lake and your data storage, so let's go through the features one by one.

Data ingestion: as I said, it supports structured, semi-structured and unstructured data. You have to understand something here. Suppose our ETL tool is, say, Ab Initio, and I want to read one file that is in CSV and another that is in JSON; I cannot join them directly like that. I have to at least bring both into a form where I can read them, they have to share a common format, and only then can I do the processing. Here that is not required. The things I am showing you are very basic, and as of now that is mostly what people are doing, because most industries are just moving over, but there is a lot more you can do. Multiple ingestion modes like batch, real-time and one-time load are supported; I showed you we can even take in real-time data. Many types of data sources are supported, such as databases, web services, emails, IoT devices and FTP, and from all of these the data will come directly into your ADLS Gen2.

Next, how is it stored? Data storage is scalable: you can always go and increase the size of your storage account, and there is no hard limit. When I uploaded the blob over here you saw a size, but nowhere is it mentioned that a size is too small or too large; I can just keep adding files, even a huge file. It may take some time to upload, but even huge files work; there is no limit to that. And it supports various data formats, as I showed you: CSV, JSON, Parquet, SQL data and so on.

Next is data governance. Data governance is the process of managing the availability, usability, security and integrity of the data used in an organization. At the storage account level you can set up a lot of things: you see over here there is something called Firewalls and virtual networks. Currently I've kept it open to all networks, but I can very well restrict it to a selected network, create a virtual network, define a subnet inside it, and put in a lot of detail; I can even create a private endpoint here. When I go to advanced security it takes some time to load, because I need to sign in to something, and then Azure will give me additional features. And I showed you the Key Vault, right? When we use that kind of service, we create a Key Vault connection for the credentials.
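As a rough sketch of that Key Vault idea, here is how a script might pull a storage connection string out of Key Vault at runtime using the azure-keyvault-secrets package, so nothing sensitive sits in code or in the pipeline definition; the vault URL and secret name are placeholders.

```python
# Sketch: fetch a storage connection string from Key Vault instead of hard-coding it.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()

# "demo-keyvault" and the secret name are made up for this example.
vault = SecretClient(vault_url="https://demo-keyvault.vault.azure.net", credential=credential)
conn_str = vault.get_secret("storage-connection-string").value

# The secret is used only in memory; it never appears in the pipeline or the repo.
blob_service = BlobServiceClient.from_connection_string(conn_str)
for container in blob_service.list_containers():
    print(container.name)
```

Data Factory's own Key Vault linked service works on the same principle: the pipeline holds a reference to the secret, not the secret itself.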
With that, nothing is actually displayed in your Data Factory or in your storage account; the connections are completely hidden, so even at that level we can secure things. There is also the encryption setting, which I will show you when it comes up.

Now, security. Security is implemented in every layer of the data lake, starting with storage, ingestion and consumption. The basic need is to stop access by unauthorized users. It supports different tools for access and an easy-to-navigate GUI, and authentication, accounting, authorization and data protection are some of the important features of data lake security. I showed you at the storage level how these things are done: if you go here, the access keys come into the picture, and if any of you wanted to access my storage account I would have to share these. Apart from this, there is also Access control, where I can give you roles, different kinds of roles, for this particular storage account. If I want, I can even restrict users by location: say the storage account is in India and I don't want anyone outside India to access it, I can control it at that level.

Next is data quality. Data quality is an essential component of data lake architecture; data is used to extract business value, and extracting insights from poor-quality data leads to poor-quality insights. So the quality of the data, the way the data is maintained, is also looked after; it is not that when some data stops being used we suddenly shove it into an archive in an ad hoc way, it is maintained properly.

Next is data discovery, another important stage before you can begin preparing data or doing analysis. In this stage a tagging technique is used to express data understanding, by organizing and interpreting the data ingested into the data lake. Tags are very important when you want to segregate the data: say you are getting files from the fraud department; everything lands in the same store, but through tags we can differentiate them, and retrieving the data later becomes much easier (a small sketch of tagging follows at the end).

Next is auditing. Data comes from various sources, departments, assets and classifications, and because of these variations some data requires special security and handling. Certain data in the data lake needs tracking of the changes it undergoes, as well as of who accesses it, for various legal and compliance reasons. As I showed you over here, you have all these different options; when you go to advanced security (it takes some time to load) you get even more advanced ways of securing the data. You need to sign in for that and pay some extra amount; even with pay-as-you-go I don't think it gets activated directly.

Data lineage deals with the data's origins, what happens to it, and when and where it moves over time. It simplifies tracing an error back to its source in an analytical process, and the journey from origin to destination is visualized with an appropriate tool, like you saw in Data Factory: in the pipelines you can see where the data is going, and if you go further into the logs it will show from what data type to what data type it has been converted.

Finally, data exploration: it helps in data analysis, and the right data sets have to be identified before starting exploration. I hope you are all clear now on Azure Data Lake storage.
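To make the tagging idea from the data-discovery part concrete, here is a hedged sketch using blob index tags from the Python azure-storage-blob package; the account URL, container and blob names are invented, and this is just one possible way to tag data, not necessarily the mechanism shown in the demo. Note that, as far as I am aware, blob index tags apply to regular Blob Storage and are not available on hierarchical-namespace (ADLS Gen2) accounts, where you would fall back to metadata or an external catalog, so treat this purely as an illustration of the tagging idea.

```python
# Sketch: tag a blob with its owning department, then find everything with that tag later.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://demostorageacct.blob.core.windows.net",  # placeholder account
    credential=DefaultAzureCredential(),
)

# Tag a file that came from the fraud department.
blob = service.get_blob_client(container="raw", blob="fraud/2020/claims.csv")
blob.set_blob_tags({"department": "fraud", "sensitivity": "high"})

# Later, retrieve everything the fraud department owns, wherever it was stored.
for match in service.find_blobs_by_tags("\"department\" = 'fraud'"):
    print(match.container_name, match.name)
```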
Info
Channel: Intellipaat
Views: 24,009
Keywords: azure data lake tutorial, azure data lake training, azure data lake architecture, azure data lake analytics, azure data lake store, azure data lake storage, azure data lake vs blob storage, azure data engineer training, azure data factory, Introducing Azure Data Lake, azure data engineer, azure DP200, azure data lake, data lake, get started azure data lake, get started with azure data lake, data lake store
Id: sS3Xsw-F344
Length: 77min 9sec (4629 seconds)
Published: Wed Apr 22 2020