Azure Data Lake Storage (Gen 2) Tutorial | Best storage solution for big data analytics in Azure

Video Statistics and Information

Captions
Are you building big data analytical solutions? Then I've got the perfect service for you. This is Adam, and today we're going to get an introduction to Azure Data Lake Storage. Stay tuned.

I always like to start with a definition. A data lake is a data storage solution that was specifically designed for big data analytics. How does it work? It's quite simple: each data lake service underneath always has a container, and that container is very often called a file system. Just like any file system, it has folders and files within it. On each data lake you can have multiple containers, multiple file systems, containing any structure of files and folders you wish.

I already said this service is designed for big data analytics, and that means there's something called the Azure Blob File System, ABFS for short, sometimes with an S at the end suggesting it goes over SSL, so it's encrypted. This file system is Hadoop compatible, which allows many of the existing solutions on the market to connect with no hassle at all, like Hortonworks, Databricks, HDInsight, Cloudera, or Hadoop itself. All those systems have almost no issues connecting to the data lake out of the box; just a few lines of code from the documentation and you're ready to use it.

Additionally, one thing I want to highlight is something that was recently released: multi-protocol access. Thanks to this, you not only have ABFS but also WASB, which is Windows Azure Storage Blob, again with an S at the end for encrypted. So you have two ways to connect to your data lake: a Hadoop-compatible one and the classic Blob Storage API. Things that normally could not connect to a data lake can connect now, like Power BI or Analysis Services. Of course, Power BI recently also got a connector to the data lake through ABFS; previously it didn't, and this was a nice way to connect. So multi-protocol access is a very big thing: even something like an older Python SDK that works with Blob Storage will also work with your ADLS.
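To make the two access paths concrete, here is a small illustrative sketch (plain Python, not an official SDK snippet; the account, container, and file names are placeholders from this demo) showing how the very same file is addressed through the Hadoop-compatible ABFS driver versus the classic Blob endpoint:

```python
# Illustrative only: builds the two URI styles that multi-protocol access
# exposes for the same file. Note the different endpoints: dfs vs blob.

def abfss_uri(container: str, account: str, path: str) -> str:
    """Hadoop-compatible ABFS URI (the trailing 's' means TLS-encrypted)."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

def wasbs_uri(container: str, account: str, path: str) -> str:
    """Classic WASB(S) Blob URI used by older, Blob-oriented tools."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"

print(abfss_uri("demo", "amadlsdemo", "movies.csv"))
# -> abfss://demo@amadlsdemo.dfs.core.windows.net/movies.csv
print(wasbs_uri("demo", "amadlsdemo", "movies.csv"))
# -> wasbs://demo@amadlsdemo.blob.core.windows.net/movies.csv
```

Same container, same file; the only difference is which protocol and endpoint the client speaks, which is exactly why older Blob-only tools keep working.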
So doing data science is another cool way to utilize your data lake storage.

The current version of Data Lake is generation 2. It evolved from the first iteration of Data Lake Storage but was also built on top of Blob Storage, and there are several benefits coming from that: you have two services, two concepts, combined into one, and you get the benefits of both. From the data lake side you of course get Hadoop-compatible access, but you also get POSIX permissions, so ACLs allow you to manage read, write, and execute permissions at the folder and file level, plus an optimized driver for big data analytical workloads. From the Blob Storage side you get low cost, which can be taken even further with the storage tiers: hot, cool, and archive. You additionally get high availability and disaster recovery, so you get multiple copies of your files spread across regions. If you want to learn more about geo-replication for Blob Storage, check my introduction video on the storage account; this is all I want you to remember right now.

So what are the similarities and differences between ADLS and Blob Storage? Access tiers are the same: you get hot, cool, and archive storage on both, and lifecycle management as well. The top-level organization of your data also stays the same; it's a container in both cases. Below it, though, things change a bit, from virtual directories to real directories. In the case of ADLS, you have containers, and those containers have files and folders; that's the classical structure you're accustomed to when exploring your own hard drive. With Blob Storage you only had a container and files underneath it, and while some tools display folders, those were only virtual folders. Since there was no real folder concept, you couldn't do any action on an entire folder at once; you always had to iterate through all the files within a specific path.
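To illustrate why virtual folders matter, here's a small, self-contained sketch (plain Python, no Azure involved; the blob names are made up) of what a flat Blob namespace forces you to do: a "folder" rename means iterating over and rewriting every blob whose name shares the prefix, whereas ADLS Gen2's hierarchical namespace makes the same rename a single metadata operation on the real directory.

```python
# Flat (Blob-style) namespace: folders exist only as name prefixes,
# so "renaming a folder" means touching every matching blob one by one.
flat_blobs = ["raw/2019/movies.csv", "raw/2019/ratings.csv", "curated/out.csv"]

def rename_virtual_folder(blobs, old_prefix, new_prefix):
    """Simulates a folder rename on a flat namespace: one rewrite per file."""
    return [
        new_prefix + name[len(old_prefix):] if name.startswith(old_prefix) else name
        for name in blobs
    ]

renamed = rename_virtual_folder(flat_blobs, "raw/", "archive/")
print(renamed)  # every file under raw/ had to be rewritten individually
```

With a hierarchical namespace there is a real `raw/` directory object, so the rename (and likewise ACL changes) applies once at the folder level instead of per file.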
There are some features that you're going to lose when moving to generation 2, most notably soft delete, snapshots, and immutable storage, but also static websites and BlobFuse. Definitely check out the documentation if you need those features, but there is nothing saying you cannot create both a Blob Storage account and a Gen2 account in your resource group and use both of them.

Lastly, when comparing the two services, we need to talk about ways of authenticating and authorizing access to your data. For both Blob and Data Lake you still have access keys, shared access signatures, and role-based access control (RBAC) from Azure Active Directory. But additionally, for Data Lake you get access control lists: thanks to POSIX and this file-and-folder structure, you can assign specific people, groups, or applications to specific files or folders on your drive, everything protected through Azure Active Directory. This is one of the biggest selling points when collaborating on data across an organization.

And now the best part, the demos: creating an ADLS account, followed by controlling access with access control lists; then I will connect from Power BI and demo the multi-protocol access; and lastly we're going to do a simple ETL using Databricks, to show you how easy it is to connect to the data lake from your big data analytical workloads.

Let's go to the portal and start creating. Go to your menu, hit Create a resource, and search for "storage account"; you can also find it here in the quick start, because ADLS, as I said, is built on top of Blob Storage, so it's the same service. Let's create a new resource group; I'm going to call it amadlsdemo. Now I need a storage account name; I'll use the same, amadlsdemo. Is it free? It's free. I'll choose North Europe, and I'm going to change the replication to LRS to keep my storage as cheap as possible. Right now everything is the same as for a normal storage account. To get ADLS out of this, go to the Advanced tab at the top; here you will find Data Lake Storage Gen2, where you can hit Enable on the hierarchical namespace. Of course you'll get the notification that you're losing the soft delete feature, but we don't need it right now. Hit Review + create, let the validation run, everything looks fine, hit Create, and I'll skip ahead. All right, the deployment finished after 30 seconds; we can now go to the resource and explore what ADLS delivers.

If you look at the screen and you know storage accounts already, this looks familiar, because it's the same service in the end. Everything you have here is pretty much the same feature set as for Blob Storage; you even retain file shares, tables, and queues on your data lake. The only difference is the containers, which get a different icon just to indicate that these are different containers; notice that it doesn't say "container" here, it says "file system." Here I can create a demo container where we're going to play around; it's as easy as that.

When it comes to tools, you can either use Storage Explorer in the browser to manage your file systems, or you can switch to the Storage Explorer desktop app: right-click on Storage Accounts and hit Refresh to get the latest list of your storage accounts. I personally like to use Storage Explorer because it gets features earlier than the other tools and it allows for more flexibility and more operations at the storage account level; that's why I will do the demo there. So I open my ADLS, and under Blob Containers you can find our demo container, which is our file system. Here you can use the Upload button to upload a folder, or, as I'm going to do right now, upload a small CSV called movies.csv. Hit Upload, and as you see, this is pretty much the same experience you would get with Blob Storage. The only difference here is that on each file you can
right-click and manage access. So I can right-click, hit Manage Access, and I get the list of roles available on this file and the permissions those roles give. Let's say I want to give someone full access to this file: I hit Add here and search for that user or application. I type "adam," hit Search, and find my users here; I can even find my external user and my local user, because I have only two users in my Active Directory, both called Adam. Hit Add, and now this user is added, and you can select what kind of permissions this user gets; maybe I want to give him read and execute. In the case of groups, you can also add someone to a group: select it, again type "adam," search, and assign Adam as an owner of this folder. When it comes to ACLs, there's a big guide on the Microsoft website; definitely check it out if you want to learn more about POSIX and how POSIX compatibility is implemented for Data Lake, but for now this is a simplified version of the process. Hit OK, and our permissions are saved.

Remember, you can do the same on an entire container: hit Manage Access on the container to grant someone access to the whole thing. It's the same principle at every level. Adam is not listed here, therefore Adam has no container-level access. And the last part: you can also create folders. So I create a demo folder, and within that demo folder I again upload a file, selecting movies.csv once more.

There's one important thing I want to show you, because it's quite critical. If I go to the demo folder, go to Manage Access, and grant Adam access — add the Adam user, search, add him, give him read, write, and execute, and also tick Default — then, as it says below, the default entry will automatically add these permissions to all new children of this directory.
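Under the hood, what Manage Access edits are POSIX-style ACL entries of the form scope:principal:permissions. As a small illustrative sketch (plain Python, not the Azure SDK; the name "adam" is a stand-in, since real ADLS entries reference Azure AD object IDs):

```python
# Toy parser for POSIX-style ACL entries as used by ADLS Gen2,
# e.g. "user:adam:r-x" -> Adam may read and traverse, but not write.
# (Real ADLS entries use Azure AD object IDs, not display names.)

def parse_acl_entry(entry: str) -> dict:
    scope, principal, perms = entry.split(":")
    return {
        "scope": scope,            # "user", "group", "other", "mask", ...
        "principal": principal,    # empty string means the owning user/group
        "read": "r" in perms,
        "write": "w" in perms,
        "execute": "x" in perms,   # on a directory: permission to traverse it
    }

acl = [parse_acl_entry(e) for e in ["user::rwx", "user:adam:r-x", "other::---"]]
print(acl[1])
```

So the read/execute checkboxes in the Storage Explorer dialog map directly onto the `r` and `x` bits of an entry like `user:adam:r-x`.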
But notice it said new children. If you go inside the folder, you will see in Manage Access that Adam is not on the list, because the default entry does not apply retroactively; it only applies to new files. If I re-upload movies.csv right now, then Adam will get access: replace the file, transfer completed, Manage Access, and as you see, Adam is now on the list. This is critical to remember when you design your permissions.

The next demo I have for you is Power BI. Power BI can connect to ADLS in two ways: it can use the dedicated connector for ADLS, or the regular storage account connector via multi-protocol access, and I want to show you both. Let's go to Get Data, select Azure, select Azure Data Lake Storage Gen2 (as you see, it's in beta), and here you need to provide the URL: the full URL of your Data Lake Storage. I need to go back to the portal; notice that the Overview tab doesn't show the full address. You can find it under Properties: here is the full ADLS file system endpoint. So I go back, paste that URL in, and hit OK, and I have two options: I can either authenticate with the account key, or use the same organizational account I use to log into Azure. To skip the multi-factor authentication, which I don't want to go through right now, I'll use the key: go to Access keys, grab the key, paste it in, hit OK, review the files, hit Transform Data. We get only one file, so let's grab movies.csv, click Binary, and this is our file.

The second thing I want to show you is the multi-protocol access itself, because I can instead use the Azure Blob Storage connector and, instead of typing the ADLS URL, just provide the name of the storage account it lives on. Hit OK; again I need to provide the account key. Notice that we lost the organizational account option; this is because for Blob Storage it was never implemented in the connector.
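The two connectors really differ only in which HTTPS endpoint they point at. As a sketch (the account name is this demo's placeholder): the ADLS Gen2 connector wants the DFS endpoint you copied from the Properties blade, while the Blob connector derives the blob endpoint from just the account name.

```python
def dfs_endpoint(account: str) -> str:
    """Endpoint shape the ADLS Gen2 connector expects (Properties blade)."""
    return f"https://{account}.dfs.core.windows.net"

def blob_endpoint(account: str) -> str:
    """Endpoint shape the classic Blob connector resolves from the name."""
    return f"https://{account}.blob.core.windows.net"

print(dfs_endpoint("amadlsdemo"))   # https://amadlsdemo.dfs.core.windows.net
print(blob_endpoint("amadlsdemo"))  # https://amadlsdemo.blob.core.windows.net
```

Same account, two front doors; multi-protocol access is what keeps both doors open to the same files.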
Azure itself actually allows it, though. So let's go back, grab the key again, paste it in, and hit Connect. As you see, it's a slightly different structure, but the same principle: the storage, the transformed data, the files we have, the movies.csv binary, and the same file is here. This is how multi-protocol access lets you connect from tools that don't natively connect to the data lake.

For the very last demo, I will go back to the portal and create a Databricks workspace, because this will be a Databricks demo. Find Azure Databricks, hit Create, and type a name; I'm going to call it amadlsdemo, the same name as our storage account. Select the existing resource group, pick a location (North Europe is fine), pick the Standard pricing tier, hit Create, and let's skip ahead. The resource was created, so we hit Go to resource and launch our workspace; this usually takes between three and five minutes, but in this case it was actually less than a minute.

What we need to do right now is create a cluster. Really quickly: if you get a blank screen on Create Cluster, just wait a couple of minutes; the workspace usually needs to initialize, and that can take up to five minutes. So let's create a demo cluster: Standard mode, on the default runtime; I disable autoscaling because it's a very small demo, enable auto-terminate after 30 minutes, and scale the cluster down just to save some costs. Hit Create Cluster and wait a couple of minutes.

The cluster has been created. At this point we can go to our workspace, go to Users, our personal user, right-click, and create a new notebook. This is the script that will execute our ETL workloads, and we're going to use Scala today because I wrote the scripts in Scala; I actually prefer Scala personally, but that's just my taste, so don't worry about it. For today we have a couple of scripts that I will run, but most importantly I want you to understand that there are a couple of different ways you can attach
storage. You can, for instance, attach it via the account key. This is the least secure and, I would say, the least preferred way of attaching a data lake, because you have access control lists and advanced security for a reason, so attaching through the account key should be discouraged. You can also attach through an app ID, and this is what we're going to do today, but I'm going to use a slightly bigger piece of code which not only attaches the cluster to the data lake but also mounts it as a local drive. What this code does is take an application ID, a password, and a tenant ID; those three basically say "use my Active Directory, and the application with this ID."

So let's create that application right now. In the portal, go to Azure Active Directory, App registrations, and create an app. I already have one, but I can create a new one; I'll call it amadlsdemo-application. Nothing else is required right now, so just hit Register. What we did is create an application account that we're going to use. The app ID is here, the Application (client) ID: hit copy, go back, and paste it in. The second thing we need is a password: to generate a password for an application, go to Certificates & secrets and hit New client secret; the description is anything you want, and I'll take a one-year expiration. Copy it to the clipboard and paste it in. The third thing we need is the tenant ID: go back to Overview, and there's the Directory (tenant) ID; this is the ID of the Active Directory where this app resides. Next we need a file system name; you can either use an existing one or create a new one. If you forgot it, go to the resource group in the portal, open the amadlsdemo storage account, and in the containers you'll find your file system, in this case called demo, so let's type demo here. And lastly we need the storage account name, so basically amadlsdemo; let's paste it in. If you run this right now, it will mount the data lake as a local drive.
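For reference, the documentation's mount code boils down to a handful of ABFS OAuth Spark configuration keys plus an abfss source URL. Here's a runnable sketch that just assembles those pieces (the config key names are the documented ABFS OAuth settings; the IDs are placeholders, and the actual `dbutils.fs.mount(...)` call only works inside a Databricks notebook, so it's left as a comment):

```python
def oauth_mount_configs(app_id: str, secret: str, tenant_id: str) -> dict:
    """Spark configs for mounting ADLS Gen2 via a service principal (OAuth)."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": app_id,          # Application (client) ID
        "fs.azure.account.oauth2.client.secret": secret,      # client secret (use a secret store!)
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

configs = oauth_mount_configs("<application-id>", "<client-secret>", "<tenant-id>")
source = "abfss://demo@amadlsdemo.dfs.core.windows.net/"

# Inside a Databricks notebook (Scala or Python) you would then run:
# dbutils.fs.mount(source=source, mount_point="/mnt/datalake", extra_configs=configs)
print(sorted(configs))
```

The three values from the portal (client ID, secret, tenant ID) slot straight into these keys; everything else is boilerplate you copy from the documentation.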
Just like a normal drive. Notice that the source URL is abfss; we already talked about that. This is nothing you really have to remember; it's just a copy-paste from the documentation, and it mounts the file system as storage. So let's run it: mounting usually takes about 15 to 25, up to 30 seconds, so let's see the result. As you see: "This request is not authorized to perform this operation." Why is that? The answer is very simple: we created an application, but we never gave that application any access. Since we are logging in with a static application account, this account has to have access, and you can grant it in two ways: either use ACLs to give access to a specific file system or specific folders, or use RBAC, which is what I'm going to show you this time. Go to your ADLS, go to Access control (IAM), Add role assignment, and select the role; for ETL I usually like to use the Storage Blob Data Contributor role, which allows an application to modify every folder and every file on the storage. You need to type in the name, which was amadlsdemo-application; select it, hit Save, and if you go to Role assignments, you will find your application listed as a contributor on this entire data lake service. After probably between one and five minutes this will be propagated and you'll be able to run your code again.

So let's go back to Databricks and rerun the code: Ctrl+Enter to run, and let's see. Mounted storage — as you see, it went through flawlessly, returning true at the end. One important remark here: never leave your password in the open in code like that; use Databricks secrets for that. This is just for demo purposes.

Let's scroll down and test if our mounted storage works. If it does, we can run a very small script, like loading movies.csv from our mount point. If you noticed the code before, we mounted the storage under /mnt/datalake, so our path needs to start from it,
and then we follow the folders: if you remember, I created a demo folder, inside of which there's movies.csv. So let's run this, and as you see, it was as simple as that. This is of course not a Databricks demo, but you can change things, like setting the header option on the read and running it again so that you get proper headers; then you can do a little bit of transformation, like selecting specific columns; and lastly, you can save the result back to your data lake using the write action. After a second, if you go back to Storage Explorer and refresh, you will find a new folder with the partitions from your Databricks job. This is how easy it is to transform and work with ADLS from big data technologies. And remember: every time you do this, you don't have to assign permissions on the entire storage account; you can use Manage Access and ACLs, and you can assign not only users but also applications there. As you see, by default I'm an owner because I created this folder, but you can search for other applications, like the Databricks application I had in the portal.

So: Azure Data Lake Storage is an easy, scalable, fast, and extensible service for storing the data for your big data analytical workloads. That's it for today. Hit thumbs up if you liked it, leave a comment if you have a suggestion, and of course subscribe if you want to see more. See you next time!
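The notebook's ETL boiled down to three steps: read with headers, select columns, write back out. As a locally runnable stand-in (plain Python's csv module instead of Spark, with a made-up two-row dataset; the rough Spark equivalents from the demo are noted in comments), the same shape looks like this:

```python
import csv
import io

# 1) Read with headers
#    (Spark equivalent: spark.read.option("header", "true").csv("/mnt/datalake/demo/movies.csv"))
raw = "title,year,rating\nHeat,1995,8.3\nRonin,1998,7.2\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# 2) Transform: select specific columns
#    (Spark equivalent: df.select("title", "rating"))
selected = [{"title": r["title"], "rating": r["rating"]} for r in rows]

# 3) Write back out
#    (Spark equivalent: df.write.csv("/mnt/datalake/demo/output"))
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["title", "rating"])
writer.writeheader()
writer.writerows(selected)
print(out.getvalue())  # header line plus the two selected-column rows
```

The mount point and paths above are illustrative; in the actual notebook, step 3 is what produces the partitioned output folder you see back in Storage Explorer.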
Info
Channel: Adam Marczak - Azure for Everyone
Views: 104,739
Rating: 4.9778781 out of 5
Keywords: Azure, Data Lake, Data Lake Storage, ADLS, ADLSv2, Hadoop, Big Data, BI, Business Intelligence, Spark, Databricks, Gen2
Id: 2uSkjBEwwq0
Length: 24min 25sec (1465 seconds)
Published: Thu Dec 12 2019