Ingest data into Delta Lake on Azure Databricks

Captions
Hello, and welcome to today's session on working with Azure Databricks. This will be a three-part series, and by the end of it you'll have the know-how it takes to create a geospatial forecast using open weather data and some open traffic data. The overall goal of this session is to give you the ability to get hands-on enough that you can participate in an activity like an OpenHack, where you're really building something for real. With that, I'm going to turn it over to John from Databricks.

Hi, my name is John O'Dwyer. I'm a developer advocate here at Databricks, and I'm joined by Nikhil Gupta, who will introduce himself in a little bit. Nikhil is a solutions architect at Databricks. We're here to talk about how to get started with open datasets: we'll discuss ingesting data from those datasets into Databricks, and how to get started with weather data in particular. Nikhil is going to go through a demo of that. I'll show you a couple of things ahead of time, but Nikhil is the one who will do most of the work here. Nikhil, do you want to introduce yourself?

Sure, thanks John. Hi, this is Nikhil Gupta. I work as a partner solutions architect at Databricks, working closely with Microsoft partners to enable customers and make them successful. I'm glad to be here and to be showcasing some data ingestion with you.

Thanks, Nikhil. I'm going to pull up my screen here. Can everybody see my screen? Great. The first thing I wanted to show you is how to get started with Azure Databricks in your Azure environment. Azure Databricks is a first-party service in Azure: you access it directly from the Azure portal, and it can be set up and running immediately. Essentially, you just look for it and hit Create, and it takes you through the setup process. It uses everything native to Azure, including ADLS for storage and VMs for the compute; those live in your environment, and there is a control plane that I'll show you in a second as well. So that's all you do: hit Create and get going.

When you've finished, you end up on this screen, which is an Azure Databricks workspace. All compute and data live in your environment, and essentially we give directives to that data and those VMs to build clusters that you then interface with through this website, your platform or workspace, where you do all your work. A couple of things on this page: there are a lot of tutorials and ways to get started on the main page of the workspace. One thing I wanted to point out is the compute that's used at any given time. The compute you use to actually do the work you want to do, in this case ingest, is a cluster. You can create a new one and easily give it a name, and that's really all you need to do to start a cluster. There are a couple of different modes you can use. Usually you would start with Standard. High Concurrency is something you can use when you have a lot of people you want to bring onto the same cluster; in the opposite direction, if you want to do a small amount of work and just get started with development, say sampling ingest data or noodling on an ML model, you can use Single Node instead. There are also various runtime versions you can pick: Standard for data engineering, or ML for ML-specific work, say when you're ready to start modeling and want to use TensorFlow or scikit-learn. And that's about it: you hit Start and begin doing your work.
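For reference, cluster creation can also be scripted rather than clicked through in the UI. This is only a rough sketch using the Databricks Clusters REST API; the workspace URL, token, node type, and runtime version below are placeholders, not values from the session.

```python
import requests

# Placeholder workspace URL and personal access token -- substitute your own.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "ingest-demo",          # a name is really all you need to get started
    "spark_version": "8.3.x-scala2.12",     # pick a Standard or ML runtime as appropriate
    "node_type_id": "Standard_DS3_v2",      # Azure VM type used for the workers
    "num_workers": 2,                       # number of worker VMs
    "autotermination_minutes": 60,          # idle clusters shut themselves down
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```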
One thing I did want to point out is that we already have a cluster up: this is Nikhil's cluster, associated with the work he was doing, so I'm actually going to use that one. Now I'll show you how to go into the workspace and start doing some development. This is a collaborative workspace where everybody can work together, and you each have your own little area of the world where you can do your own thing. This is my workspace; I've already created a DevRadio folder for the work I'm doing, and this is the notebook I'll start with. You can create new notebooks or folders, build out libraries that you need, and build MLflow experiments as well. To get started with a notebook, you pick the language you want to work in: I prefer Python, so that's what I built my notebook in, but you can use Scala, SQL, or R, whichever language is best for you. Then you pick a cluster that's up and running; in our case we'll use Nikhil's cluster.

I'm actually not going to run any of this work, I'm just going to talk for a little while, but for both the notebook I'm going to show and the notebook Nikhil will show later, we have a Git repo with both notebooks out on GitHub, and there's a notification of where to find them, so you can take them and follow along. These notebooks are an interactive and collaborative development experience; they make developing data engineering workloads and ML work extremely easy. What's really great is that you can document your work with Markdown, so it's explicit: you can explain what you're doing and visualize and convey your ideas extremely well. Everything in the notebook I'm going to show is just that, Markdown. The other nice thing is that Nikhil can be in here with me and we can pair program together: he can work in this notebook while I do. That's an extremely powerful idea, especially in days like now when we're often not in the same place; it makes it extremely productive to work, collaborate, and pair program in these notebooks.
In this notebook I wanted to point out a couple of things: I'm using it to present why you would use Databricks and how it fits into the architecture of Azure as a whole. Databricks is generally used for three things: analytics, machine learning, and data engineering. At the core, Databricks is the compute layer for big data, analytics, and machine learning in Azure, and you can use all the different pieces of the Azure ecosystem around it; Databricks is one of those pieces. For example, what we're going to show today is ingest. That ingest can come from Event Hubs or Azure Data Factory, and it can also come from Azure Data Lake Storage, which is where we're going to pull our information from, using two different commands that I'll point out in a bit and that Nikhil will demo against the data lake. With Azure Databricks at the core of your analytics and data engineering, you can also use the other pieces of the ecosystem: you can push data to Azure Synapse for analysis and very quick access, you can work with Azure Machine Learning, and you can use Power BI as the BI layer on top of the data you access through an Azure Databricks cluster. There are also tools for monitoring and governance in the Azure ecosystem.

One thing that's at the core of how we work at Databricks is Delta Lake. Delta Lake is an open data format based on Parquet, but it allows you to do far more than just access data through Parquet: it uses Parquet at its core, and essentially it is Parquet with a set of logs and indexes around the Parquet data. What makes Delta Lake extremely powerful is, first, ACID transactions on a data lake. What I mean by that is that at a row level, Delta Lake allows you to delete, update, and merge data, and it puts ACID transactions around each of those operations, so you can't get caught in between transactions and corrupt your data the way you can with other big data technologies. This is an extremely powerful piece of Delta Lake: for example, if someone is reading the data at the same time you delete rows, you can't corrupt it; Delta keeps that from happening. Another thing the Delta log allows you to do is use Delta Lake tables as sources or sinks of streams, so you can start to unify your ETL processes between streaming and batch; we'll get into some of those batch and streaming aspects with the ingest Nikhil will show you later. Another great piece is schema enforcement, which puts guard rails around the type of data you put into your data lake so that it doesn't become a data swamp, while also allowing you to evolve those schemas over time: if you need to change a data type in the schema associated with your Delta table you can do that, and you can also add extra columns of data over the course of time. It's a very powerful tool on top of your data lake.
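To make those row-level operations concrete, here is a minimal sketch using the Delta Lake APIs from PySpark. The table path and column names are hypothetical, not from the session.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

path = "/tmp/delta/weather_demo"  # hypothetical Delta table location

# Write an initial batch as a Delta table (Parquet files plus a _delta_log of transactions).
(spark.range(10)
      .withColumn("temperature", F.rand() * 40)
      .write.format("delta").mode("overwrite").save(path))

dt = DeltaTable.forPath(spark, path)

# Row-level DML, each wrapped in an ACID transaction:
dt.delete("id > 7")                                               # delete rows
dt.update(condition="id = 1", set={"temperature": F.lit(21.5)})   # update rows

# Merge (upsert) a new batch into the table.
updates = spark.range(5).withColumn("temperature", F.lit(30.0))
(dt.alias("t")
   .merge(updates.alias("u"), "t.id = u.id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())
```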
Everything discussed here is also referenced: if you go and get this notebook, there are a lot of references directly in it around what we see as the data analytics architecture, how Databricks fits into ETL as a whole, and Delta Lake as a whole. Two of those documents are what we'll discuss in a little bit with Nikhil: COPY INTO, which is a very easy way to do incremental ingest through SQL, and Auto Loader, which is a very easy way to do streaming ingest in the same manner. To demonstrate this, and to spur some ideas, we're going to use Azure Open Datasets to access and ingest information, which Nikhil will go through in a minute. We'll access datasets you already have access to through Open Datasets, and how to do that using the azureml-opendatasets package is what Nikhil will get into. This specific dataset is a NOAA weather dataset that's available to everyone; NOAA is the U.S. agency that deals with oceanic and atmospheric research. We're really excited to see what you come up with from these packages. Weather in particular has all kinds of practical applications, and we've referenced a couple of use cases you can check out to spur some thoughts about what kind of hackathon project you could build from these datasets, one being a really cool case where AccuWeather is using a lot of this data to literally predict the weather on Databricks already; that use case is referenced here. So we're looking forward to seeing what you come up with, and whether and how these different features of Databricks can help you do it. With that, I'm going to turn this over to Nikhil, and this is the notebook he's going to get started with. Over to you, Nikhil.

Perfect, thanks a lot, John. One question before we start: can you elaborate a bit more on the multi-hop architecture, the bronze, silver, and gold layers? How do you go about it, and what's the idea behind it?

Sure, a quick overview. We call it the medallion architecture, and it's an architecture that works extremely well with Delta Lake. It starts from ingestion tables that we consider bronze; those are what we'll focus on today with COPY INTO and Auto Loader, and they're essentially just the raw format that you have. Then, using the streaming and iterative processes that run off the back of those tables, you can use them as sources for other tables, to refine the data and transform it into silver sets of data. That's usually at a row level: if you're looking to access row-level data, that's where you'd do it, filtering out bad information, dealing with duplicates, and things like that. Past that, you can aggregate the data into gold tables, or sometimes the gold tables that ML engineers use are actually feature sets, or outputs of models that run on a row-by-row basis. Those gold tables are the ones that data scientists or analysts would use; they're really refined and ready for consumption downstream.
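As a rough illustration of that bronze-to-silver-to-gold flow (not the session's actual notebook; the paths and column names below are made up):

```python
from pyspark.sql import functions as F

bronze_path = "/mnt/lake/bronze/weather"    # raw ingested data (hypothetical paths)
silver_path = "/mnt/lake/silver/weather"
gold_path   = "/mnt/lake/gold/weather_daily"

# Silver: row-level cleanup of the raw bronze table -- drop duplicates and bad rows.
bronze = spark.read.format("delta").load(bronze_path)
silver = (bronze
          .dropDuplicates(["stationName", "datetime"])
          .filter(F.col("temperature").isNotNull()))
silver.write.format("delta").mode("overwrite").save(silver_path)

# Gold: aggregates that analysts or ML feature pipelines consume downstream.
gold = (spark.read.format("delta").load(silver_path)
        .groupBy("stationName", F.to_date("datetime").alias("day"))
        .agg(F.avg("temperature").alias("avg_temperature")))
gold.write.format("delta").mode("overwrite").save(gold_path)
```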
Is there anything else you'd add? Yes, that's exactly it. At the end of the day, on a very high level, think of these as three different folders in your ADLS location: you're putting data in there and moving progressively towards more augmented and refined datasets. That's exactly right. Were there any other questions you think we should go through at this point? I think we're good; the last question is about how it's costed, and I can quickly take that one. Basically, the cost is your compute: for the amount of time your cluster is up and running, you're charged for that cluster. That said, we have a lot of options around auto-scaling and auto-termination, so if the cluster is idle for a certain amount of time it automatically terminates. So the compute, at the end of the day, is what's charged; nothing else from the Databricks side. Perfect, so let me go and share my screen and begin the demo. Sounds good, thank you.

Hopefully you're able to see my screen. Today the idea is to take the NOAA dataset, an open dataset on Azure, download some of the data, and then move it into a Delta location, the bronze location. This is how most organizations get data onto the cloud: you bring all sorts of data, structured, semi-structured, and unstructured, into cloud storage; ADLS Gen2 on Azure is best in class for that. You bring in that raw data, and then the question is how to easily move it into a curated format. We normally recommend moving it into Delta Lake, the Delta format, with all the benefits John talked about earlier: ACID transactions, caching, schema enforcement. At the end of the day, Delta Lake gives you data-warehouse-like capabilities on your data lake. Once your data is in curated form, you can run all sorts of workloads on it without duplicating the data for each workload: the same Delta Lake can serve your streaming, your analytics, and your machine learning workloads, one data source for multiple workloads. That's the idea of a unified data lake. Decomposing that a bit more, today we'll concentrate on the Delta ingestion part, but data could be coming from any number of sources: streaming sources, or batch sources coming from on-prem systems. On Azure, if you're streaming, Event Hubs or IoT Hub is probably the best way to get that data in, and for batch processes we've seen tremendous success with our customers using ADF.
Then you move that data to a landing zone, and from there we start to ingest it and bring it into the bronze layer, the raw ingestion layer; today we'll concentrate on Auto Loader and the COPY INTO command for that. We've talked about the multi-hop, or medallion, architecture, and Databricks gives you clusters, any language of your choice, and optimized Spark to process that data and develop different use cases around it. The bulk of the demo will be about Auto Loader and COPY INTO, so with that said, let's start and ingest some data; that's what we're here for.

Since we're using an open dataset, I did a pip install of the azureml-opendatasets library. There are a couple of ways to install a library: another way is to go to my compute page, this is my cluster, and go into Libraries; I can install a new library there, from repositories like Maven and PyPI, I can store libraries on ADLS or DBFS, or I can even drop in JAR files. So there are a good number of ways to bring libraries onto your compute. I took the simple approach and just pip installed the open datasets library in the notebook. Once that's done, the library is installed and I can import it along with a few other things I need to process the data. We're working with weather data, so from the open datasets package I import that library. For this demo I'm using DBFS, the Databricks File System, which comes automatically when you deploy a Databricks workspace. You could also use an ADLS Gen2 location or Blob Storage; the reason I used DBFS is that I want this notebook to be reproducible, so if you take it and run it in your environment, it just runs. But you could attach an ADLS Gen2 or Blob Storage location: you just mount it to DBFS and you can start putting data in and consuming data from there; it works in a similar fashion. The first step is to create a folder: in DBFS I created a folder for this ingest demo, and once I have it I can put some data in.

Now we're ready to read data: the folder is set and the libraries are imported, so I can go and pull some data. For this demo I took ten days' worth of data from May 2020, so I gave a start date and an end date. Typically I'd suggest not pulling two or three years' worth of data in one go, because sometimes the cluster struggles to handle that; handle about a month at a time, and write a small Python function that reads the data a month at a time, as in the sketch below. That would be my suggestion, a tip for getting the data in.
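A helper along the lines Nikhil describes might look like the following. It assumes the azureml-opendatasets package is installed on the cluster; the DBFS folder and file names are hypothetical, not the ones used in the demo.

```python
from datetime import datetime, timedelta
from azureml.opendatasets import NoaaIsdWeather

LANDING = "/ingest_demo/landing"                 # hypothetical DBFS landing-zone folder
dbutils.fs.mkdirs(f"dbfs:{LANDING}")

def land_month(year: int, month: int) -> str:
    """Download one month of NOAA ISD weather and land it as a CSV file in DBFS."""
    start = datetime(year, month, 1)
    # First day of the next month, minus one day, gives the end of this month.
    end = datetime(year + month // 12, month % 12 + 1, 1) - timedelta(days=1)
    pdf = NoaaIsdWeather(start, end).to_pandas_dataframe()
    path = f"/dbfs{LANDING}/isd_{year}_{month:02d}.csv"   # /dbfs is the local FUSE mount
    pdf.to_csv(path, index=False)
    return path

# The demo pulls roughly ten days of May 2020; a month at a time keeps each pull manageable.
land_month(2020, 5)
```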
So I read those ten days' worth of data, converted it into a pandas DataFrame, and saved it as a CSV file. Now the CSV file with the ten days of data is saved: think of it as having moved the data to a landing zone; that's where my data currently resides. The next step is to move it into my bronze zone, and the two ways to do that which we'll discuss today are Auto Loader and COPY INTO. So the data has landed as a CSV file in my landing zone, and we'll move it into the bronze zone.

Let's understand what these two things, Auto Loader and COPY INTO, actually are. They are methods of ingesting your data from a landing zone into a Delta Lake table or folder. Somebody asked how that's different and why it's useful. The idea is this, and we'll show it in the demo as well: assume you have a folder in ADLS and you're constantly uploading new data files to it. When you run the Auto Loader or COPY INTO command, only the new files get processed; the whole folder does not get reprocessed. So say I load the data for May, and tomorrow the dataset for June comes in; when you rerun COPY INTO or Auto Loader, only the June data is picked up, processed, and moved into the Delta folder. This is really powerful for two reasons. One, you don't need a streaming architecture like Kafka to get your data from the landing zone to your bronze table; it significantly simplifies the incremental ETL process. Two, you can do both continuous ingest and scheduled ingest. By scheduled we mean you can put it in a job: on Databricks, if you want to schedule something, we use Jobs, so you can schedule it for, say, 9:00 PM at night, and that notebook will pick up the new files, process them, and put them in your bronze folder; then your pipelines can run incrementally and move the data on to the silver and gold tables. Or you can have a more continuous process: Auto Loader, for example, can work as a streaming use case where it continuously listens to a folder, and as soon as a file is dropped in, it picks that file up and moves it to the bronze folder. Both can be done, scheduled or continuous, and both COPY INTO and Auto Loader support them. We've talked a lot about it, so let's go and see it in action.
As a first step, I create a database, ingest_bronze, and I give it a location as well; there are no tables in it yet, so it returns no results. And this is how the COPY INTO command works: it's practically one line of code that moves my data. I do COPY INTO, pointing at the folder I'm putting data into, FROM my raw or landing location where the CSV file exists; I mention that the file I'm reading is a CSV, and I want header set to true. So with really a single line of code I was able to take the data that was in CSV format, this May 2020 pandas file, and convert it into the Delta format, and I can see that around 3 million rows were inserted into that folder. Let's look at the folder: this is my Delta folder, and at the end of the day, with Delta the data is stored as Parquet, and on top of the Parquet there is a _delta_log that stores each and every transaction that has been made on this folder. Really easy: with a single line of code I converted this into Delta format. Now I want to do some processing, so I create a table: I created a table called weather, specified that the data is in the Delta format, and gave the location of the data, which is this folder. Then, since this is a Python notebook, I use the %sql magic command and start reading the data: SELECT * FROM weather. It really is that simple; I didn't have to build a Lambda architecture or anything like that to ingest the data into bronze. That was the COPY INTO command.
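The sequence Nikhil describes looks roughly like the following, issued here through spark.sql from the Python notebook (it could equally be %sql cells). The paths, database, and table names are placeholders rather than the exact ones in the demo.

```python
# Create a database for the bronze layer, backed by a location of its own.
spark.sql("""
  CREATE DATABASE IF NOT EXISTS ingest_bronze
  LOCATION '/ingest_demo/bronze_db'
""")

# Incrementally load any new CSV files from the landing zone into a Delta folder.
# Re-running this only picks up files that have not been loaded before.
spark.sql("""
  COPY INTO delta.`/ingest_demo/bronze/weather`
  FROM '/ingest_demo/landing/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")

# Register a table over the Delta folder and query it.
spark.sql("""
  CREATE TABLE IF NOT EXISTS ingest_bronze.weather
  USING DELTA
  LOCATION '/ingest_demo/bronze/weather'
""")
display(spark.sql("SELECT * FROM ingest_bronze.weather LIMIT 10"))
```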
Next, let's talk about Auto Loader. It's the same idea: we use an Auto Loader command to see whether there is new data, and if there is, it processes it; you can even get a notification that a new file has appeared. I do similar things: this is my Auto Loader table, which is a new location, and I set a checkpoint location. Checkpointing is how Auto Loader knows where to start: if something happens and Auto Loader shuts down, when you restart it, it picks up from the point where it stopped, so you don't have to reprocess all the data or restart your pipelines from scratch. I give it my landing-zone location, where the CSV file exists, and then I write the Auto Loader command, which is a Structured Streaming command: I tell Auto Loader this is a CSV file, this is my checkpoint location, load the data from the May CSV file in the landing zone, and then write the stream out in the Delta format with trigger once set to true. What does trigger once mean? Once Auto Loader has read and processed the file, the Spark stream shuts down. That's really convenient: imagine you run this in an automated fashion on a job cluster; it reads the data and then the stream shuts down. If you don't set it, things run continuously: the stream waits for data to come in and the cluster stays up and running. Then I give the Auto Loader checkpoint location and start it. When I run this command, a stream is created and the data is moved to the new location. I can check: I read the data from the Auto Loader table and see the same data. Similarly, I can check the data folder: it has a few Parquet files, two actually, plus the Delta log, so I know the operation was successful. So we've now done one full iteration: we moved the data from the landing zone into the bronze zone, using both Auto Loader and COPY INTO.

Lastly, I want to quickly cover incremental ingest. We said that COPY INTO and Auto Loader both enable incremental ingest, so you don't have to re-read the whole folder; only the new files are ingested. To showcase it, I read ten days' worth of data for June 2020: I did the same operation, converted it, and saved it as a June CSV file. Now, if I look in my landing zone, there are two files in it: one for June 2020 and the previous one for May 2020. I run practically the same COPY INTO command I ran earlier, with the same target location, the landing-zone location, the file format, and the format option, and what I see is that around 3,180,000 rows were inserted, which is different from the previous run; previously around 3,169,000 rows were inserted. To check, I did a quick SELECT over the table and found the total was around 6 million rows. So only the June file got processed, and my table's total count is now around 6 million rows. That's how the COPY INTO command maintained incremental ingest. I did a similar operation with Auto Loader; you can just put this in a notebook and run it as a job. Because of the checkpointing, Auto Loader knows where to start from, since it starts after the last file it processed, and I was able to ingest the new data there as well.

So that was a very quick demo. To recap what we showcased: we ingested data from the open NOAA dataset via the azureml-opendatasets library, landed it as a CSV file in our landing zone, and then used the Delta ingestion frameworks, Auto Loader and the COPY INTO command, to move the data into the bronze layer, the raw ingestion and history layer. It's good to go, and now the idea is that you can use it for the silver and gold layers and build applications, machine learning models, or BI applications as you incrementally filter, clean, and augment that data.
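An Auto Loader cell along these lines might look like the sketch below. Again, the paths are placeholders, and the schema-location option reflects how Auto Loader's schema inference is typically configured rather than the exact cell in the demo.

```python
landing    = "dbfs:/ingest_demo/landing/"            # hypothetical landing zone with CSV files
bronze     = "dbfs:/ingest_demo/autoloader/weather"  # hypothetical bronze Delta location
checkpoint = "dbfs:/ingest_demo/autoloader/_checkpoint"

# Auto Loader ("cloudFiles") incrementally discovers new files in the landing folder.
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "csv")
       .option("header", "true")
       .option("cloudFiles.schemaLocation", checkpoint)  # lets Auto Loader infer and track the schema
       .load(landing))

# trigger(once=True): process whatever is new, then shut the stream (and job cluster) down.
(raw.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint)
    .trigger(once=True)
    .start(bronze))
```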
Perfect. I think we have a question for John: is Auto Loader the same as Spark Streaming, and are there any benefits? That's a very good question. Auto Loader is actually a way to ingest data using Spark Structured Streaming. What I mean by that is that it's another way to do it, in the same way that you can use Kafka to ingest data with Spark Streaming. There are a couple of very good reasons to use it as opposed to something like Kafka, one being that you don't have to include another technology and moving part: you can use Azure Data Lake as, essentially, your streaming source instead of something like Kafka. There are other reasons as well. We use schema inference with Auto Loader, which makes it extremely convenient and easy; as you can see from what Nikhil did, he didn't give it a schema for the stream, he used it directly without doing that. It also allows those schemas to evolve over the course of time, which is pretty powerful, along with various other pieces, such as being able to use CSV and JSON interactively and being able to give hints about what the schemas are. Any other questions, Nikhil?

Not that I can see; I'm just checking whether there's a question the whole audience would benefit from, and I'm publishing the answered ones as well. That's all for now. Feel free to ask your questions; it's an open forum, and the more interactive you make it, the more the whole audience benefits, so keep the questions coming and we'll answer as we go.

One more thing about Auto Loader: it doesn't necessarily have to be 24/7 streaming. If you scroll down on your screen, Nikhil, to the trigger-once piece: Auto Loader allows you to stream, but you can also go between streaming and using it as a batch instead, using the statement there on line 7, trigger once. That's a very powerful idea, because you don't always have to use streaming technologies; a lot of the time the other features are good enough and really great to use on their own, for example the fact that you can incrementally move data, as Nikhil showed, without re-ingesting the May data when you're in June, along with the features I pointed out around schemas, schema evolution, and the hints you can give about the type of data coming through. One other thing I want to point out, and Nikhil hit on this as well: if you don't want to do any of that, you can use COPY INTO instead, which is a batch technology, and that's generally the main difference between the two.
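The schema conveniences John mentions are exposed as Auto Loader options; a hedged example of what they can look like (the option values and paths here are illustrative, not from the demo):

```python
# Schema inference with type inference and hints, on top of the earlier cloudFiles read.
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "csv")
       .option("header", "true")
       .option("cloudFiles.schemaLocation", "dbfs:/ingest_demo/autoloader/_schema")
       .option("cloudFiles.inferColumnTypes", "true")           # infer numeric/timestamp types, not just strings
       .option("cloudFiles.schemaHints", "temperature DOUBLE")  # nudge specific columns to a known type
       .load("dbfs:/ingest_demo/landing/"))
```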
Perfect. Let me check on the questions; I think we're good for now, I'm just answering a few. So, as a first step, as John and I mentioned, the notebook is in a GitHub repository: go download it at your end and try to ingest some data. Azure has a lot of open datasets, and NOAA is one of them, so download some data, play with it, and see how you can do this yourself. In this demo we put everything on DBFS. Ideally we don't recommend putting data on DBFS if you're working in a production sort of environment, but this notebook is meant to run in any environment. You can use ADLS Gen2 instead: the landing zone and the bronze location can both be on ADLS Gen2 or Blob Storage, and it would work perfectly fine; you just need to mount those containers onto DBFS and you're good to go from there. There is documentation on how to mount your containers onto DBFS, or even how to access the data directly from those containers.
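For reference, mounting an ADLS Gen2 container to DBFS generally looks like the following. The storage account, container, secret scope, and service-principal values are placeholders; in practice you would follow the official documentation (and keep credentials in a secret scope) rather than copy this literally.

```python
# Placeholders -- substitute your own storage account, container, and service principal.
storage_account = "<storage-account>"
container       = "<container>"
client_id       = dbutils.secrets.get("demo-scope", "sp-client-id")
client_secret   = dbutils.secrets.get("demo-scope", "sp-client-secret")
tenant_id       = dbutils.secrets.get("demo-scope", "sp-tenant-id")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Mount the container so it shows up under /mnt/landing in DBFS.
dbutils.fs.mount(
    source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
    mount_point="/mnt/landing",
    extra_configs=configs,
)
```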
So go in and get some data. COPY INTO and Auto Loader are really powerful: a lot of customers use them to move data from one folder to another, to move it into the Delta format, and they're good to go from there. I think we have a bunch of questions as well; let me check. One question is: for processing streaming data through Auto Loader, should the cluster be up and running for 24 hours? No, you don't have to; that's the beauty of trigger once. What it does, specifically, is take and finish whatever exists at any given time and then turn itself off when it's done. It's a very powerful feature that makes sure you don't have to keep things up 24/7.

Nice. One more question we can take: suppose you want to incrementally process data through the medallion architecture. You get all the data into the bronze layer, say a file for June lands in bronze, and now you want to process it into the silver layer; how do you go about incrementally processing that data? That's actually a very good question. There is something called Change Data Feed, which lets you take the data going into a Delta table and, off the back of that table, process the changes at a row-by-row level, whether a row was inserted, updated, or even deleted, so you can take action on any of those values based on the change data feed. We call it CDF; it's a CDC technology specifically for Delta tables. That's a great question, because it essentially rounds out the ability to do the medallion architecture fully, all the way through, and with CDF, and also with Auto Loader, you can do all of this in a streaming and continuous manner if you want to.
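A small sketch of what that can look like with Delta Change Data Feed; the table names and starting version below are illustrative, not from the session.

```python
# Enable the change data feed on an existing bronze table (a one-time table property).
spark.sql("""
  ALTER TABLE ingest_bronze.weather
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the row-level changes (inserts/updates/deletes) since a given table version,
# and use them to incrementally maintain a downstream silver table.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 1)        # illustrative starting point
           .table("ingest_bronze.weather"))

inserts = changes.filter("_change_type = 'insert'")
inserts.write.format("delta").mode("append").saveAsTable("ingest_silver.weather")
```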
Perfect, thanks a lot, John. Another question: does Auto Loader work only with Databricks and Delta Lake, or can it load data from other sources and ingest data into other databases? Let me break that into two questions. First, can you take data from different sources? You can: it's made to read from different sources as long as those sources are object storage, and the data is of a type such as text, CSV, or JSON, so it supports various formats, but it reads them from object storage. The other part: it is specific to Databricks and Delta; it's an ingestion tool for Delta tables.

One question is about how to create a job, and I'll quickly show how to create a job for this notebook. Hopefully my screen is still visible. On the left-hand side there are a couple of icons, and one of them is Jobs. I click Jobs, and there's a Create Job button, so I can create a job from there and call it, say, test 123. The next thing is what the job needs to run; we have a couple of options, such as a notebook or a JAR file. I select a notebook (I'll just pick some random notebook for now), and then I give the cluster configuration I want it to run on, and it creates the job. What happens is that at that point in time a cluster with this configuration comes up, it runs the notebook, and then it shuts down. So if you have a trigger in it, like COPY INTO or trigger once, the cluster just comes up, runs that particular notebook, sends the results, and goes away. Another way to do it, which is a little simpler: I'll go back to my notebook, and from here I can schedule it directly, so I can schedule the notebook as a job, say run it every day, and confirm it. So you can do it either way; I just created a job that runs on a schedule.

Perfect. What else? We showcased the jobs piece. I'm happy to answer this one since Nikhil can't see it: do we integrate with Azure ML Studio? Yes. From the ML perspective, you can build all your models here, and we integrate a lot, natively, with Azure ML, so you can move models from one environment to another if you want to. If you want to serve your models out, you can move them to Azure ML and then use AKS, Azure Kubernetes Service, or similar solutions to serve them. That integration definitely exists. To be honest, we integrate almost natively with all the surrounding Azure services: ADLS, ADF, Synapse, Azure ML; for all of these there's a deep integration built in, which gives customers a fairly seamless experience to work with.

A question for you: would you use a Databricks job over running an ADF pipeline? That's a great question; they actually integrate with each other. Essentially, you can call a notebook from ADF, which hands off, say, the ingestion of data into that notebook and runs it as a job; that's the normal path we actually see, so it's more an integration of both than using one over the other. I don't know if you have anything to add. Yes, it's mostly a preference: you can definitely stitch different notebooks together and create a pipeline, but if you want to orchestrate your pipeline in a way that's much more visible, more GUI-based, you'd use ADF and build the pipeline there. A lot of our customers use those features, because you can orchestrate your pipelines in ADF while everything still runs on Databricks, taking advantage of those powerful clusters and Spark. So you can definitely do it; it's just a preference for how you want to orchestrate your pipelines.

We have time for a few more. How difficult is it to add a column in Delta Lake when the schema changes? It's seamless, actually. There is a setting on a Delta Lake table that allows you to evolve the schema over time, and when it sees new columns it simply adds them. It's pretty easy and seamless, and it's just a property associated with that table, on a table-by-table basis. Yes, it's really easy to do: by default we enforce the schema, but we understand schemas change and evolve over time, so it's really one single setting, and then you can add columns to your Delta table.
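Concretely, the schema-evolution behaviour described here is usually switched on per write, or handled with explicit DDL; a hedged sketch with hypothetical table and column names:

```python
# Option 1: let an append evolve the table schema to include new columns.
new_batch = spark.read.option("header", "true").csv("dbfs:/ingest_demo/landing/")  # batch with an extra column
(new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # add any new columns instead of failing schema enforcement
    .saveAsTable("ingest_bronze.weather"))

# Option 2: add the column explicitly with DDL on the Delta table.
spark.sql("ALTER TABLE ingest_bronze.weather ADD COLUMNS (precipitation DOUBLE)")
```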
Perfect, let me see. I think we have one final question: I have two tables in the silver layer and I make an inner join between them; when there is new data, what is the best strategy to load the data into the gold layer? That would actually be how you do it, quite frankly, and there's a second caveat I'll mention in a moment. Essentially, you literally inner join those two tables, and the product of that join becomes your gold table. You can do the same for silver tables, where one silver table might have only an ID, say for sales orders, and you link it to another table that tells you what that ID is associated with, say a company; you inner join them, take the product, and ingest that data, whether using CDF for an iterative, streaming case or, in a batch case, using the full tables to populate the gold table. The place where a full batch overwrite makes sense is when you have extremely complex aggregations that essentially build a new table over what's already there. That whole idea is also easily done in Delta because of the ACID transactions: you can do an overwrite and still access the table as-is until the overwrite is finished.

Perfect, I think we're done with the questions; I see no more. Anything else from your end, any final remarks, John? No, I'd just say we're excited to see what ideas this spurred, whether that's from the features we showed with COPY INTO and Auto Loader, or from thinking through the weather data, or whether identifying the dataset as a whole is where you find your inspiration. It would be great to see what comes of it, and we'll have two other sessions like this where you can keep working on your ideas as we go through new features that will hopefully spur on a lot of great ones. Perfect. My experience has been that data ingestion can sometimes be difficult, and that's why we introduced COPY INTO and Auto Loader, to make data ingestion a seamless process. Looking forward to the next sessions from our end, and happy to see what comes out of the hackathon.

Awesome, thank you both so much; this was great today, folks. If you want to follow up with us after the session, feel free to send mail to msusdev@microsoft.com, and I hope to see you at our next session in about a month. Thank you very much. Great, thank you, Matt.
Info
Channel: Microsoft DevRadio
Views: 611
Rating: 5 out of 5
Id: iFLEc8cJYZ4
Length: 64min 41sec (3881 seconds)
Published: Thu Aug 19 2021