An Azure Data Engineer's ETL Toolkit - Simon Whiteley

Captions
So, this is generally an intro. It's very high level, skimming across a selection of tools, explaining how I use them and how they fit together. If you're expecting a super deep dive and to learn exactly how to use these tools, that's not what this session is. If you want to get a flavour, get some context, and know what you should go and learn next, you're in the right place. Sound good? Right, let's go.

Hi, my name is Simon Whiteley, and this is a data engineer's ETL toolkit: the various bits and pieces that come together for me to do data engineering, to get data to the right people. It's largely focused on Azure, so I'm doing that in the Microsoft cloud, and on why SSIS is not necessarily the right thing any more. Normal housekeeping stuff: please silence cell phones, because you're on TV and you don't want to be that guy. And PASS — there are lots of things we all like about PASS; it's why we're all here.

Me: I'm a Director of Engineering. It's a fancy title; there are only two of us in the company, so we had to pick fancy titles. I run a consultancy in the UK, Advancing Analytics, essentially doing this kind of stuff: helping people with their data engineering problems, helping people productionise machine learning models, helping people get data from A to B. I spend a lot of my time helping people go from on-prem, old-world ways of thinking to: how do we adapt, how do we become agile, how do we react faster, how do we meet the needs that people actually have these days? So that's me. Grab me on Twitter — happy to answer any questions.

More housekeeping: the Microsoft cert team. There is a data engineering certification, and if you're interested in that, the guys have been running workshops, Q&As, that kind of thing. You've missed today's one, but there's another tomorrow at 12:45 and another one after that. If you're interested in that cert, look out for those guys, go say hello to them in room 204, and you can learn a bit more about what's on the certification, how you onboard, and all of that.

Okay, so this is what we're going to talk about. What is a data engineer? This is very opinionated — there isn't an official definition — but I'll explain how it fits in context in my head, the lineage, where they come from, the challenges with the way a lot of people currently do things, why it's changing and what's bad about it. I'm not going to hate on SSIS too much, don't worry, just a tiny bit. Then we'll look at the new tools: we're going to look at Data Factory, we're going to look at Databricks, we're going to have a look at Logic Apps — the gotcha that always comes up, everyone asks "isn't Logic Apps the thing for that?", and I'll show you where it fits in — and then circle back into Data Factory and have a look at this new thing, Mapping Data Flows, which is now GA, so you can use it today. I'll show you what it is and how it fits in. Then the big bit at the end: giving you the context of how it all fits together, because if you're building a solution, you might use all of these tools to achieve it. I'm trying to get you into the mindset of how it all fits together. Sound like a plan? Still awake? Good.

Okay, so what is data engineering? It's a massive question, and for me it's a combination of three things. We've got the traditional BI developer — or the person supporting them, getting the data to them; it might be them themselves — essentially the ETL developer, the person building SSIS pipelines, the person getting data from A to B, into facts and dimensions, shaped properly.
It's also the person supporting the data scientist — and data engineering is a term that evolved with data science; originally the big data engineer was the person writing Scala jobs in Spark to support the data scientists, and that's where the term first cropped up. And then there's the software engineer, the person actually doing app dev, and the person looking after the database for them and getting the data into it. All three of those rely on this idea of a data engineer, so it's a multi-faceted thing: there are a lot of customers the data engineer has to get data to, and unfortunately that also means there are a lot of assumptions about what they're going to do. From the BI point of view, it's just an ETL developer who wanted a fancy title — and I do get that quite a lot. From the software engineering point of view, it's "that's just me, if I happen to do some data". And data scientists see it as a cog in their process, a small piece that belongs to data science, part of their puzzle, not something serving these other people. Because of those conflicting demands we don't really have a good definition for it; ask three different people what a data engineer is and they'll give you slightly different answers. At least with the Microsoft cert we're starting to get a "this is the tool set, these are the skills, this is what we're expecting".

In terms of what those different people are asking for: the BI developer comes and says you need to know warehousing, you need to understand slowly changing dimensions, junk dimensions and all of that, you need to know Kimball, data quality, you need to know how to build an audit fact for the proper Kimballites out there. The data scientist or software engineer doesn't care about any of that. They're more focused on: you should be doing good architectural design, avoiding the monolith, using microservices and loose coupling — all these ideas coming from software engineering where it's just expected that you're a good developer. Traditionally, the way a lot of BI grew up, those app dev principles didn't really get involved. I speak to a lot of people and say "you're a developer", and they say "I'm not a developer, I'm an SSIS person". You are a developer — but the two never quite meet. And then data scientists have an entirely different perception that's all about ML engineering: productionising their models, building efficient pipelines to deliver one specific data set into a model. You're not worried about building a catch-all star schema that can answer 80% of the world's questions; you're delivering a particular data set. Because of that, there's confusion across all of these different expectations, and that's what we're trying to meet — it's why the world has become more complicated for the data engineer: there's a lot more expected of us, a lot more we're trying to achieve.

So, long live — oh sorry, that's wrong — long-running SSIS packages, because SSIS doesn't scale. Basically, I want to break the reliance on SSIS a little bit, and I want to talk through why: because of that new world, because of those expectations on me, because of people wanting me to be a software engineer, to be agile, to be reactive, to be able to react quickly to changes — and how that balances against what we currently have in SSIS.
So, the old world, the way we always did things, is essentially this pattern: I'm going to get data from somewhere — it could be a relational source system, my OLTP system, could be CSVs, could be anything — and I'm going to dump it into my warehouse. I land it in a staging table, pick it back up, do some cleaning, some quality, some validation, put it back down, pick it back up again, turn it into facts and dimensions, shape it, and land it in my star schema. SSIS is that tool: each of those hops, I'm lifting it up into SSIS and pushing it through. Familiar pattern? Good. And then orchestrating and managing all of that, I usually have something like the SSIS master package, the global executor of doom: a load of things staging my data in a sequence container saying "run all of these", then all my dimensions, then all my facts. Familiar? Okay, I'm just making sure I'm on the right page and things haven't changed in the two years since I last touched SSIS. Weirdly, they haven't.

So why am I having to change, why am I having to challenge this? I've got a few little quotes for the kinds of things the business — the report consumers, the people who need this data to make decisions, to get on with their job — actually come out with.

"I need new reports." "We don't have the data yet." "Does that matter?"

"My report says profit's 100, but yours says 87 — why?" Different numbers, disagreement — I mean, that's the age-old argument for why we've got a warehouse, right?

"This vendor's amazing — they've managed to find a way to encode our data into image files. You guys can read that, right? Just OCR it out. Is that cool?" No.

"We've set up a satellite office — you don't mind just halving the ETL window so they can have their reports live at the same time, right?" Suddenly things have to go faster. Can you deal with that? No.

"Can you add a new step in the ETL process? We've got a new validation — just break apart the existing schedule and slot it in, that's fairly easy, right?" No.

"And why did it take you three hours to get back to me on my support request and update that report? It's not that hard." Fairly common. The reason I'm grey is these kinds of questions.

But take that and translate it into what those people actually want, what they're really saying. They're saying: can we reduce the development effort to onboard new data? Can we react quicker — can I actually turn around and say "you've got a new data set, sure, you can start querying it now", not "our release window is in two months, so we can start capturing it in two months, is that okay?" They're saying: I want a single platform for reporting — which makes sense, that's always been the argument. I want to cope with a whole wide array of files and formats: image files, audio, various different things. I want to scale effectively to meet changing demands — if someone says "the warehouse is corrupted, can you reload it?", I don't want to wait a week for it to process all the way through. It should be agile by default — you should be building things that can react to change, where we can quickly chop and change how we're working.
And we should have support built in — we should be thinking about how we go about supporting these processes. So I took that wish list and turned it into a checklist: low dev effort, many use cases, flexible formats, elastic scaling, agile, and supportable. Now guess what: we're going to quickly benchmark SSIS against these. How do you think it's going to do?

Low dev effort is an interesting one, because SSIS is actually fairly easy. Yes, you can get quite deep into it, you can performance-tune it, you can get into the guts of it — but if I have to build a data flow for every single different data set, for me that's high dev effort. It's not that the work is hard; it's that I'm churning out an absolute ton of it. "But wait," I hear you cry, "I've got my magical tool called Biml, so I can just automatically spit out a hundred or so SSIS packages and the job's done, right?" And yes, absolutely, that's really good, that has solved a problem — but for me that's fixing a problem with SSIS, it's not actually becoming low dev effort. All of those packages I've got to deploy; I need to build big DevOps pipelines, so every time I generate a new package I add it to source control, push it out, get it onto my server — and that, for me, is a fair overhead. So yes, you can make it less dev effort than it used to be, but I don't call it low dev effort.

Formats and use cases: sure — CSVs, SQL databases, a whole variety of data sets I can push in. Can I do JSON, XML? Kind of; it's not that pretty, but I can. But how well can I do things like geospatial data, or video? How can I stream data using SSIS, and how quickly do I have to resort to big C# applications and additional ETL tools, with SSIS doing all my batch data while streaming goes a different path? It's not that extensible to some of the more modern ways of working. And as everyone adopts the cloud, as everyone starts buying cloud software as a service, our data is no longer in my own database; it's coming from an API that delivers me a changing set of file formats, and I need to be able to react to that and ingest modern data formats.

Okay, scalability — don't worry, this is going to get a lot more positive in a bit, just bear with me. I can run a cluster, if I really fancy paying for multiple copies of an Enterprise SQL Server licence; I can pay for as many as I want and run lots and lots of SSIS packages in parallel. But what I can't do, for that one really, really big data set, is say "just use all of my compute". It'll scale out for concurrency, but it won't scale up for a single execution. I end up having to build something horribly manual to partition up my data set and send it off to different packages, and that's a huge amount of development just to have something that scales — or I have to buy a bigger and bigger, more expensive server and pay for more cores on my Enterprise licence. So scalability: for me, the answer's still not there.

Agility: I've got my master package of doom, I've generated some more SSIS packages, I need to go in and add them. Maybe I can do that through Biml, but the way a lot of people traditionally work, if you've edited that package then you've got to test that package. So to take my small additions and get them live, actually deployed onto my production server, I need to go through a whole test cycle and make sure everything works.
You can automate that, but then that's not low dev effort either — that's a whole load of up-front work to get there. These are the kinds of challenges people have been dealing with for years. It's a bit like Stockholm syndrome: we all love SSIS, but it's a familiar pain.

Supportable: I've used Biml, I've got hundreds and hundreds of packages deployed, and one fails. Sure, they should all be identical carbon copies — but what's the chance that we generated them, then had to make a slight exception, and that one's slightly different, and how was it documented? The code footprint — a very software-engineering term, the total amount of code I'm looking after — is huge. Yes, I auto-generated it, I didn't have to hand-crank it all, but it's still quite hard to support because I've got this mass of different packages. So for me, even doing that little audit: is it an agile, responsive, good data engineering tool? No. Okay, good — I had to get that off my chest. Sorry.

So what are the alternatives? Why are we looking at other tools, and what can we do to redress this? We've got Data Factory: an orchestration tool in Azure for the control-flow side of things — run this, then run this, then run these in a loop, then do the next thing. Data Factory is very good at that, and we'll look at how it works. We've got Databricks — Spark — a big crunchy data transformation engine, so for mass transformations of data we've got an engine that can do it and will scale a hell of a lot better. We've got this thing called Mapping Data Flows: inside Data Factory you now have the ability to build very visual pipelines that use Databricks under the hood, so if you're looking at Databricks and thinking "I've got to learn a bit of Python, I've got to learn how this tool works", you can drag and drop and build things that look a lot like an SSIS data flow but actually scale. And there's a thing called Logic Apps: lightweight integrations, kind of the Azure version of If This Then That — if something happens, do something else. You might also know it as Microsoft Flow, now known as Power Automate, if you've come across it. Those are the four things we're going to whip through: what is it, roughly how does it work, what's the main takeaway, and a super quick look at what to expect when you see it.

Okay, onwards: Data Factory. What is this thing and how does it work? As a very rough primer, Data Factory does two different things. It's either copying data — take data from one place, move it to another: go to my SQL table, pull some data out, land it as a CSV; go to SharePoint, pull out some data; go to Dynamics and pull it down. It can talk to various different things, but generally it's pulling data from one place into another place, not changing the data en route, not doing transformations on that side — and that's baked into Data Factory, part and parcel of it. On the flip side, it can do transformations, but by calling other things. Aside from the new thing, aside from Mapping Data Flows, it generally relies on other tools to do the transformation. It orchestrates things: it can
say "run a stored procedure, then run a Databricks job, then run another stored proc" — layering up various different things, poking these other services and telling each one "you do that, then you do that". It's an orchestration engine. Under the hood it's all stored as JSON, so if you've got multiple people collaborating it's slightly better than the old XML, but still — it's fairly familiar once you have a look at what's going on in there.

The kind of things we're talking about: go and pull data from a vast array of different sources. It can read lots and lots of different things — go to Amazon Redshift and pull data out of it; it's very good at going into systems and extracting data — and then you can land it inside Azure. All the things it can read data from, it can't necessarily write data back to; it's more "give me all your data, pull it into Azure". And it can talk to lots of things: it can call C#, so I can do my script-component-type things; it can talk to on-prem databases, to Azure SQL Database and SQL Data Warehouse, to HDInsight and Databricks. It's got lots and lots of connectors — if it's in Azure, Data Factory probably has a connector, or there's one coming.

What does it look like? It should be very familiar to the SSIS developer: a sequence of activities, precedence constraints linking them together, and each activity you can drill into to set the properties and details. It's deliberately made to be accessible and familiar if you're coming from the SSIS world, and we'll have a bit more of a look to get a flavour. It's also got a monitoring plane built in, so as it's running I can go and have a look, dig into the history, see what's actually happening under the hood, and get a nice shiny dashboard telling me all my things have run. Mine never actually looks like that, but in this case it's a nice green circle, which is good.

Now, there's one slightly annoying bit of terminology: in true Microsoft fashion, they decided to call everything the same thing. I've got a thing called an integration runtime. This is the bit of compute, the server that sits somewhere magically behind Data Factory. If I'm doing a data copy — take data from one place and put it into another — it has to use a server somewhere, and I need to pay for that compute. That's completely on demand, and it'll switch region depending on where I'm moving my data, so it's always in the most efficient place. That's the Azure default integration runtime: just some compute that sits behind Data Factory and pushes data from one place to another.

To make it more confusing, there's also the SSIS integration runtime. If I want to run SSIS packages, I can: I can deploy them and have them sitting underneath Data Factory. I choose how big a server I want and how many servers I want, and I need to choose which region it lives in — so unlike the default one, which dynamically switches depending on what I'm doing, I have to say "SSIS, you're running in this region". If you're pushing data between Azure regions there's an egress charge and there are performance constraints, so make sure you choose the right place. But it is just SSIS: you don't have to get a special version of SSIS, you don't have to change anything in your local development
environment — you just deploy it out to Azure and orchestrate those packages using Data Factory. So you don't have to give up on SSIS; you can get some of the benefits and lift and shift without going back and rewriting your entire estate. Get SSIS running inside the cloud, then use the nice parts of Data Factory's orchestration, as a first step.

Then there's also the self-hosted integration runtime. This used to be called yet another thing — the gateway, the data management gateway or the on-premises gateway, I forget which one's Power BI and which one's Data Factory — and it's now the self-hosted integration runtime. If you're trying to get data into Azure and your infosec guys are saying "you're not poking a hole in our corporate network so you can come and get data, that's not happening", that's what the self-hosted integration runtime is for. It sits inside your network and constantly polls outwards, asking "is there anything for me to do?", and if Data Factory says "yes, I need to extract some data", it pushes the data out. That means you don't need an inbound firewall exception; you don't have to poke that hole through to your SQL Server. So the gateway is really good — a nice, secure way of pushing data into the cloud and integrating with it. If you're trying to get data into Azure, Data Factory's self-hosted integration runtime — again, snappy name — is what's used for that.

All right, let's have a quick look at Data Factory, and I'll tell you about some patterns we use to make it go better. I'm going to jump over to my portal — I've got my "data wrangler" Data Factory here — and have a look at some stuff. A really simple example: I've got a copy data activity. It's not a complex workflow, just a single copy data activity, and it's looking at two data sets: underneath it I've got one on the lake side and one on the AdventureWorks side. A bit of Data Factory syntax: a pipeline is the same as an SSIS package, my workflow of stuff; each of the different things I'm doing is an activity, and I can chain lots of activities up inside a pipeline. Then, in order to know what I'm connecting to, I've got linked services — my connection managers, if you want the SSIS term — the various systems I've set up the configuration for, where I've told Data Factory how to connect. And then I've got my data sets: a data set, within that connection, within that SQL Server, points at a specific table. It's "what data am I getting" — just a bit of metadata describing how I connect to this stuff.

But I've decided to be fancy, so I've given this data set a parameter. I'm saying: I'm not going to hard-code you, I'm not going to point you at a specific table in SQL Server; I'm going to tell you which table each time. In my connection I'm saying "connect to that SQL Server", but I'm parameterising which table it connects to, so I can start to make things very dynamic. It doesn't have a schema, because the schema is going to change every time I run it. And on the flip side I've got my lake: I'm landing data into a lake and building up the folder path dynamically. So I've got two completely dynamic connections: one that's going to pick different tables
inside SQL Server, and one that, for whichever table it's been given, will put the data into a matching folder in my lake. Switching over to my pipeline — again, this is a demo just to give you a flavour of what it can do — if I pick up my dynamic copy, inside that activity it's passing the parameter down from my pipeline. I can give my whole pipeline a parameter, which tells one data set the parameter, tells the other data set the parameter, and links it all together. So I can run this, reuse this one bit of logic, pass it different table names, and it will go to SQL Server, extract that table and land it in my lake. I've suddenly wiped out 100, 200 of the packages I would have had in SSIS doing that "data flow pulling data out" job.

Then I can wrap something on top of it. I've got a lookup sitting here, so the pipeline itself goes off, queries the database and asks "what tables do you have?" If I go inside my lookup, down in my settings, it's saying: just go to sys.tables, give me the list of tables. So if someone sneaks in and adds an extra table to my SQL Server, I don't have to update anything — the next time I do an extract, I'm automatically going to get all of that data. If you talk about time to insight — how long between them changing that source system and me at least getting it into a lake — it's the next time I run it; it'll automatically be there. The whole DevOps idea of pushing things out as quickly as possible: what's faster than automated deployments? No deployments. It just happens automatically. That's the mind switch we're talking about, of things going a lot faster. I'm just going to kick that off and we'll have a look at it later.

A quick bit about some of the patterns we're seeing, to put in context what some of those things are doing. One: yes, I can have SSIS in here, so I can have a pipeline with an SSIS activity. I need two things: that SSIS integration runtime — a server set up to actually do the compute, the running of that SSIS package — and a SQL DB that acts as my catalog. If you've got those two bits set up, you can just run packages. You need to be a little careful, though, because on that integration runtime you get a little dial asking how many servers you want — one, four, eight — but remember the SSIS package won't scale out automatically across them: each individual package will only run on one of those server nodes. So only pick multiple if you're going to be running lots of things concurrently. And it's so tempting, when you first start, to take your existing master package and all the child packages, deploy that, and kick it off across a cluster. Do not do that. That is bad times: if you've got an SSIS package that calls child packages within that same SSIS package, it won't scale out across the nodes, it'll all run on a single one. You might be paying a huge amount of money for eight SSIS servers and only using one. So if you do end up putting SSIS into the cloud, if you do decide to use SSIS inside Data Factory, be very careful and do the orchestration — that master package — in Data Factory: make a Data Factory pipeline that calls the child packages individually, and then at least it'll scale properly across that cluster. Good. Okay, so, some
patterns. This is the way we used to do things, the SSIS way: I've got three different files to load, so I've got three different pipelines, all hard-coded. Because of the things we just saw, I can instead pass a parameter into my Data Factory pipeline and dynamically change what it's doing — pick a different file based on what I'm doing. I can do a lookup activity, so I can query a SQL Server and, based on the results, do something different. I can have config files — go and read a config file and change what the pipeline's doing, almost like SSIS environments: check the config and make a load of changes depending on what comes back. I can do my for-each loops: here's an array of stuff, loop through it all, and choose to do it sequentially or in chunks in parallel — by default up to 20 things in parallel, scaling up to 50. That suddenly gives you a huge amount of flexibility. I can call child pipelines — a bit like the child packages we saw — so I'm reusing logic again and again by pointing a for-each loop at that child pipeline. And I can use common workflows: error-handling pipelines, so whenever an error happens, go and run this same logic. I can start to build reusable components — not only dynamic things, but tools I can repeatedly use in lots of different scenarios. It's all about thinking: how do I minimise the amount of code I'm writing and make it as flexible as physically possible? That pattern is the one we just saw: the data acquisition goes and looks at the SQL Server, brings back a whole list of things to do, then loops through and executes the same thing for each. That's a massive game changer in terms of the speed you can react to people.

There are a couple of different ways to trigger these pipelines. I can have a time-boxed trigger: every half hour, take that half-hour chunk of things; every day, run for the current day. I can run on a schedule — 2am, midnight — and have an ETL window, if I desperately want to tie myself to an ETL window. But I can also trigger it dynamically: I can call a REST API, I can have something happen that triggers my pipeline. I don't have to have it arbitrarily running every hour because I feel like I need that schedule; I can have "something's happened — react, do that immediately". Suddenly I've got the power to be a lot more dynamic about when it executes, and we'll come back to that with Logic Apps.

Quick recap: Data Factory is for orchestrating all of our data workflows — no matter what other tool I'm using, I'm still using Data Factory as my control plane, that orchestration layer. It's the best way of getting data into Azure, unless you have a huge massive thing and you need to ship a hard drive to Microsoft. And remember to keep thinking: can I parameterise this, is there a way I can make this dynamic? If you find yourself copying and pasting and you've got hundreds and hundreds of pipelines, ask yourself "am I doing this wrong?"
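As a concrete illustration of that "call a REST API" option — this is a minimal sketch, not part of the demo, and the subscription, resource group, factory, pipeline and parameter names are all placeholders — any bit of Python that can get an Azure AD token can start a parameterised pipeline run through Data Factory's createRun endpoint:

```python
# Hypothetical sketch: trigger a parameterised Data Factory pipeline via the
# ADF REST API (createRun). All names below are placeholders.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY = "<data-factory-name>"
PIPELINE = "<dynamic-copy-pipeline>"

# Get an ARM token (service principal, managed identity, az login, etc.)
token = DefaultAzureCredential().get_token(
    "https://management.azure.com/.default"
).token

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
    f"/factories/{FACTORY}/pipelines/{PIPELINE}/createRun?api-version=2018-06-01"
)

# The request body is just the pipeline parameters, so the same pipeline
# can be re-run for any table name.
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}"},
    json={"TableName": "Product"},
)
resp.raise_for_status()
print("Started pipeline run:", resp.json()["runId"])
```

This is also, broadly, what the Logic Apps "create a pipeline run" action we'll see later is doing on your behalf.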
So, that was a lightning-quick view of Data Factory. Let's go and see if my pipeline actually ran. In my monitoring plane I should be able to see — there we go, it kicked off (times are in UTC, apologies), and all these different pipelines ran. I can go and have a look at which parameter was passed to each: that one was pulling out the Product table, that one another table, that one the product category. It's fanned out, got that list, and executed the same pipeline for each of them, and my support guys can go in and look at what happened inside each execution — what were the results, how many rows did it copy — and get a nice view of what actually happened. That support process of going back and asking "what happened in that run?" is quite nice, quite interactive, quite flexible, and we'll come back onto that a bit more with the next bits.

[Question from the audience.] Yes — so the question is: it can be triggered by time, but can the pipeline be triggered by an event? There are two answers. It has some basic functionality built in to trigger itself — you can have a blob trigger, so when a file lands in a blob it triggers Data Factory; it's a little bit limited in what it can do with that — but you can also trigger it from other things, and we'll come on to Logic Apps, which are really good at saying "an event's happened, do something", and we can use Logic Apps to trigger our Data Factory. I'll show you how to do that.

Okay, so: Databricks, also known as fancy Spark. With Databricks we're using Spark, a big data engine, and this comes back to our data scientists asking us for big data engineering. That used to be a really, really hard thing to do — configuring Spark (not Databricks, Spark) used to take a long time — and Databricks just makes it easy. It makes writing massively scalable jobs really quick. It can be tempting, because it's got the name of a big data tool, to think "it's not for me, I don't have big data, I'm not dealing with hundreds of terabytes, that's a specialist tool for special people". I massively disagree. If you've got a few hundred meg, a few gig, up to several terabytes of data, I'd still use Databricks at the core of that, because of its flexibility and agility — and I'll show you one or two things I can do with it that suddenly make everything go faster. Even if you're not using its full potential scale, you can have it scaled down really small and still reap a load of benefits.

One of the tenets, the fundamental things we're using: you saw me taking data from a SQL Server and landing it in a lake — this fancy data lake thing, which is more than just marketing spiel. You've got this thing called HDFS, the Hadoop Distributed File System. Essentially, when I land a file, it's not just a single file: I've got my CSV that's landed nicely, and automatically, under the hood, it gets chunked up into several pieces of data called file extents, and each of those extents has several copies made, distributed in different arrangements across lots of disks. So while it looks to me like a single file — I see a CSV sitting there — in actuality it's held as multiple copies spread across different disks. When I query it, I can query it massively in parallel, reading all of those different chunks of data on different systems at the same time, and suddenly everything goes so much faster. That used to be hard; it used to require me telling it how to distribute — this is the
distribution key, this is how many chunks to make — but that's just done automatically now. Blob storage in Azure will do it for you; it's an HDFS-backed system, and Azure Data Lake Store Gen2 or Gen1 will both do it for you too. So you don't need to worry about it: if you have a lake, lakes are fast because of that stuff.

When we're using Spark, we're essentially building on that. All the data is stored in this data lake as flat files, and Spark is a distributed compute engine: I've got a driver, the thing in charge, the brain, and I give it a query. It then tells each of my workers "you go and look after that bit of data, you look after that bit", so they each do their own little chunk of the processing in parallel. They all go off, query the lake, bring their data into memory on each worker, do their own bit of work, and then bring it together and aggregate it up. That comes from very traditional, old-school MapReduce big data stuff, but it's now been made really accessible; this kind of thing is actually very easy to do these days.

So where do we use Databricks? For me, the most basic thing: I've got data held in a relational database or CSVs, I'm pushing it into my lake, prepping it to get into that warehouse — yes, I use Databricks to do that, as my ETL transformation engine. But I can also use it for geospatial, for streaming, to integrate with lots of the more awkward types of data, and to work with my data scientists. I can solve one of the hardest problems in data science — is it a chihuahua or a muffin? — I can do image recognition, custom vision, automatic subtitling; I can use it for real-world problems, not just silly things. It's hugely flexible in what you can plug into it. I can write lots of different languages in the same tool — Python, Scala, R or SQL — so I can have a load of analysts writing SQL and a load of data scientists writing R and Python, and I don't need to switch between different tools and integrate them; it's one consistent environment. And it plugs into lots of things: ML libraries, Cosmos DB, Cognitive Services, automated ML.

[Question from the audience.] Yes — so the question is: could I write, basically, a SQL statement joining data sets together to make my dimensions, the way I would with stored procs in a SQL database? Yes — except it'll scale. You write the SQL — it's ANSI-standard SQL, not the most modern dialect — and Spark takes that and turns it into a distributed query on your engine. It's made to feel familiar: it looks like SQL, but it's doing big, heavy, parallel Spark jobs under the hood.
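To make that answer concrete, here's a minimal sketch — not from the talk, with invented paths and column names — of that "stored-proc style" SQL inside a Databricks notebook, where spark is the session the notebook provides:

```python
# Register lake data as temporary views, then build a dimension with plain
# SQL; Spark runs it as a distributed job across the cluster.
customers = spark.read.parquet("/mnt/lake/base/customers")
regions   = spark.read.parquet("/mnt/lake/base/regions")

customers.createOrReplaceTempView("customers")
regions.createOrReplaceTempView("regions")

dim_customer = spark.sql("""
    SELECT c.customer_id,
           c.customer_name,
           r.region_name
    FROM customers c
    JOIN regions r
      ON c.region_id = r.region_id
""")

# Land the shaped dimension back in the lake as Parquet.
dim_customer.write.mode("overwrite").parquet("/mnt/lake/warehouse/dim_customer")
```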
Okay, so the thing I'm going to briefly show you — I'm not going to go into the nuts and bolts of it — is a Jupyter-style notebook, a data processing script. I've got a starting layer in my lake, that raw layer, the first bit of the lake where Data Factory landed those CSVs. I'm going to take that and apply a structure to it: okay, this CSV — does it match the structure I'm expecting, these columns with these data types? Anything that matches, I put into the next stage and store as a Parquet file — essentially a column store, heavily compressed, built for analytics — and that's my good data. Anything that's bad, I chuck out and say "there are my rejected rows". Super-traditional first part of ETL, right? Validate my data, apply data quality, check my structures. But I'm going to parameterise it, so I give that one single data processing script different parameters and it does the same job for different tables. Rather than having a hundred different packages saying "this is how you clean that file, this is how you clean that file", I have one generic script I can parameterise and use for everything — and again, suddenly everything becomes quicker.

Let's have a look at Databricks. How are we doing for time? Not too bad. So this is Databricks, and again it's one of the new, fancy tools — it's entirely in the browser, all the development you do through Azure itself; you don't necessarily have a Visual Studio project to go and work with. A few things in here: I've got clusters, so I can specify different clusters — this is my cluster with a couple of really small boxes in it, this is my cluster for the big heavy lifting. I can have lots of different compute profiles, and when a cluster is turned off, I'm not paying for it. I can turn the small one on for quick jobs and the big one on only when I need it, so I can manage very closely how I scale. If my analysts say "it needs to be up, I need to query something", I can give them a fairly small box I'm happy leaving turned on, and I don't have to buy a massive server just because there's a compute spike. All the things we used to do with SQL Server — planning for that ETL workload and buying a huge server because it's one massive spike of work — I don't have to do any more. I can have little compute spikes: turn a cluster on for ten minutes, turn it off, leave another one running. Because the data is separate — because my data is in a lake — I've got separation between what compute I need and what storage I need, and I gain massive amounts of flexibility.

So, I've got a workspace with a load of different scripts. Apologies, it's a bit of Python — just to give you a little dip into it. I've got a quick script, and it starts by saying "here's a structure" — this is just hard-coded — "I'm expecting this data set to have four fields: an int, a string, a string and a string". Fairly easy. Normally I don't hard-code that; normally it goes off to a metadata store asking "what am I expecting to get here?", but it's hard-coded for illustration. Then I give it a place in the lake — I'm just pointing and saying "this is where you can find that CSV sitting in my lake"; again, normally I'd parameterise that. And then there's this little bit of magic: I declare a data frame — essentially like a table variable, an object that's going to have a shape, going to look a bit like a table — and I build it by calling spark.read: use my Spark engine, go and read some data. This little switch, badRecordsPath, says that anything that doesn't fit the structure gets separated out and written off to the side as a little bit of JSON.
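Roughly what that part of the notebook looks like — a minimal sketch with the structure hard-coded and invented lake paths, assuming it runs in a Databricks notebook (where spark is provided, and badRecordsPath is a Databricks-specific option):

```python
# Sketch of the validation read described above: hard-coded structure,
# invented lake paths.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

expected_schema = StructType([
    StructField("ProductID",   IntegerType()),
    StructField("ProductName", StringType()),
    StructField("Colour",      StringType()),
    StructField("Size",        StringType()),
])

raw_path      = "/mnt/lake/raw/adventureworks/Product/"
rejected_path = "/mnt/lake/rejected/adventureworks/Product/"

# Rows that don't fit the structure are written to the rejected path as
# JSON (via badRecordsPath); the rest load into the data frame.
df = (spark.read
      .option("header", "true")
      .option("badRecordsPath", rejected_path)
      .schema(expected_schema)
      .csv(raw_path))
```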
That is all I have to do to get rejected-row handling: anything that doesn't parse goes over there as a rejected-records file, and I take the rest of the data and carry on with it. Some really small tweaks — and yes, there's a little bit of a learning curve to get started with this — but for things that were previously quite hard to do, there's functionality built in that makes our lives insanely easy. Then, to write the result out, I take that data frame — referring to the variable I created up there — and call .write.parquet to push it out. It's a fairly simple example, just to illustrate what the syntax looks like, but a lot of our data processing scripts do exactly that: get some data from here, apply some logic, apply some transformation, write the data out to there. The same kind of process we were doing with SSIS, except this time it's script-based. And I can have it in SQL too — if I go back to this one, my notebooks can be quite interactive: I can have SQL queries, I can register data and query it like I would a SQL Server, except this is over data sitting in the lake. So it suddenly becomes not that scary; Databricks can do lots and lots of things.

The other thing I need, if I'm working this way, is to be able to drive it from outside. At the top of this notebook I've got an input parameter — I'm expecting something to be passed into this notebook to tell it what to do — so I've got a little bit of code saying "add a widget": this notebook expects an input parameter, I'm going to call it fileName, I'll default it to Product so that if nobody gives me a file name there's a default value, and I label it "input parameter" so it looks nice and accessible. Then inside my code I can refer to whatever the value of that parameter is as the file name. So I can line up: this is what I'm expecting to be given, and this is how I use it. It is heavily script-based, and you do need to learn a little bit of coding to get started, but the plan is that I build one script, take that hit of complexity once, and then mostly just reuse the same script — an initial spike of hard work and then life suddenly becomes easy. And finally there's this little bit at the bottom, dbutils.notebook.exit, which lets me pass some data back: if I'm calling this from somewhere else, I can pass parameters in, build up a string, and pass that back to whatever called it. So I've got a two-way call and response: input parameters in, do something, give me some output back telling me what happened.
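And the parameter and output side of that same kind of notebook, sketched under the same assumptions — invented names and paths, spark and dbutils being the objects a Databricks notebook provides automatically:

```python
# Sketch of the parameter and output plumbing described above.
import json

# Input widget: shows at the top of the notebook, defaults to "Product"
# if the caller doesn't supply a value.
dbutils.widgets.text("fileName", "Product", "Input Parameter")
file_name = dbutils.widgets.get("fileName")

# df stands in for the validated data frame from the read sketch above,
# now driven by the parameter instead of a hard-coded path.
df = (spark.read
      .option("header", "true")
      .csv(f"/mnt/lake/raw/adventureworks/{file_name}/"))

# Write the good rows out as compressed, analytics-friendly Parquet, and
# register them so they can be queried with plain SQL over the lake.
df.write.mode("overwrite").parquet(f"/mnt/lake/base/adventureworks/{file_name}/")
df.createOrReplaceTempView(f"{file_name.lower()}_base")

# Pass a result string back to whatever called the notebook; a Data Factory
# notebook activity surfaces this as the activity's output.
dbutils.notebook.exit(json.dumps({"fileName": file_name, "rowsLoaded": df.count()}))
```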
Now I'll show you the example of why that makes sense when it all comes together. I've got a pipeline, ValidateAdventureWorks: for all those tables where I landed the CSVs in that first demo, I can now pick those CSVs up, try to apply each individual structure, and separate out the rejected rows. I've got a notebook activity, pointing at a specific Databricks cluster — so I can say "run this on my big cluster" or "run this on my small cluster", I can be fairly flexible — and pointing at a specific notebook; I can browse and see which notebooks are currently deployed to Databricks. So I've associated it with a notebook and I'm telling it that parameter — except this time, like I did the first time, I'm passing a pipeline parameter into that notebook parameter. I can call this Data Factory pipeline several times and give it different values: Data Factory, run it for this table, then this table, then this table. And then, guess what, same as in that first demo: I can do a lookup and a for-each. Go to my configuration — I've got a config database I'm using to orchestrate everything — get the list of tables I want to work with, and run my notebook for each of those tables. Which again means that if someone gives me a new data set, I just add it to that list: I change the metadata, I add a record to a SQL database, I don't do a code deployment. Things are quicker, things are much more agile. So if I just run this again — I can add a trigger — it's going to go through and run that notebook for however many different files I've got. How are we doing for time? Not too bad.

You can see all those notebooks have kicked off: my master pipeline has kicked off all these child pipelines, passing in different parameters, and the parameter passed to each one gets passed down to the Databricks notebook, so it does something different in each run. This is what we're thinking about when we're doing data engineering and ETL now: how can we be as flexible, dynamic and agile as possible? If I dip down into one of these — it might have even finished by now; no, it's still going — I can have a look at what's actually happening. Oh, and I didn't start my cluster, so it's going to take four minutes, and then we can go and look at the results. We'll crack on while that waits.

And that is kind of the trade-off: because I'm using Databricks and I don't want a cluster permanently turned on that I'm paying for, I let my cluster get turned off. There's a setting on the cluster called auto-terminate — if I've not used you for 15 minutes, turn yourself off — and that's not me building a load of PowerShell to automate things, it's just a setting that will automatically shut it down. It means the next thing that tries to run has to wait four or five minutes while the cluster starts up. That used to be super painful: I'd have to write tons and tons of bits of PowerShell — here's a bit of PowerShell to start my thing, wait until it's finished starting up, now run a load of things. That's just baked in now: if Data Factory tries to run a Databricks notebook and the cluster's not turned on, it just says "okay, I'll turn it on", and as soon as it's on, it runs through everything.

So, Databricks in a nutshell: it is Spark as a service, Spark made easy, made by the guys who wrote Spark — so they kind of know what they're talking about — and a hugely powerful transformation engine. In the example I was using, I've just got two VMs, two fairly small servers running it, and it's fairly cheap. But I can scale this up: I can pay for more expensive VMs and just keep scaling out — I think the biggest one we've heard about was six thousand nodes, so it can go quite big. And that's the really nice thing: the script for doing this on a few hundred meg of data and the script for doing it on a few petabytes of data is the same; I just need to change the cluster underneath that's running it.
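As an aside — the demo drives this fan-out from Data Factory's lookup and for-each, not from Databricks — the same "one generic notebook, run once per table" idea can also be sketched from a driver notebook inside Databricks itself using dbutils.notebook.run, just to show how small the pattern is. The notebook path and table list here are invented:

```python
# Aside: the same fan-out pattern driven from inside Databricks rather than
# from a Data Factory for-each. Notebook path and table list are invented.
tables = ["Product", "ProductCategory", "ProductSubcategory"]

results = {}
for table in tables:
    # dbutils.notebook.run(path, timeout_seconds, arguments) runs the
    # parameterised validation notebook and returns its exit string.
    results[table] = dbutils.notebook.run(
        "/Shared/etl/ValidateCsv", 600, {"fileName": table}
    )

print(results)
```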
Suddenly everything's a lot more flexible. And you can use Python, Scala, R and SQL — with C# coming — so it's really flexible on language. Pick the language that makes sense for you: as long as you're using data frames and working within the most common libraries, there's no real difference between those languages — with some caveats and exceptions — so if you want to pick Python because Python's easy and you've got some grads who know Python, you can.

Okay, so Logic Apps — what are these crazy Logic App things? I like thinking of it like this: I've got a load of different services — and that's just a handful of the many it supports — and I'm saying "if someone adds a new record to Salesforce, then write it to a SQL DB"; "if a build completes in Azure DevOps, then post it to a Slack channel". It's that kind of thinking: something triggers it, then it goes and does something else. So if something sends a message into a queue, check my metadata database, work out what I should do with it, return a response and then kick off my Data Factory — suddenly my Data Factory runs can be automatically triggered. Logic Apps is that really lightweight little workflow thing. Now, very important: Logic Apps are not for moving vast amounts of data. They're small messaging things, good for integrating; they'll scale out really effectively if you throw lots and lots of messages at them, but if you throw 100 meg of data at one, it's going to start struggling. It is not for big, chunky data transformation. For passing small amounts of data back and forth, though, it's great. Think of it like SSIS: if you've got an object variable, you don't try to pass a massive data set into it — you're going to have a bad time. So Logic Apps: small bits of integration and decision-making. That's how it fits in for me.

Let's have a bit of a look. I have a Logic App in here — and again it's heavily web-based, the designer is web-based, it all lives inside Azure. I'm going to go in and edit it. It's a really, really simple Logic App with two things in it. First: "when a blob is added or modified" — I've attached it to a blob store, so it's constantly watching that blob storage. I've told it a container — you're looking at the landing container — so it's watching that particular container in my Azure blob store, checking it every three minutes. And that's it; I can just drag and drop. To show the kinds of things you can work with: if I add a new action, it comes up with a big list of all the different things it can do. I can do text analytics — hook it up and work out sentiment on things going through. I could trigger my Logic App every time there's a tweet with a certain hashtag, check the sentiment, and if they're being nasty about me, send them a mean tweet back — I can automate my sarcasm on Twitter. There's loads of flexibility; it can talk to lots of different things. In this case it's really simple: when that blob is triggered, create a pipeline run. I'm pointing it at my installation, at a particular Data Factory — inside that resource group, that particular data wrangler Data Factory — and I'm
picking which pipeline inside Data Factory to run, and then there's a little bit of JSON to pass it the table name. On the Logic Apps side, each of these actions automatically pumps out a load of variables — that blob trigger gives me the name of the blob, the list of files — and I can work with those quite easily. Depending on the trigger you've picked: a Twitter trigger has the name of the user, the contents of the tweet, the picture they attached; a SharePoint one has the list context, the name of the file, the metadata. Each action just pushes out a load of parameters and metadata you can work with. Here I'm just taking the file name — the only reason there's a little bit of code in there is to strip out the file extension. So it's a really, really simple triggering mechanism: it's watching a particular container, and if I land a file in there with a name, it'll run the pipeline for that name. If I land something called ProductCategory, it'll run that Data Factory pipeline for my product category.

Let's do a quick example over here. I've got Storage Explorer — this is how I go exploring my lake, my file browser — so I can see what's sitting inside my data lake, and I'm going to add a file and pick one to load. This isn't how I normally trigger things, by the way, just an example of what you can do; this could be someone landing data, someone sending me data, and because they've sent it, that's what triggers everything else. My Logic App is sitting watching that particular blob store, so I should see it execute in a moment, get the context of that file, and then run a load for that particular file. I possibly should have had it running before I actually loaded the thing, but give it a second to make sure it goes through — and there we go. Now in my Logic App I get the context: I can see why it triggered, what it ran, and that it's picked up the fact that it's the ProductCategory file. So this is my trigger, my event watcher: it's gone off, said "I'm triggering my Data Factory", and passed that in as my parameter. If I go back over to my data wrangler Data Factory and look at my pipeline runs, right at the top — there we go — the run for product category has already happened.

So again, we're moving away from ETL windows, away from the idea that arbitrarily at midnight I'm going to start my data processing. Because I'm using compute in Azure, because I'm entirely isolated from any on-prem computer — I'm not sharing a server with my OLTP system — I can just use compute whenever I need to. As soon as I get data, I can start processing it; I don't have to arbitrarily wait for an ETL window just because that's how we always used to do it. You can challenge a lot of the ways we used to work. And you can see my Databricks one has also now run, so we can have a look at what actually happened inside: my cluster must have started, it got passed that file name — again, I'm using the parameters — and on the other side I can see the output telling me how many rows were rejected and how many succeeded. We've got automated ETL. Again, it's not doing the transformation design for me —
Again, it's not doing the transformation for me; it's not going to magically know how to work out what a dim customer is. But a good chunk of what I normally spend a lot of time on is no longer a problem, so I'm just reducing the amount that I need to customize on a data-set-by-data-set level. Useful? Good. We're not done yet, so cracking on.

So that's Logic Apps: really small, lightweight bits of integration. Something's happened, make a decision, call something else; triggers and actions. And to repeat: it's for lightweight messaging only. It has a huge array of connectors and can talk to lots and lots of different things, many more than Data Factory, again because it's all about looking at small amounts of data, not doing big data extracts. I think it's really great for triggering data pipelines, for working out when things should run, and for having reactive data processing. I used to call it event-driven loading, but the EDL is a massive right-wing thing in the UK and that's a bad name to give things, so we moved on; reactive data processing kind of makes sense.

Okay, so finally: ADF mapping data flows. This is fairly new, and yes, it is another thing called data flows. It's not the same thing as Power BI dataflows or Power Platform dataflows (I think those are called something different now). Inside Data Factory we've got this thing called data flows, and what it is is Databricks with a nice UI over the top of it. So if you were looking at that bit of Python and going "oh god, I don't want to start writing Python, that's too cody, that's too deep, that's not for me", it looks like this. It looks like good old SSIS data flows, except when I hit execute it runs on a Databricks cluster; it runs at scale, in memory. So it's a good, scalable, fast way of executing this stuff, but it's familiar; it's the modern, nice way of working with it. When you click on it you get the list of actions: I want to aggregate, I want to do a derived column, I want to do a split. They're suspiciously similar to what we used to call things, so it's not that massive a shift to start working with, because it's built to be familiar, built around the same kinds of things we've always worked with. Again, you're building up a list of transformations. When you execute it does have a bit of a startup time, because it has to start a Databricks cluster; it's provisioning Spark for you behind the scenes, but you never see the code. Inside Data Factory it's generating a Scala file, compiling it down into a Java executable (a jar file) and then pushing it onto its own Databricks cluster. It currently runs on its own cluster, not on your Databricks cluster, so it's still going to take a while to execute because it may need to spin that cluster up.

And actually there are two flavors of it. The one I had a picture of is mapping data flows; there's another, currently in preview, called wrangling data flows. What is that? It's Power Query. So if you're happy in Power BI land and you're building out extractions, writing M, using Power Query to bring data in, currently in preview is the ability to do that inside Data Factory: literally lift and shift the same M code, but it'll run on Spark, it'll run on Databricks. So if you've got a lot of analysts going out and building their little ETL ("little ETL", that's really patronizing), you can pull that out, productionize it and make it scale.
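You never see the generated code, but to give a feel for what those graphical transformations boil down to, here's a rough Spark DataFrame equivalent written in PySpark. The real generated code is Scala, and the paths and column names here are made up.

```python
# Rough PySpark equivalent of what an aggregate / derived column / conditional split
# in a mapping data flow does under the hood (the real generated code is Scala, and
# these paths and columns are illustrative only).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.read.parquet("/mnt/lake/clean/SalesOrderDetail")

# "derived column": add a line total
sales = sales.withColumn("LineTotal", F.col("UnitPrice") * F.col("OrderQty"))

# "aggregate": total sales per product
totals = sales.groupBy("ProductID").agg(F.sum("LineTotal").alias("TotalSales"))

# "conditional split": route big sellers down one stream, everything else down another
big_sellers = totals.filter(F.col("TotalSales") >= 10000)
the_rest = totals.filter(F.col("TotalSales") < 10000)
```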
Those two flavors, mapping and wrangling, are doing exactly the same thing underneath: they take a set of instructions, turn it into a jar file and run it on Databricks. My mapping data flow is doing exactly that, a list of transformations made into a jar and pushed to Databricks, and wrangling is just the Power Query version of it. If you're used to web Power Query it looks exactly the same, except you get to it from inside Data Factory.

How are we doing for time? Not too bad. Okay, so if I'm back in Data Factory land I can go in here, and I've got this extra data flows bit down the bottom; that's where I'm designing those transformations. I've actually created several different datasets, and unlike the one I was using earlier, which was dynamic, used a parameter and didn't have a schema, I've physically created these datasets: for this parquet file, for that parquet file, I'm attaching them to actual datasets. I'm saying this is where to go and find it in the lake, this is the full path, I'm pointing at the product table, which means it has a schema. These are actual registered bits of data. So whilst my earlier, automated bit was super generic and super dynamic, in these later bits I'm applying business logic, and I can't really automate that too much.

I've got a few of those, and I've got a parent pipeline, my main pipeline, and inside Move & Transform I've got that data flow element. This is saying I want to use data flows, I want to use a graphical thing which is going to run Spark for me. Inside there I need to tell it which data flow to use (and I can create a new one from there), and I need to tell it which integration runtime to run on. Because this is Databricks baked inside Data Factory, it needs to spin up its own cluster, and it needs to know what that cluster looks like. So I can quickly create a new one, pick an Azure one, and under this data flow runtime it's going to ask how big it should be, how many cores it should have, how much money I want to pay every time I run this. You've got some flexibility in deciding that, though it's not quite as dynamic as some of the other stuff in Data Factory. And I can push parameters into it and play around with it.

The main thing is the data flows down here. On my data flow I get this design surface, which again feels a lot like SSIS, and I can attach various different streams; I've got my different sources of data coming in. It's a little bit zoomed in, but you can see I'm pulling from product and then deciding which columns I want, so I'm dropping columns, renaming things, shaping it a little. I can have a derived column that says, actually, for some reason my ID came in as a string, change it to an integer; normal kinds of prep things. And I'm doing lookups, again much like SSIS: I've got my product table, now look up the product category, now look up the product model. I'm building essentially a dimension. So if you want a visual way of looking at data transformation, a visual way of building the various different bits of your warehouse, you can use this for it: different streams coming in, pointing at those different datasets, pulling it all together. Now, there are some really advanced parts of this. I can allow it to have some schema drift, because, you know what, if the data types change slightly or a few extra columns turn up, don't break, just carry on going, do what you can; so I can build some kind of protection into it.
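To make that dim-product flow concrete, here's a hedged PySpark sketch of roughly the same steps: select and rename a few columns, cast the string ID back to an integer, then "look up" (left join) the category and the model. The paths and column names are illustrative, not the demo's actual datasets.

```python
# A hedged PySpark sketch of the dim-product style flow described above: select and
# rename a few columns, cast the string ID back to an integer, then "look up"
# (left join) the category and the model. Paths and columns are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

product = spark.read.parquet("/mnt/lake/clean/Product")
category = spark.read.parquet("/mnt/lake/clean/ProductCategory")
model = spark.read.parquet("/mnt/lake/clean/ProductModel")

dim_product = (
    product
    .select("ProductID", "Name", "ProductCategoryID", "ProductModelID")
    .withColumnRenamed("Name", "ProductName")
    # the ID arrived as a string, so cast it back to an integer
    .withColumn("ProductID", F.col("ProductID").cast("int"))
    # lookups, SSIS-style, are just left joins here
    .join(category.select("ProductCategoryID", F.col("Name").alias("CategoryName")),
          on="ProductCategoryID", how="left")
    .join(model.select("ProductModelID", F.col("Name").alias("ModelName")),
          on="ProductModelID", how="left")
)

dim_product.write.mode("overwrite").parquet("/mnt/lake/curated/DimProduct")
```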
I can also have parameters and wildcards and have it push a lot through for me automatically. Generally I use this for the transformation bit; I don't do the cleaning with it, because for cleaning it's easier for me to have that one Databricks package, and I've got a more powerful transformation engine in Databricks than I do in here currently. But again, this is in active development, there are lots and lots of new features coming, and it actually went into general availability a few weeks ago, so you can go and use it; it's live and supported. Okay, that will take a little while to run, so I'm just going to kick it off and we'll look at some conclusions while that's going.

So again, as a reminder: it's low-code data transformation. It's actually got a lot of functions built into it; I can do regex, I can do lots of different derived columns, I can have some very powerful calculations in there. But it's deliberately built to be accessible, fairly easy to get started with and really easy to pick up, so your data analysts can look at it and go, okay, I kind of know what's going on here. It's not quite as flexible as pure Databricks: I can't plug in lots of machine learning libraries, I can't use a whole plethora of different R libraries and Scala libraries and Python libraries, and I can't just write massively recursive bits of code. But for that kind of shaping of data it's super useful, and remember that it is using Spark, so it will scale really well.

So, going back to the start, when I was slightly mean about SSIS: it's because, against that checklist, against all those lists of things, I just don't find it performs as well as those other tools do together. Now, I'm not saying that any one of those tools individually ticks all the boxes; there isn't a one-stop shop that does everything in the world, although Data Factory is getting closer and closer. Data Factory itself is fairly low dev effort, because I build one thing and then reuse it massively, and it is super flexible. There are some things it can't do, which is why we then rely on Databricks behind it, but you can see with those patterns (look something up, dynamically decide what's going on) it is flexible and it is scalable, and suddenly I can react to change much more quickly. Databricks is definitely not the low-dev option: you do have to write a bit of code, and yes, you can write SQL, but you normally end up having to write a little bit of Python, a little bit of Scala or R, depending on your preference. For me it's an early spike at the start: once you've built some of these really common packages you can then use them flexibly for lots of things and not have to go in and change them that much. So it's initial hard dev and then actually not that much dev, depending on how you're using it; you can get really deep and build heavy applications, but it is massively scalable, massively flexible, super dynamic and parameter driven. Logic Apps: super low dev effort, deliberately built to be accessible to pretty much anyone; not that scalable, can't do lots of things, don't throw massive data sets at it, but for that little bit of integration (something's happened, kick off the next thing; someone's landed a file on FTP, kick off the next bit) it's fantastic. And then mapping data flows: getting better, getting more features, and it is low dev effort, but it's not as parameterized as it could be yet.
That idea of build once, use a hundred thousand times: it's not quite there yet. It's getting there, and when it's at the point where I can just list a load of really generic transformations and say "apply this set of transformations to all these data sets", then it will truly be there in my eyes. So: getting there, not quite there yet.

So, typical usage: this is the kind of thing I'm building these days. I've got a whole plethora of different new data sources and I'm landing them in the lake, because I don't want to have to deploy a schema, design a table, change a database. So I'm bringing data in, landing it in the lake, picking it back up, doing some cleansing automatically, landing it back down again, maybe picking it up and doing some shaping, transforming it into facts and dimensions if I'm doing that kind of thing, and then pushing it into a warehouse. It's the more modern data warehouse platform. I'm using Data Factory to do the data acquisition, because it's nice and easy and scalable. I'm using Databricks to do the validation, because as you saw I can write automated data validation packages, which makes me happy inside. Then, maybe so it's accessible to my business analysts, I'd be using mapping data flows to do the shaping, assembling my actual business transformations, and then using Data Factory to put it into a warehouse. These different tools come together in that data journey quite nicely: those different steps, my Databricks, my mapping data flows, are all orchestrated by Data Factory, and if I wanted to be really, really dynamic, really event driven, then I have Logic Apps triggering those things. That's the new combination of different tools that just unlocks so much more potential.

So the kinds of architectures we're talking about these days are a hell of a lot more complicated than they used to be. It's no longer just a picture with a single SQL Server in it. If you're working in the cloud, you are stitching together the best of lots of different tools. It does become more complicated to achieve these things, but then it's not that much more complicated; these things are built to talk quite nicely to each other. And if we're talking about SQL Server, realistically Integration Services, Reporting Services and Analysis Services are all just different tools too; they just happen to come bundled under one install. So we're dealing with lots of different things here, but they're all very specialized for their various different jobs: we're pulling data into a lake, we're doing some data processing, we're pushing it out into a warehouse.

So hopefully we should have a mapping data flow that's run. Let's go and see... it's still in progress, boo; it'll be starting up that Databricks cluster, so I can't go in and show you the results of that one. I have run it recently though, so let's see whether I have a dim product run in there. There we go, right at the bottom. Once it's run and spun up that Spark cluster, I can go and see what actually happened inside, and I get my full history, all my results. I can step through and say what actually happened, how many rows there were; I've got this nice view through all my transformations of what was happening and what I was filtering out. I get a nice, actual, auditable history of all the stuff going on, and it just happens automatically. When I'm writing out data I can see how many rows I wrote out and how long it took.
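If you'd rather pull that run history programmatically than click through the portal, here's a minimal sketch using the Data Factory Python SDK; the subscription, resource group and factory names are placeholders, not the demo's.

```python
# A minimal sketch of querying Data Factory run history with the Python SDK.
# Subscription, resource group and factory names are placeholders.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

now = datetime.now(timezone.utc)
filters = RunFilterParameters(last_updated_after=now - timedelta(days=1),
                              last_updated_before=now)

# pipeline runs over the last day
runs = adf.pipeline_runs.query_by_factory("rg-datawrangler", "datawrangler-adf", filters)
for run in runs.value:
    print(run.pipeline_name, run.status, run.run_id)
    # drill into each activity; the activity output JSON carries its run metrics
    activities = adf.activity_runs.query_by_pipeline_run(
        "rg-datawrangler", "datawrangler-adf", run.run_id, filters)
    for act in activities.value:
        print(" ", act.activity_name, act.status)
```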
So mapping data flows is really, really good for that kind of easy management of this stuff. Okay, cool, that is everything I had to go through. Again, session evaluations are super useful, and don't forget that the certification is out there: if you want to become a data engineer, if you want to get into this kind of thing, the certification is a really good way to start. It actually makes you read into a lot of this stuff, gives you the chance to practice, and makes sure you've covered the bases. So hopefully you're now all inspired to be a data engineer and stop writing SSIS. Thank you.
Info
Channel: PASStv
Views: 515
Id: 0mTxcRbu__8
Length: 70min 29sec (4229 seconds)
Published: Tue Feb 16 2021