☁️ Azure Synapse [Full Course] 💥

Captions
hello everybody my name is mitchell pearson here at pragmatic works welcome to another awesome event today we're going to be taking a look at azure synapse analytics so stay tuned [Music] all right so welcome back we're going to be taking a look at azure synapse analytics there has been a lot of interest in this topic we've been very busy around this this year and so we thought hey we'll do an event around this topic a little bit about myself before we jump in i am the training manager here at pragmatic works i've been doing consulting and training for pragmatic works for a little bit over nine years now i started my career obviously doing consulting and then over the last four years i've been doing a lot of training specifically i've had an opportunity to author a few different books books that are on power bi on the power platform as well as on sql server when 2019 came out i do blog at mitchellpearson.com but as most of you probably know a lot of the interactivity and our kind of community involvement now involves youtube so we do a lot of recorded videos on youtube we've done a ton of these learn with the nerd events that are three hours long so definitely go back and check those out i have a wife and three kids right here in fleming island it's right outside of jacksonville florida on the east coast i do enjoy playing tabletop games that has not been something we've been able to do a lot recently as you might expect um and then a couple of channels i put down here at the bottom these are channels for specifically i would uh encourage you to go take a look at pragmatic work's youtube channel because if you know we're doing synapse analytics today but it's very closely tied to another event i did which was azure data services three hours talking about azure the data platform specifically and today we're not really going to get that high of an overview we're going to be jumping in to asa we're going to be doing demos and we're going to be having fun with that all right so and then some pictures of course of some board games so real quick let's talk about the agenda what should you expect over the next three hours that we spend together first of all we're going to be taking a look at what is azure synapse analytics or i call it asa what is asa we're going to take a look at the data lake store that's part of it sql pools dedicated and serverless how we can leverage both of those within the azure synapse analytics environment we're going to take a look at of course pipelines and data flows how do we do etl that's going to be cool we're going to look at notebooks right and then we're also going to look at power bi for synapse and then matt my trusty helper is letting me know that i am not sharing my powerpoint so thank you matt for that there's the slide again for everybody that's going to be the agenda for today i'm going to pop it right back up there since i missed that and then our logistics for today so it's going to be 11 to 2 pm eastern time we're going to take a 15 minute break here around 12 20 that'll give you time to stretch your legs go grab something to drink and then we're going to jump right back into it so those are going to be the logistics and then the last thing i want to say is if you're interested in this right if you're watching this event today and you're like i really want to dive deeper into it we do a lot of these boot camps around azure synapse azure power bi dax power apps and most of the stuff we do is private training for customers who want to train their team so 
if you want to get deeper into this technology we do boot camps with hands-on labs we would love to do training for you just reach out and let us know what you're interested in and we would love to help you with that we also do public boot camps right so if you don't have to train an entire team you just want to train a couple of people on the team then that's the most cost effective way to do that you send them to those boot camps they usually fill up pretty quick so if that one fills up and we can find room on the calendar we'll try to add another one before the end of the year um but take a look at that right if you're interested this is a good opportunity to take and do that training all right so we're past all of that we're past the introduction we're going to jump in and take just a quick look at azure services really as it pertains to the data platform because a lot of these pieces that we're talking about are in azure synapse analytics so we're going to take a look at the data platform then we're going to transition over and say what is azure synapse analytics why do we have it what problems does it solve and then we're going to spend the rest of the day doing demos well we're just going to be working in the environment and checking that out so this right here is a pretty standard slide that i use for a lot of my training and i make some customizations to it very similar to the one that i did when i did the azure data services class but i made a couple small modifications the modifications that i made you might notice are going to be right here in the middle of the screen right so right here i have my pipelines and data flows my it's really that icon that icon that rep represents azure synapse analytics so the data platform in azure is essentially how do we take all of our data that we have in all of these different data sources whether they're in dynamics in on-prem they're in azure they're in sharepoint wherever that data might live and how do we get that data into our dedicated warehouse right and so there's a couple of different methods that we can do that and then where do we load the data do we load it into a sql pool to an azure sql database to a data lake there's lots of options that are available to us in fact part of the reason that we have asa today is because there are so many options and the goal around asa is to kind of simplify that process but we're going to circle back to that in just a moment so we have all of these different data sources and we want to get that into our data warehouse well what we can do is we can leverage pipelines and data flows now if that terminology sounds familiar to you that's because it is the pipelines and data flows that we get in azure synapse analytics are the same pipelines and data flows that we used over in data factory there's a couple of small little differences between the two there's a couple of little things that you can do in you know snaps you can't do over here and you can do over here and you can't do over there but they're very minor and there is a document out there on microsoft docs that kind of highlights the differences so if you're trying to make the decision of which one do you go with um look at that document i try to point most people to asa to azure synapse analytics but that's why this would sound familiar if it doesn't sound familiar to you at all if you're brand new to azure pipelines and data flows is a graphical user interface it's a ui driven way of building a workflow that essentially does etl extract your 
data transform the data apply business rules and then load it into the final destination most people who are on here have probably worked with ssis right and so it's very similar to ssis in that respect now the other method that we have for doing etl and transforming and cleaning data is notebooks you've probably worked with notebooks or heard of notebooks maybe databricks or jupiter notebooks well inside of azure synapse analytics we have notebooks and very often i have conversations with data engineers who are responsible for cleaning and transforming that data who want to and prefer to actually use notebooks rather than using pipelines and data flows now full transparency that's not me i i loved ssis i like data factory i like pipelines and data flows but i get it notebooks give you a lot of flexibility and a lot of scalability they give you a lot of control that you might not get with something that's pre-packaged and you have those components right and so notebooks are pretty cool a lot of capability there for cleaning your data and we're going to dive a little bit deeper into notebooks as you can imagine now finally once we clean the data up we've transformed it we've curated it in some way we want to load it into the final destination and once again when you're working in azure there are it seems like unlimited options for what you can do with your data the different services available where i can load my data really when we're looking at the data platform though we're looking at either probably a dedicated sql pool an azure sql database or maybe our data lake okay and so that's what we're going to be talking about as we progress through the next three hours together once we load that data into our dedicated sql pool then we can choose to maybe put that data or read the data using power bi so power bi is another critical piece of this integration in azure synapse analytics and the one thing that i want to be a constant theme as we go through this workshop together today is the integration that exists within azure synapse analytics and how it really just makes a lot of things a little bit cleaner a little bit easier than how we used to do things in azure so i really enjoy the tool i like the direction of where it's going and i'm excited to be doing this workshop today all right the other thing that's on this slide is it's kind of a conversation around asa it's the fact that azure synapse analytics and we're going to get into this in a second is a unified development environment for really all of your personas within that data platform space and what i mean by that is your data analyst who just need to do ad hoc queries against the data your you know your data engineers who need to clean up the data transform it and load it and then your data scientists who want to build those machine learning models all of those personas within your organization can now work together in a collaborative way all right and so this is really awesome obviously your data scientist is probably not going to be building pipelines and data flows right the engineers are going to be doing that and notebooks are not going to be something that are used by your data analyst they're not going to be writing python and net code to build a notebook but your engineers and your data scientist might be so you have all of these different technologies that are built into asa so you can get that collaborative development and really integration with the different personas and the different technologies the other thing that i 
want to talk about real quick before i jump into the next slide there are a lot of people that join these live events and we love it unfortunately there's a lot of comments and a lot of questions i am not going to have time to like scroll through those i'll be distracted the entire time so matt's going to feed me some questions throughout the day and i'll glance over and try to answer them as i can and then when we take a break later i'll spend that break looking through questions and trying to do that but expect a follow-up video from us we'll do a q a video where we go back through and answer a lot of those questions that you guys have and then subscribe to our youtube channel we're going to be doing a lot of videos on synapse analytics and dropping videos on that so you can stay up to date with you know whatever it is that we're doing videos on all right i just wanted to put that out there real quick in case you're like hey they're not answering my questions you you guys see the chat there's a lot of stuff that's going to be happening in there all right so the next thing we want to take a look at is just kind of some of the core technologies that are in asa first of all every azure synapse analytics workspace requires a data lake to be set up and configured and so every asa workspace will have one primary data lake by default that is a requirement we're going to need that the data lake is just going to be a general purpose storage account that we use and this is where we can store our files obviously that we can then load for our data platform or maybe we just store them there and we query those files directly that's pretty awesome we're going to be talking about that that's going to be one of the first demos we do how do we interact with those files that are in our data lake using azure synapse analytics we're also going to be talking a lot about dedicated and serverless pools sql pools give us the capability to do analysis against big data right dedicated sql pool used to be known as the azure sql data warehouse on-prem it used to be known as parallel data warehouse or pdw and this gives us that massive parallel processing architecture so that we can scale out horizontally and we can run big data analytics when we have lots of data and so you know when you're working with something like sql server on-prem or azure sql database you keep throwing as much hardware as you can at it and you're trying to scale up vertically by adding more cpu adding more memory and you know improving your hard drives but at some point you get diminishing returns it just doesn't benefit you so if you have a lot of data dedicated sql pools serverless is going to help you out with running those very large queries so you can quickly run complex queries against petabytes of data with that notebooks another awesome and cool feature that's available within synapse analytics as we talked about a moment ago notebooks give us the capability of cleaning up the data transforming the data doing some cool stuff with it but also you can build your machine learning models here you can test them you can train them then you can save that off and you can score those models so if you're a data scientist and you want to do your work in here you absolutely can and i do know people today there where we've trained customers and they are using this as their tool of choice right it now one of the things with this is you're probably using other tools today as well so how do you make that migration how do you eventually get on this 
tool that's a conversation that we're having to have with a lot of customers also and then of course pipelines and data flows you'll notice i left a picture of data factory on here because the pipelines and the data flows that we use to move that data to transform that data and load it into the final destination are the same components and the same things that we were working with in data factory so if you're coming from data factory and you're looking at asa for the first time you're going to find out that there's a lot of similarities and a lot of functionality that is very very like what you're used to so you'll pick it up very quickly there all right and then of course the last but not least the integration with power bi so from azure synapse analytics workspace you can connect directly to an existing workspace you can build new reports off of the data sets in that workspace you can look at existing reports and you can edit them and change them so there's that integration which is awesome super easy to use but then you can also connect to your azure synapse analytics environment from power bi desktop that can be done using both dedicated pools and serverless pools which gives you just a lot of capability there that we're going to talk about as we get near the end of the three-hour session today so there's a lot of technologies here there's a lot of stuff here i'm going to go through some slides i think i have about 10 or so slides that talk about asa and then we're going to spend pretty much the rest of the day doing demos and we'll see how many demos we get through for this three hour workshop and then of course take a look at our youtube channel for follow-up videos that dive deeper into different pieces of asa after the event is over all right so with that being said we're going to go in here and we did have a question so why azure synapse sql is so basic compared to ms sql server 2019 and that question came from ed so i don't know exactly what that means when you say basic it is important to understand that you know when you're working with so one of the complaints that people have had around dedicated sql pools for for years right even back when it was on prim is that one of the limitations is the syntax that you can do from a sql perspective there are limitations on you know foreign keys and prior primary keys unique constraints there's a lot of definitely limitations there but those limitations exist because of the way that the engine in the background is processing big data there's concurrency limitations so there's definitely times where you know dedicated sql pool is the right choice and there's times where you know a regular sql database is the right choice and i see companies having both right i have big data terabytes of data 10 years of data that we're you know we're analyzing that we really need to be able to churn through quickly so we're using that for our dedicated pool and then maybe your smaller data marts you might be using an azure sql database and so i i i see what you're saying and where you're probably coming from with the limitations they do exist when you're used to sql server right and probably all of us are coming from that background but they're necessary and i will say i worked with and it was on-prem right i know a lot of consultants here at pragmatic works who work with pdw on-prem and sql pools today have come a really long way from where we started so there's a lot of new capabilities that have built in so it's not as limited as it once was all right 
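to make that constraint point concrete, here's roughly what a table definition looks like in a dedicated sql pool — the table and column names are made up for illustration, but the distribution setting and the not enforced constraints are the real syntax differences from a regular sql server table:

```sql
-- hypothetical fact table, just to illustrate dedicated SQL pool syntax
CREATE TABLE dbo.FactTaxiTrips
(
    TripId       BIGINT        NOT NULL,
    PickupMonth  INT           NOT NULL,
    FareAmount   DECIMAL(10,2) NULL
)
WITH
(
    DISTRIBUTION = HASH(TripId),   -- rows are hash-distributed across the pool's 60 distributions
    CLUSTERED COLUMNSTORE INDEX    -- the default storage for large fact tables
);

-- primary key / unique constraints are metadata only: NONCLUSTERED and NOT ENFORCED
ALTER TABLE dbo.FactTaxiTrips
ADD CONSTRAINT PK_FactTaxiTrips PRIMARY KEY NONCLUSTERED (TripId) NOT ENFORCED;

-- FOREIGN KEY constraints are not supported at all in dedicated SQL pools
```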
so let's talk about azure synapse analytics specifically let's dive into that right now we have something called and this is the microsoft i love it the microsoft marketing we have what's known as the paradox of analytics i want to dive into that one thing you know when i go back to my consulting days right i had sql server bi suite i had ssis i had analysis services i had reporting services and that was kind of it and you know i'd go three or four years and i'd work with the same technologies and then there would be an update and i would learn it but i'd still be working with the same technologies and you'd get new technologies from time to time but it was always a pretty hefty upfront cost right so customers didn't always introduce those new technologies but once we move to the cloud we have all of these technologies at our fingertips i mean we're going to be talking about dedicated pools today a dedicated pool if a customer wanted to implement this mpp architecture on-prem they would spend i believe the number was roughly about a quarter of a million dollars 250 000 to buy an appliance then they would have to pay for consultants in training to get that up and running to build a proof of concept you're you know a quarter of a million a half a million dollars in the hole but in the cloud we can spin up a dedicated pool for a dollar fifty an hour and start working with it and build a proof of concept and that is across the board right so we can we can experience and we can explore and we can learn all these new technologies and all these new resources are at our fingertips and that's great in a lot of ways but it also can create some problems across an organization and that's what the paradox of analytics is the paradox of analytics is where a lot of companies are now having siloed data data that is difficult to bring together because i have these developers over here that are using data bricks these developers over here using synapse notebooks these developers over here using some other technology that does something similar but is ever so slightly different and it doesn't really i can't i don't have a good way to bring it all together right so we've always had siloed data because of data stores i have data in sharepoint in excel and on-prem and azure that's always been a problem and that's the evolution of why we have a data warehouse for analytics we bring it all together but now we're creating these silos of data as a result of technology and this is being created by the fact that we have all these technologies at our fingertips so that is kind of the paradox of analytics right we know that data is growing we know that companies are storing more data than ever before and as that data grows uh you know less and less data is really being analyzed because systems don't really can't keep up with it and that's why we have things like dedicated pools to be able to process that big data to be able to really churn through it but the data volumes are growing all right analytics and ai is the number one investment for business leaders yet they struggle to get a return on their investment we see this across the board right we've seen this for many years and so one of the biggest problems with that return though and one of the big problems that companies are having is exactly right here what we talked about before these siloed sections of our data that result as a result of you know data the data stores where it's stored where the data is coming from but now we're running into skills and 
technologies right so we have all these different technologies that are causing our data to be siloed as well and so one of the things and i'll talk about it here on probably the next slide one of the goals with azure synapse analytics is to solve this problem with the paradox of analytics right so each new technology creates another siloed operation that it brings it harder and harder for us to bring our data together so with azure synapse analytics we're introducing here the analytics continuum and the goal is that all of the jobs that you need to do as a data professional for the data platform we can bring those technologies into a single platform a single workspace a unified environment for all of your personas your analysts your engineers and your data scientists so if you need to do data exploration or artificial intelligence machine learning data warehousing real-time business intelligence or data integration guess what we can do all of that now in azure synapse analytics with the core technologies that we've already reviewed and so a big purpose and a big goal of taking a look at azure synapse analytics is figuring out do these core technologies work for our company and yes maybe there is some flashy new technology over here that does this thing but do we really really need it and understanding that as we continue to adopt new technologies it can create problems and i've done training consulting for so many companies that do have this problem today right they have exactly these problems where they're using all these different services and they're even bringing on people and they're like oh you don't have to know this technology because you know this technology which is going to obviously cause a problem right out the gate because they're going to use what they know right and so right here matt says we got another question from digital will we be covering expression builder in azure pipelines so we won't today but i have a ton of youtube videos that go deeper into that intermediate advanced level youtube videos and demos definitely check those out i won't be getting into those today but if you email me directly you own pearson pragmaticworks.com matt will give it i'll send you those links considering the various technologies used today would you recommend etl or elt approach when using synapse and why that's an interesting one so here's the thing that's a great question fsco digital because when we start developing in the cloud there's a paradigm shift in the way that we think on-prem give me the hardware give me the infrastructure i don't really think about cost a lot of times you know you're thinking about cost in the cloud and so do i use data flows within synapse pipelines which is going to cost me money it's going to spin up a databricks cluster in the background or do i maybe just dump the data directly into my dedicated pool which is always running that always is going to cost me x number of dollars per hour and then i just do the transformation in a stored procedure afterwards great questions right it really depends on a lot of different factors how much data you're working with i like obviously the ui experience of pipelines and data flows and being able to graphically walk through it i'm a visual person i love that experience so i would lean towards that but from a cost perspective if i'm noticing that those data flows are costing me a lot of money i'm going to say hey maybe i can re-engineer this load the data into the dedicated pool then run a stored procedure to clean 
it up afterwards right so elt is very popular for getting the data out of the source system fast just get it out but now in the cloud we can also talk about cost right does it make more sense to do the data transformation maybe cleaning the data curating it in my dedicated pool after i load it into a staging table or is it just as well to do it as i go through so there is no exact answer on this i don't have one for your organization we'd have to look at a lot of different criteria there but it is a great question great conversation starter there maybe that can be a youtube video we follow up on and look at a couple different scenarios but here it is right this is all the different things that we need to do and all of this is part of azure synapse analytics it's part of this continuum all right i think we're pretty close to getting near the end of the slides here what i want to take a look at as we transition into the demos is i do want to walk through the process of provisioning asa and talking about different factors now i've already provisioned my environment i've already provisioned resources right we want to get the most out of this three-hour workshop but i do want to take a look at what does it look like to provision it what are some of those key elements that you need to know about and then we're going to kind of move on from there and then we're going to do a navigation of asa so if you've never been in asa before azure synapse analytics we're going to jump in there right now for the first time you're going to get a view of what this looks like if you've been in data factory but not asa you're going to find out that the experience is very similar it's very similar and there's a lot of great capabilities in both so i think that brings us to our very first demo here let me go ahead and bring up my asa environment and my azure portal all right so i'm going to be in the azure portal like i mentioned from the very beginning earlier i'm not going to be doing a introduction into azure and going through the azure portal we've done that class learned with the nerds i think it was called azure data services please if you're interested in that go back and check it out but i do want to walk you through what it looks like to create an azure synapse analytics workspace so the first thing i will do is in the azure portal i'm going to go ahead and click right here to create a resource and then for the resource right here i am going to search for synapse right so we'll go ahead and search for synapse i'm going to look for azure synapse analytics right there and then i'm going to go ahead and create and walk through the steps now i would encourage you for the video here i want to give you guys as much bang for your buck so that's why i didn't give you files in advance and i don't have you following along but if you look at the description after this video is over i'm going to upload some files so you can kind of go through it and watch the video again alright so i will provide that afterwards but for now i'm going to go a little bit quicker than i would if you were in one of my boot camps and we were doing a deep dive and that's because i it's going to be recorded you can pause it you can stop it you can go back so i want to give you as much as i can right i'm going to go ahead and click create here and when i click create just like any other resource i'm going to go through and choose my subscription i'll choose my resource group here azure synapse workshop and then i have to choose a managed 
resource group name alright so what this is is this is going to store ancillary objects and items that are managed by azure synapse analytics so that is a requirement if you don't provide a name it will actually create this like crazy guide so i like to provide a name a lot of times i'll give it the exact same name as my asa environment and just add mrg to the end or i'll name it the same as my resource group if i know that resource group is only going to have one asa so i would come in here and say you know what let's go with azure synapse workshop learn with the nerds manage resource group and i've probably already created it so i'll type it in twice all right so that is going to be a requirement you'll give it a workspace name so azure synapse analytics workshop whatever you want to call your asa this does have to be unique because that workspace name is technically a server that you can connect to so whenever you want to connect to your dedicated pools whenever you want to connect to your serverless pools you're going to be connecting right here to this all right and then i got a good question here i don't know if it's a good question but peter smith is always here so because i saw his name i'm going to answer that question as someone who has never worked with synapse before what would be the compelling argument to pick synapse over sql server database other than microsoft pushing synapse at all cost so the big thing and i still push a lot of customers peter to sql database right i like sql database it's intuitive i understand it i understand sql pools too but sql pools and here's another thing let's talk about this real quick a lot of people today have a lot of confusion over they think synapse is dedicated sql pools and that's just not the case there's a lot of other capabilities in azure synapse analytics so that you can still open up and create an azure snaps workspace without ever provisioning a dedicated pool i get this conversation all the time with customers when we're scoping engagements and they're like no let's do everything in data factory because we're not interested in dedicated pools i'm like no no no we can still do everything in synapse analytics and it gives us great integration capabilities without ever provisioning a dedicated pool so that was a little bit of a marketing snap food by microsoft where everybody thinks that synapse analytics is dedicated pools and that's not the case it's all these different technologies we can create an asa environment without ever provisioning a sql pool the big reason for dedicated sql pools is big data right if you have a data scientist that are going to be churning through terabytes petabytes of data sql pools are going to be the most cost effective solution for that you can spin up a sql pool you can run it for as long as you need you can pause it and you can save money on the compute because it doesn't have to run all the time so if you're doing big data analytics you're doing you need to improve the performance of your dashboards and you can't really get it there with the sql database or you know that your data volumes are going to grow and you just want to go ahead and get it set up in dedicated pools that's where you can start but there's still a lot of customers that are on sql database and sql database is still a very very strong tool for analytics as well so good question once again not an exact answer it depends on your data your organization how much data you're working with and there's been customers i've pointed both 
directions absolutely so we give it a name we go to east u.s too and then we have to now choose our data lake remember i told you we must set up a data lake a primary data lake for the asa environment that is a requirement so we'll do that and then we have to provide a default container or a default file system and so i can go down here and grab one that i've already created and there we go now next and i'm just going to walk through these real quick to talk about a couple key elements next you have to provide a server name and a password this is sql server authentication by default now once you provision it you can go into the azure portal and you can set up your azure active directory you can set it up to be yourself or some other service account whatever it is but you can set up an aad account on this workspace once it's done and never use this server admin again but you do have to set it up from the beginning and so that is going to be a requirement and then there's one more thing here that's pretty critical i might have missed it i missed it on the last slide this right here is the main problem that users run into with asa specifically with serverless pools okay if you want to query data that's in your data lake and you get this it's it's i haven't had this error in probably i don't know six months but it's a very random error message that you get but you got to make sure that you are a storage blob data contributor on the data lake to be able to query it and you might say well mitchell i'm an owner on the data lake right owner surely surely is enough permissions uh no unless they've changed something recently no you have to be an owner and well you don't have to be an owner but you do have to have storage blob data contributor that is what asa looks for when you're running those serverless queries against your data lake so if you add additional data lakes to your asa environment you need to go back and make sure you are a blob data contributor on those all right this is this is one of the most common errors that i see this will do it automatically for the primary for the person who's provisioning it but if you add additional people to your synapse workspace you want to add them in all right very important very important you put this in under networking this is very critical i have a 30 minute video that dives deep into understanding virtual networks for securing your environment so i'm not going to talk about it here that's a video i would recommend that you go out and watch as well how to set that up and how that helps you get around kind of you know public ip issues and stuff like that and then right here at the bottom you can tell it to allow connections from all ips not safe great for proof of concept testing right i turn this on and then usually afterwards i go into the firewall rules and i turn it off i put only the necessary ip addresses and so the one i'm doing today only has my ip address nobody else will be able to try to authenticate or connect all right and that's it that is the steps for creating an asa workspace it's not hard and notice going back to peter's question notice i did not provision a dedicated pool so you don't have to provision a dedicated pool in this process all right i'm going to hit cancel because like i said i've already provisioned that resource and now we're going to go over to my asa environment now a shortcut for this is if you don't want to go through the azure portal you can just go straight to web.azuresynapse.net and when you go straight to 
web.azuresynapse.net it'll let you log directly into your synapse workspace without having to go through the azure portal so that's very convenient and a really nice feature there let me walk you through a real quick overview of asa and then we're going to jump into really our first demo here where we take a look at querying data and the integration of all these different awesome components that exist within synapse analytics so first over here on the left hand side we have this navigation pane once again if you've been in data factory and i bet a lot of you have this looks very similar in a lot of ways if i scroll in over here you'll notice we have the home hub these are called hubs the first hub here is the home hub this gives you kind of your recent resources that you've been working with it shows you how you can do a quick ingest of data explore and analyze data and also visualize in power bi i don't really use any of these links on this page however if we go down to the data hub the data hub is where you have your various databases your dedicated pools you can see i've already provisioned one your serverless databases we'll talk about that and then also your spark databases so you can also create a spark database right here that has tables on it as well so all of my databases are going to show up here under workspace in the data hub i'm also going to go over to linked linked is going to show me essentially all of my storage accounts so you'll notice that by default i have my primary storage account right here and my primary storage account is called azure synapse workshop and that has you know all of the data that i'm going to be using for the demos that i do today and then down here i have a file system a container called training and within that it opens up a tab across the top i have a couple of different folders and within those folders i have files right so this is going to be cool you're going to see the integration as we work through the class today we're going to be doing a lot from this data hub because we're going to come back and do some cool stuff with this all right so we'll come back there in just a minute under the develop hub this is where you create your sql scripts this is where you create your notebooks this is where you create your data flows so if you want to create a data flow you want to create a notebook you want to create a sql script you do it right here from the develop hub the other thing i want to point out that's really awesome with azure synapse analytics is the ability to come down here and browse the gallery when i was first learning asa when it first came out microsoft always does a really great job of giving you templates right pre-built templates they do it in logic apps power automate they do it here they do it everywhere and i like to open up those templates and kind of reverse engineer them and figure out how to do things right and so if you click on browse gallery right here it brings us over here to where we can see all of these different notebooks that are pre-built and we can work through them and see exactly how did they connect to those data sources how did they use those packages how did they import them how did they do those things you can reverse engineer them
and you'll notice right here this is a machine learning notebook that uses pyspark this is another one that does some predictions using pyspark and then there's another one here that's machine learning this right here is how to do charting in synapse there's a lot of great examples that use publicly available open data sets so if your company hasn't blocked it you can come in here you can open these up and you can walk through all of these different notebooks there's a ton of them lots of great learning opportunities in here but there's also things out here for data sets so if you want data just to play around with and you're like i don't know where the data is you can come out here and you can grab some of these data sets that are available you can also come over here and look at some pre-built pipelines and maybe get some inspiration from that as well so we're not going to be doing that today i'm going to kind of walk through some things from scratch but if you want to learn more about this and really dive in i would encourage you to check out these different templates right and so the way that we got there again is over here from the develop hub when i click the plus icon where i can add a new sql script or create a new notebook create a new data flow i can also come down here and browse gallery all right you can also import notebooks and data flows and sql scripts into your environment as well and that's what that import means so that is the home the data and the develop hub the next hub is the integrate hub this is essentially where you create pipelines so if you want to create pipelines that are going to copy data from one place to another run a stored procedure filter tables out send an email notification all of that cool stuff we do that right here from the integrate hub once again this workshop is not going to be focused heavily on etl although we will get to this later but we do have tons of other videos on our youtube channel focused specifically on this aspect etl pipelines and data flows so definitely check those out afterwards not now check them out afterwards all right we're going to go over to the monitor hub i love the monitoring that's built in the built-in monitoring could of course be more robust there could be all these other things but just out of the box what you get is really awesome here and so you can go in and do monitoring on your sql pools you can look at your spark pools any sql requests i'm going to be talking a lot about this here in just a moment i kind of served up a bunch of sql requests this morning so we can talk about it you can look at any data flow debug operations that are currently running and turn those off and then you of course have your pipeline runs and then you have your scheduled runs and then integration runtimes so that's pretty self-explanatory there and then under manage this is where you can create a dedicated pool or a spark pool this is where you create connection managers with linked services if you want to set up azure purview which is still in preview this is where you would come and take a look at that if you want to set up scheduled jobs you do it right here under triggers and then we can set up and do some security things here as well and then of course we can also set up git configuration this right here workspace packages this is related to notebooks so you can actually connect and import workspace packages or
different packages you want to use in your notebooks you can import them right here and then they'll be available to your notebooks all right all right so that's a real quick overview like i said we're not going to be diving deep in all these different topics i just want to get you familiar with the interface get you familiar with the integration of all these different tools and how they help us to solve what we were talking about earlier that paradox of analytics because you're going to find out that the technologies that are built into azure synapse analytics really do help us with most of what we need to do for the data platform those different roles all right so the first demo i want to do i'm ready we're going to jump right into the first demo here is going to be using serverless pools all right serverless on demand what is this this is cool i'm trying not to go back to the powerpoint any more than i need to serverlesspools is where we are essentially going to be saying look i want to query data that's in my data lake using sql and i want to be able to scale that out i want to get great performance even if it's big data but i don't want to have to go spin up a dedicated pool and so companies shouldn't have to spin up and leave a dedicated pool running all the time and so if the pool's paused and it's not running and i want to run a query then i have to go and turn the pool on and that pool is going to be running and it's going to charge me x number of dollars per hour for me just to go and run a couple of ad hoc queries right that's not very flexible not very helpful and so serverless sql on demand gives us the capability of kind of just in time reporting it gives us the capability of running a lot of ad hocs and interacting with our data and doing some you know just it kind of exploration of our data without requiring a dedicated pool this is game changing in a lot of ways because before this you would have had to you know had your dedicated pool running which does can cost quite a bit of money per hour that it runs right so this dedicated pool that i've provisioned right here for this workshop this is costing me right now 1.50 an hour it's not a lot of money it's the the lowest tier you can absolutely go with but it's going to cost me every hour that it runs i can pause it and i can turn it off but if i do anything that i was using any scripts any you know power bi reports whatever anything that was using that dedicated p1 database is not going to work until it's turned back on so i'm saving money but it's highly inconvenient so instead the way we can kind of work around that problem is we can use serverless let me show you what this is every asa environment by default has serverless pools available to you it's there you can use it if you want don't use it if you don't want serverless pools cost you five dollars per terabyte of data processed that's can be pretty inexpensive and i'm going to show you that here in a minute when we go back and look at monitoring okay so five dollars per data per terabyte of data that is processed so let's do this let's go out to our data lake and just query some data and take a look at it so i'm going to go back over to the data hub under the data hub i have some data that i've already uploaded to my data lake right so i'm looking at my data lake under linked right here we're in the data hub i'm in the linked category and i'm going to come down to my training folder and i'm going to go into my let's just do training data for this demo here and i'm 
going to right click on this holiday file now i'm going to i'm going to be using this parquet file i'm going to right click and notice what happens this is the integration that really puts asa over the top i love this when i right click on this file i can do new sql script i can create a new notebook from here i can start a new data flow that's connected directly to this file i can also go in and create a new integration data set right here and then once again i'm not sharing my screen thank you matt for that letting you guys down i'm going to click data right here and then linked and i'm right here inside of the notebook all right so now i'm going to go back over here and right click on this file and when i right click on that file i can do these different things i'm repeating it because i wasn't sharing my screen again so we'll do sql script new notebook new data flow and then new integration data set we're going to kind of work through all of these through the remainder of this workshop but for now i'm going to do new sql script and i'm going to say i want to select the top 100 rows all right top 100 rows and that's going to write a sql script that looks like this now a couple of elements i want to point out first of all right here at the top you can see that i'm connected to the built-in sql pool which is serverless i only get billed for what i use right so it's running all the time pretty much kind of it's running it's ready to go i don't have to worry about pausing it or turning it on or scaling it up or scaling it down it's running all the time that's huge that's good and so now anytime i run a query against that i'm going to get billed usually fractions of a penny and i'll show you that in just a moment and i'm going to be pointing to since i'm not really pointing to a serverless database we haven't created one yet i'm just going to be running a query to my master database all right so now i'm going to run this query right here that is returning top 100 rows from that holiday file and so i come over here and click run and it's going to show me the results right here at the very bottom there we go and now we have our results and so we just took data that was in our data lake and we queried the data super easy super fast we didn't have to load it in table if i'm a data analyst i don't have to reach out to my it team and say hey i want to see what's in that file i don't have to download it right i just query the data you will notice that we're using open row set so it's a little bit hard from a syntax perspective to kind of just type that in if you want to run it from scratch but there are a couple little workarounds for that that we're going to talk about okay you can also change this this is sql at this point so you could do aggregations you can do where clauses you can do group buys all of that's available to you now here's what i want to talk about here's what i want to dive into how do we in fact let me take another example real quick that has more data so i'm going to close and i'm going to discard that script and i'm going to go over to my training folder again let me go back one and i'm going to open up the taxi data in my taxi data i have about a hundred megabytes of no i have about a thousand i don't know i have a lot of data in here and i'm going to right click on one of these files and i'm going to do a new sql script select top 100. 
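for reference, the script that synapse generates from that right-click looks roughly like this — the storage account and folder below are placeholders for whatever your primary data lake and container are called:

```sql
-- generated "select top 100" against a single parquet file in the data lake (placeholder path)
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://<yourdatalake>.dfs.core.windows.net/training/TaxiData/<one-of-the-files>.parquet',
        FORMAT = 'PARQUET'
    ) AS [result];

-- the next step in the walkthrough swaps the file name for a wildcard to hit the whole folder:
--   BULK 'https://<yourdatalake>.dfs.core.windows.net/training/TaxiData/*.parquet'
```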
now this is quite a bit of data i think this is about 100 million rows of data so not too much but it's quite a bit if i run this script it's going to run pretty fast right it's pretty efficient because it is using the sql pool capability but what's happening here is that just ran for the one file remember we did it on a single file so i'm going to change this file right here because i actually want to run it for all files in that entire directory so what i can do is i'm pointing to my training folder pointing to the taxi data directory and i'm going to just put a little wildcard in here with an asterisk there we go and now i'm going to run it again i can run it with shift enter so i'm going to hit shift enter on my keyboard let that run and now i just queried all of the data in that entire folder which is once again about 100 million rows of data the only problem with this is this is not a lot of data in terms of big data and we know that with serverless every query that i run is going to cost some amount of money right usually it's not going to cost that much i'm going to teach you kind of where you can go to find this so let's take a look at this cost real quick if i go back over to the monitoring hub we just ran a sql on-demand query if i go to the monitoring hub and i go over to sql requests right here monitor sql requests you'll see all of these different queries that i ran the one that i just ran was this one right here all right 11:45 it does look like it might have been pulling back some cached data there so i need to change that query to get better results but what we should have seen is this query right here this is what we should have seen all right and this is going to return about 84 megabytes of data processed so remember the way you get billed for sql pools on demand is five dollars per terabyte processed which means that if you do the math five dollars divided by one million times 84 this is literally fractions of a penny all these queries are like fractions of a penny so it's not a lot of money for sql on demand this is just one way you can come in here and see which queries are costing you the most money and you can also connect to this from management studio and there are dmvs you can run and there are scripts you can run to tell you exactly how much data is being processed but also you can come in and actually look at the query here so you can dive deeper into the actual query so if you're having a problem with queries that are costing a little bit too much money that are not effectively built you can come in here find the problem and then maybe figure out a better and more effective way of doing that right so this is a great way to kind of look at the cost the other thing i want to show you here for serverless is you can control the cost a little bit so if you're afraid that you're going to have this turned on in asa and you've got a thousand users that are going to be working in here and doing all this and the cost is just going to grow out of control there's a really easy way to control the cost what you can do and i'm going to walk you through this is we can go back over to manage right and so in the manage hub we can go to sql pools and we can find this sql pool right here this is the built-in serverless pool and if i click on this little cost control button when i hover over it it pops up i can click on that and go over to the blade on the right
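a quick side note while we're on cost: the data-processed numbers (and the limits set on this blade) are also exposed through t-sql on the built-in pool — the dmv and procedure names below are what i recall from the microsoft docs, so double-check the exact names before relying on them:

```sql
-- how much data the serverless pool has processed today / this week / this month
SELECT * FROM sys.dm_external_data_processed;

-- rough cost of the 84 MB query above at $5 per TB:  5 * 84 / 1048576  ≈  $0.0004
-- (1 TB is roughly a million MB, which is the "five dollars divided by one million times 84" math)

-- the same daily/weekly/monthly caps the cost control blade sets, e.g. a 1 TB daily limit:
EXEC sp_set_data_processed_limit @type = N'daily', @limit_value_tb = 1;
```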
here i can come in here and i can say hey i want to set the daily limit up to be you know 10 terabytes or 5 terabytes or one terabyte so you can minim at the daily level the weekly level and the monthly level you can minimize the amount of data that's being processed and once you hit that limit it'll stop right it won't let it go any further so you can set these hard limits to make sure that you are managing cost all right so that is cost control really easy to set up and that gives us some insight into working with serverless on demand now here's the deal mitchell all right we have all of our data for the last 10 years literally petabytes of data terabytes of data and it's all in one folder and whenever we query that data even when we add aware clause it's reading all the data that's in that folder is there a better way to query the data so that we don't process as much data we return the results faster and we don't get charged as much money right great question the answer to that is yes and so what we can do is we can store the data in our data lake in a different way we can store it maybe as partition data to where each year is in its own folder and then when we write our serverless queries we can write it directly against that folder and that way we get partition elimination reducing cost and also going in there and improving possibly improving performance um so when you run i got a question here from big picks 2000 so when you run it it charges against your subscription instantly no it does not what it does is it shows you how much it's cost or or what you know how much data you've processed and when you get your monthly bill you can kind of see that that's where it came from so if your monthly bill is you know out of control for serverless and you didn't set up a cost control center and you don't know what's going on you can go back into monitoring look at all the sql queries that were run see who ran them see how big they were and then you get into what i'm talking about right now which is how can we maybe manage cost by improving the queries that we write so good question let me go ahead and open back up the powerpoint and we're going to jump all the way down to notebooks real quick and i think i just have like two slides on notebooks that i want to show real quick here so let me pull that back up before we dive in so i like notebooks i don't do a lot with notebooks i prefer pipelines and data flows but what i do like about notebooks is the documentation capabilities i love that with notebooks we can go in there and we can add all of our documentation right there with the code so as you're stepping through the code you have the documentation that explains what you're doing you can do that i loved it in databricks i like it in azure synapse notebooks i really like that capability so in a notebook you can store your live code you can do visualizations i don't know that i would really do it for that when we have tools like power bi and then you can also do documentation the spark pools is the compute that is used to run those your code whether you're using python or spark sql or you know scala or net all of those notebooks are going to run against a spark pool all right and apache spark cool so that is going to be your compute and i'll talk a little bit about provisioning a spark pool and what the options are here in just a little bit as we continue to go through this why would we use notebooks well you can absolutely use them as a replacement for pipelines and data flows right you 
can definitely use them to build and prepare your data to clean your data up to curate it but a lot of times this is what's going to be used by your data scientists they're going to be using this to do machine learning predictive analytics um just like your dedicated sql pool which is a massive parallel processing architecture designed for big data so is apache spark apache spark also is an mpp system that scales out horizontally so it also works really well with big data and it's been used for big data for a very long time now but that is going to be your compute for notebooks inside of a notebook this is different for synapse notebooks versus you know if you go back over here this is different for synapse notebooks versus data bricks because in data bricks we have python scala spark sql and r in synapse we have python scala spark sql and net so there's a little bit of a difference there but these are the primary languages that you can use inside of your notebook so if you're a python person and you like doing python and building notebooks this is for you right all right so that is a super super fast super quick introduction into notebooks now we're gonna dive in and actually look at a really simple notebook here uh for the most part now a minute ago we had a problem we were querying data 100 million rows it's 88 megabytes of data queried and i i don't even need that because you know if i go back over here and let's go back and look at this sql query if i come over here i might think that if i add a where clause in here and say where you know pickup month equals one so i'm only bringing back january of 2018 which is the data that's in here and i run it i might think oh i'm processing less data but remember this is reading data from files it's not reading data from an index table right so that's not necessarily the case so i run that query we see the results down here at the bottom if i go back over to monitor real quick just to see that query here it is you'll see that it's 108 megabytes so that's not quite what i was looking for it had to filter down the entire table and it's really more than when i was not running aware clause earlier so it doesn't give you quite the expected behavior but if i know if i know that all of my users are only querying the current year or they're really only querying the current month then maybe i can partition my data to improve performance and reduce cost reduce the amount of data processed so how can we do that though well obviously we could try to do this through you know pipelines and data flows absolutely but this is a good opportunity to kind of highlight the capabilities of notebooks as well so let's do that we're going to fix this problem so here's what i'm going to do i'm going to go over to data we're going to click on linked and from linked i'm going to go into my training folder and then we'll go into taxi data all right we're already there now remember this has all of my data for 2018 january february march april and i really want to split this out i want to partition this data into different months so that when i query the data it's partitioned and it improves performance so i'm going to grab one of these files doesn't matter which one and i'm going to tell it that this time and this is it this is the interactivity this is the cool thing about synapse analytics i'm going to tell that i want to right-click and create a new notebook and i'm going to load this file specifically into a data frame in a notebook that's what i'm going to do so i'm 
going to click on that and you'll notice right here across the top look at that we now have a notebook and i'm going to walk through the fundamentals i'm going to walk through the basics of this and then we'll kind of step back and then i got another question from peter do you mind listing the dmvs or queries that show the top cost so there are i don't know if i have a query specifically for that's a good question peter i don't know if i have a query specifically for that i mean you will be able to see all the individual queries and then you can just sort them um matt i sent you an additional resources folder mount if you can put in the link for data processed for serverless pools if you can put that link in there i think that talks a little bit to what peter is asking about right now and i think that'll help all right so under notebooks right here what i'm going to do is i'm going to first point to a spark pool i'll circle back to spark pool in a minute and i'll show you how we provision that but for now i'm going to provision or i'm going to point to the spark pool the compute that i want to use to run this notebook then if i want i can change the primary language i don't want but i could so you can go with any of these languages here for your notebook now what i'm going to do is i'm going to modify this query like we did before and i'm going to grab all of the files from that directory so instead of pointing to a direct file i'm actually going to put an asterisk in here a wildcard and i'm going to pull back all the data so i'm loading all the data from my data lake into this notebook into this data frame so i can do something with it right so that's going to be the first thing that i do here and then i'm going to go ahead and run this all right now this is going to take a minute to run because it's going to have to go out there and it's going to provision the pool it's going to spin up my spark pool which is going to take about a minute and a half or two minutes which gives us plenty of time to now go talk about how do we provision a spark pool so if i come over here to manage the manage hub this is where you come we talked about this at the beginning to create sql pools to provision your spark pools your link services your triggers your run times all of those different things we can do right here and so i'm going to come over here and then i'm going to click on my apache spark pools and i can click new and when i click new right here i can now come in here and actually very easily set up and provision an apache spark pool you give it a name tell it what size you want the nodes to be right so you choose your node size so each essentially each compute node or each computer is going to have this amount of compute associated with it you can tell if you want to do auto scale which means when it's running it'll scale down to the lowest number of nodes if you're not using it but then it has the capability to scale up so you give it a little bit more compute or actually it'll scale up the nodes i'm sorry it'll scale up the number of nodes that it works and you get charged per node so i can say i want to start with three nodes but if i have enough work go ahead and distribute it across more nodes to improve performance right and then it gives you an idea of your estimated cost if you turn off auto scale it gives you a better idea of what your cost is because you're not worried about it scaling up and down all right ariel harper said does selecting top 100 limit how much data is processed it 
depends so i think that if you do top 100 with no where clause it it'll just grab 100 rows so it will limit because it's not it doesn't have to read all the data but when you add that where clause on there i think it reads all the data in it then filters it down then it does top 100 so it kind of depends on the scenario yeah all right so this is where you go you go in here additional settings there's really not you'll notice this this is important and then matt's going to put a couple of links in the chat window for you guys but you can only have there's apache spark 2.4 or 3.1 i think 3.1 is still in preview but it actually doesn't say it here so i'm not sure maybe it's now ga you'll notice that for each version you'll get a different version of python a different version of scala java.net so you know you might have to choose the spark cluster the spark pool that works for the libraries and packages and code that you have right but this is where you might have to make some changes there and that's really it i mean there's obviously a little bit more to this but just giving you a really quick rundown of where you can go to set this up all right so that's where we could go to do a spark pull i'm going to go back over to develop and if i go back over to develop here this should almost be done running so this is going to essentially read all of the data from that folder into a data frame and then we're going to display that data right here we go we have 10 rows of data now what are we going to do with it well we could clean it up right we could get rid of columns we don't want we could use a dot select we could do some counts and aggregations and all that but we don't need to do that our use case just to show off notebooks very quickly is we want to take the data that's in the original folder and we want to load it to a new location we want to partition it out so let's do it here's how synapse notebooks work probably a lot of you have some exposure and experience with data bricks if i want to add a new code cell that cleans up the data frame that does something with it down here at the bottom i can click on a new code cell also i can choose a markdown cell a markdown cell is for documentation purposes descriptive purposes right so we can put some documentation into our notebook in databricks you just switched between you know you switch magic so you put percent md and it made that cell a markdown cell here you choose if it's markdown or code you can always change it after the fact you'll notice that if i come back to this cell right here i can come to the top right and i can change it to a markdown cell i can change it to a documentation cell i can also clear the cell output which i'm going to do so i get rid of all that there and gives me a little bit more room here and then there are some other options here as far as moving it up and down hiding the input and also creating a parameter so we won't get into that today but you can create a parameter cell so let's create a new code a new code cell and what i want to do is i want to take that data frame that has all the data and i want to write that data back to my data lake so let's let me kind of walk you through this so we're going to do df dot here and i'm going to do a write command so i want to write that data i'm also going to partition that data by my pickup month let me make sure it's pickup month yes i'm going to partition that data by my pickup month and i'm going to do that as a parquet file okay so i'm going to do it as a parquet 
file parquet file by the way is kind of the file of choice because the file of choice here is parquet gives you that schema rich capability it also gives you great compression which if you think about sql on demand i'd rather query files that are compressed and better compressed because it can reduce cost as well but parquet files are great you also get better performance with sql on demand with parquet files so definitely something to look at and while i'm going through this daniel asked is spark pool equals data bricks uh so technic they both use apache spark yes they both use apache spark definitely spark pools and map go ahead and put the links for 2.4 and 3.1 in the chat window for everybody spark pools come pre-loaded with a ton of packages so there's definitely differences once you start working with you know the different spark pools but in the in the background yes they're both using apache spark databricks has just been around a little bit longer it has some more proprietary things built into it all right so i'm going to tell it i'm going to load it as a parquet file and i have to tell it where i want to load the data so i'm going to load the data to here let me see right here like this all right adfs training azure synapse workshop dds and then i have to tell it the exact container where i want to put it so i'm going to say training output and then partition data i'm going to check my notes before i run this i'm going to make sure the other thing i might want to do is i might want to say that i want to override it just in case it's already there so it doesn't fail so i'll do mode equals overwrite and then let me check my notes over here real quick so partition by parquet all that looks great training output partition date that's where i want to put it okay so i think we're good with that and this is it i'm now going to use i'm not going to go spin up and do all the work this would take a lot more time to be honest with you to do this in a pipeline or a data flow i feel like would take me a lot more time so within seconds i can just read the data pull it into a notebook and write that and partition that data so that now i can write sql on demand queries directly against the data i want which we could use for power bi reports we can use for dashboards we can connect directly to that data reading less data and improving performance so now i'm going to run this cell i'm going to click right here on the left hand side i'm going to run the cell while we're running that a couple of things if you're using markdown language for documentation the outline is awesome it kind of builds an h1 h2 h3 heading on the left hand side that kind of walks you through a notebook i love that capability right there in notebooks also over here on the left if you want to look at the value of any of the variables you've defined you don't have to necessarily print them right historically speaking we always print the value or do a count or whatever to see what it is you can just open up the variables give it a second to load and that'll show you all of your variables there so i like that a lot as well all right this job is executing and so what's going to happen here in a minute is we're going to be able to go to my training output folder which if you're this is cool right if you were doing this in databricks and writing to your data lake what would you have to do you'd have to leave databricks go back to the azure portal go find your storage account or you can go to storage explorer go find it go into it and 
see if it worked but because all of these technologies are integrated together in my workspace i can just go over to the data hub go to linked and then i can see the files right there i can see if it worked i love that capability of not having to go all over the place just a really cool thing and so this is still running once again it's 100 million rows of data i have a really small spark pool so probably not the best choice for that i think it takes about two minutes but when it's done we can come over to our data hub go into linked it's already open right here by the way um and then i can go into training it's gonna click right here on this tab that was already open and i'm going to go back to training and i'm going to go to training output this time so i'm just kind of switching that up here's the partition data folder so it just got created uh 1203 means we got a break coming up in about 15 minutes so stick with me here and we're gonna go into that and then i got all of these folders created pretty cool right all right so i'm gonna come back over to the notebook let's see if it's done it's done it succeeded it took right at almost two minutes and this is awesome so now watch this we go over to our folder now all of my data is partitioned by month pretty cool pretty easy to do by integrating a notebook right so i'm not a notebook guy right just full transparency i do training on it from time to time i do some stuff with it but that's not my core technology right i've told you that before you go look at my youtube videos my blogs you'll you'll figure that out but i do use it from time to time because i find this to be highly convenient this integration now if i go into pickup month we're going to kind of wrap this up full circle here i can come in here and right click on this file do new sql script select top 100 right and you'll notice that i can come over here i'm doing pickup month equals one i'm going to go ahead and put a wildcard in here so we read all of the files that are in that folder again just like we've done before and i'm going to go ahead and run this against my master database on my serverless compute and i'm going to click run and then when it's done we're going to go back over to monitoring and we're going to see how much data was read all right there we go we've returned everything of course we could do aggregations here we could write it we're going to go back over to the monitor hub real quick and then right here you'll notice that with that query that i just ran it took three seconds and it was 10 megabytes of data processed by the way something you should know serverless queries always do a minimum of 10 megabytes so even if that was a 5 kilobyte query it's going to be 10 megabytes once again that cost is .000.05 so it's very very small fractions on the penny but there is a default that shows up there all right and then you can look at the query but this gives me exactly what i wanted i am getting partition elimination i'm not reading all 100 million rows to return the month of january i'm only reading the data i need to read improving performance and minimizing cost right all right so that's a way of just integrating notebooks into that i've had a lot of fun learning notebooks doing different things that i used to do in data breaks inside of synapse notebooks but very intuitive and what i want to point out once again as we go through this class is the integration between these different services and how easily they kind of communicate and interact with one another 
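to recap that demo in code, here is roughly what the two notebook cells look like in pyspark — the storage account, container and column names are illustrative placeholders, not the exact ones from this workspace

```python
# cell 1 - read every parquet file in the source folder into a data frame
df = spark.read.load(
    'abfss://training@azuresynapseworkshop.dfs.core.windows.net/taxidata/*.parquet',
    format='parquet')
display(df.limit(10))

# cell 2 - write it back out partitioned by pickup month, one sub-folder per month
(df.write
   .mode('overwrite')              # replace the output folder if it already exists
   .partitionBy('pickupMonth')     # assumed name of the pickup month column
   .parquet('abfss://trainingoutput@azuresynapseworkshop.dfs.core.windows.net/partitioneddata/'))
```

once that finishes, each pickupMonth sub-folder can be queried on its own from serverless sql, which is exactly where the partition elimination comes from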
all right so that is notebooks let me kind of glance at my notes and make sure i'm not missing something there all right so let's do this demo real quick i'm going to show you something else so and some things i'm going to have to choose based on time we've been doing a query against our data lake using this open row sets syntax right an open row set is just not the most intuitive who wants to type out their storage account name and the the file and all of that every single time so what we could do is we could actually create an external table which is not a it's not a real table that stores data but we could create an external table that stores all the metadata information about the underlying files and then we can query the regular external table as if it were a real table in a database so i could say select top 100 star from taxi data right how do we do that i'm going to walk you through that because this is cool this makes it easier to write queries against your tables it's more intuitive also when you connect from power bi power bi is going to recognize those tables so you can connect and use serverless directly against using power bi against data lake as well and it becomes a little bit more intuitive and easier there as well so let me kind of walk you through this real quick the first thing we need to do is build a serverless database this is super easy so if i go over to the data hub and go back to workspace right here i can click on the plus button to add a new resource and i'm going to tell it right here i want to create a sql database this is a dedicated or serverless pool this is not a regular azure sql database so do not get that confused but i want to create a serverless pool so i'm going to click right here on the data hub click the plus button to open that up and then i'm going to come over here and i'm going to choose between serverless and dedicated and then i'm going to choose serverless of course because that's what we're doing and i'm going to call this my serverless and it's my serverless pool so i'll go with p1 all right serverless p1 and then we'll go ahead and click create and it's done so that happens super fast now i have a database and with the database i can now load um tables on that i can create tables we're going to create an external table that is once again it's just a metadata structure on top of the underlying files all right so let's do this real quick and then peter i saw something from you i'll i'll take a look at that here in just a moment all right so let's create an external table now we could write syntax here i could say create external table create the file format create these different things but that's a lot of work so there's an easier way to do it as you can imagine so if i wanted to create an external table on let's say pick up month 1 i can do that so we're going to go back over to our data lake once again really nice really cool that we can do that and i'm going to right click on one of these files here and then i'm going to say new sql script and i'm going to say create an external table so it's going to actually give me a ui and generate the script for me so i don't have to write it so that's what we're going to do we're going to create a new external table here all right this right here is my serverless p1 database that's where i want to put it and i'm just going to call this one taxidata and then i'm going to say use a sql script so we can read the script together and kind of look at what's happening there and then we'll click create 
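before reading through the generated script, here is a tiny sketch of the two bookends — the serverless database we just created and the query shape we are working toward; the database name matches the demo and taxidata is the table about to be created

```sql
-- a serverless database only ever holds metadata objects (external tables, views),
-- never the data itself, so creating it is instant
CREATE DATABASE serverlessp1;

-- the goal: once the external table exists, the verbose openrowset pattern
-- collapses to an ordinary looking query
SELECT TOP 100 * FROM dbo.taxidata;
```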
all right so there's a couple of things happening here first you have to create essentially a file format and this file format is just going to be parquet and that's going to be used in the create external table script and then you also have to create an external data source and the data source we're creating is just going to point to our data lake and that primary container so super easy once they're created the first time you really don't have to create them again you can reuse that same file format and that same data source in other scripts you run whenever you're creating external tables so we're going to take advantage of these i don't like the generated names though so let me change them to something easier i'll call the data source adls primary because i'm going to use that later and i'll give the file format a more intuitive name like parquet and because i'm changing those i have to come down and change them where they're referenced below as well so that's the parquet format and that's adls primary there we go no long generated name that i have to type every time all right so what this is going to do is create an external table which is just metadata on top of the files it's not loading any data this is not like loading data into a dedicated pool it's just metadata on top of those files and then i'm going to be reading the data from this exact location now i can change this i don't want just that one file i want everything so i'm going to do star and then star so within the partitioned data folder grab all the folders and within those folders grab all the files this is going to get everything not just pickup month one and then when it's done go ahead and select top 100 the whole generated script looks roughly like the sketch at the end of this paragraph all right built in serverless script looks good let's go ahead and run that and then let me look at what peter was saying all right we got a couple of them i'm still unsure about how this adds value over using adf to push data to my azure sql database so peter azure sql database is not built in and you don't have to apologize to me i didn't develop it i just work with it azure sql database is not something that we can yet link directly in here and intuitively work with you still have to essentially go to management studio or go back to the azure portal i get exactly what you're saying but if you're using the data lake and you want to quickly create data sets or quickly see what was written without having to go somewhere else that definitely adds some pretty good value to my knowledge there's no goal right now for microsoft to deprecate azure data factory i'm sure they would love to not have to manage two code bases but i don't know of any plans to deprecate it so i don't see any problem with being in data factory but if i don't need any features that are specific to data factory like the ssis integration runtime for lift and shift then if there's no real reason for me to be there i would probably just go with synapse because long term i would imagine they're going to add more capabilities and more integrations in here that's going to make it cleaner so that might be a good reason just to plan for that but that's just me thinking out loud
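for reference, here is roughly what the generated script looks like after those renames — the storage url and the column list are illustrative, and the wildcarded location is the star slash star pattern described above

```sql
-- a sketch of the generated objects; account/container names and columns are illustrative
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'parquetformat')
    CREATE EXTERNAL FILE FORMAT parquetformat
    WITH (FORMAT_TYPE = PARQUET);

IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'adlsprimary')
    CREATE EXTERNAL DATA SOURCE adlsprimary
    WITH (LOCATION = 'https://azuresynapseworkshop.dfs.core.windows.net/trainingoutput');

CREATE EXTERNAL TABLE dbo.taxidata (
    -- the ui infers the full column list from the parquet file; abbreviated here
    vendorID     INT,
    tripDistance FLOAT,
    totalAmount  FLOAT
)
WITH (
    LOCATION    = 'partitioneddata/*/*.parquet',  -- all months, all files
    DATA_SOURCE = adlsprimary,
    FILE_FORMAT = parquetformat
);

SELECT TOP 100 * FROM dbo.taxidata;
```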
we got another question here using notebook and generating partition data aren't you paying for the increased duplicated storage yes yes you are the same way as if you were copying data using a pipeline so that's a great question if i were going to take old archived data historical data create a copy of it um in a partitioned format then i would go back and get rid of the old data right because now it's all loaded so once i verify it's all been loaded yeah go ahead and get rid of the old data because yes you would be paying for it all right so watch this right here here's what we're going to do we've just created this table now if i want to query data from my data lake from taxidata instead of doing this let's go back one instead of doing this or even further back where we have to use open row set and we're querying all the data and doing a where clause now i can do this i can say select star or let's do top 100 let's do select top 100 star from whoops i did shift enter my fault i always do that from taxidata make sure i'm pointing to serverless here because i got to point to my database now and now instead of having to type open row set right i just query the table directly now the table doesn't have anything in it it's just a structure on top of the data that's in my data lake but it makes it a lot easier to query so if i highlight just this piece of code and i run it we're going to get the top 100 rows from the taxi data there it is so we're back to just simple so you see how it simplifies it right and then of course you could start trying to join tables together these tables are going to show up in power bi desktop we're obviously not going to dive that deep once again if you want to dive deep into that you want to take one of our boot camps where we really dive into all this but this is where you can take advantage of external tables all right so this is going to be a great time to actually take a quick 15 minute break let me make sure looking at my notes that that's where i want to end i think it is that puts us in a great spot lots of great questions lots of interactivity we're going to take a 15-minute break give you a chance to stretch your legs and i'm going to read through some of the questions on break and see what we can answer as soon as we get back and then we're now going to transition into talking about pipelines data flows as well as power bi integration maybe a little bit more here there's one more demo i'm going to think about but i'm gonna look at time all right i'll see you guys back here in 15 minutes enjoy your quick break all right welcome back everybody hopefully you had a good uh quick break there i'm going to jump in real quick and before we get started back up i want to talk about our on-demand learning obviously it's a very cost-effective way of staying up to date with the latest technology when i first started working in four pragmatic works about nine years ago a little over that i was lucky to work with a guy by the name of anthony martin loved that dude super smart but one of the things i learned from him in the two years we were working really closely together on a project is every morning he would spend 30 minutes before his day started reading blogs reading articles reading books watching videos whatever it was just staying relevant like just keeping his skills fresh and that was one of the things i picked up from him which i'm glad i did because today's world technology is changing so fast there's so many new things and you really have to work to keep 
up with technology right well for less than a dollar a day so grab yourself a cup of coffee buy our training you can keep yourself relevant with with structured training that is designed specifically for you and so anytime we do these events we have you know usually i think on every event so far we have these big deals where you can really save a lot of money on our on-demand learning and so i would definitely recommend that you take advantage of that it ends on sunday right so sign up for it if you're watching this video record it on youtube sorry it's probably not 20 you know 247 a day all right so we're going to dive right back in and what we're going to be taking a look at now is etl in synapse we're going to be taking a look at etl this is going to be for those of you with a data factory background very similar now my goal here is to kind of highlight the integration that exists in synapse analytics that we don't necessarily get in data factory but i get and i understand that a lot of people on here are probably like hey what's going on with azure so you want to learn a little bit about what a pipeline is and what a data flow is so i'm going to try to balance the two as we work through this and so we're going to go through pipelines data flows and then we're going to get into the power bi integration to kind of round off the day all right so let's do it we're going to jump right in we're going to cover the fundamentals here synapse resources pipelines data flows data sets linked services these are the different resources that we use inside of azure synapse analytics to build our pipelines and really to clean and transform that data data flows being the really transformation tool that we have available to us within a pipeline we have activities right an activity essentially defines what action we want to perform on our data copy the data run a stored procedure run some kind of web activity whatever it might be run a notebook run a data bricks notebook right so we can perform activities within a pipeline and a pipeline is very similar to like an ssis package it is our orchestration tool where we control the flow this runs then this runs then that runs right we can control the flow of the op the kind of order of operations there and so you'll see that there are tons of different activities that are available within asa and within data factory pipelines now the copy activity is probably going to be the most common that you use a lot of times we're going to want to copy data from different data stores dynamics you know sharepoint or on-prem and we want to get that data into maybe our data lake we want to get into our dedicated pool we want to get it into our sql database in fact one of the questions that i did see was can we use a notebook inside of synapse analytics to write data to a sql database of course you can yeah it's not as intuitive and as easy as it was to write data to the data lake because the data lake and the notebook are talking to one another they're in the same workspace it's integrated together it just works you don't have to you don't have to pass in a bunch of credentials and all that to get it to write to the data lake because it's all kind of communicating within that workspace right so with the database you'd have to provide all the credentials and stuff but yes you can absolutely write to it and then we could of course use copy activity within a pipeline to do the same thing and so we can write to whatever data stores are available within our synapse environment 
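since the question about writing from a notebook to an azure sql database came up, here is a hedged pyspark sketch of what that looks like over jdbc — the server, database, table and login are placeholders, and it assumes the sql server jdbc driver is available on the spark pool, which it normally is

```python
# a hedged sketch: writing a spark data frame from a synapse notebook to an
# azure sql database over jdbc; every name below is a placeholder
jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "database=mydatabase;encrypt=true"
)

(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.InternetSales")
   .option("user", "sqladminuser")
   .option("password", "<fetch from key vault rather than hard coding>")
   .mode("append")
   .save())
```

unlike writing to the primary data lake you have to supply the credentials yourself, which is exactly the extra friction mentioned above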
there's the get metadata activity if you followed any of my youtube videos you know i probably have three or four videos that highlight this activity this activity has a lot of really cool things you can do like get a list of all the files from a container check to see if a file exists and then loop over it with an until activity waiting until the file does exist so lots of cool capabilities that you can do with this the hardest thing that people really struggle with i think right from the beginning with pipelines is how do i read the activity output so i have an output i have an activity it's outputting information how do i read that how do i use it in other activities and so that's where a lot of my youtube videos and our training that we do our boot camps and stuff really focus on dynamic you know expressions and parameters and all that kind of stuff stored procedure activity you use this to essentially invoke and run a stored procedure the limitation with this is that stored procedure activities in pipelines and synapse and data factory do not allow you to output parameters now maybe that's changed right in the years that i've been working with it didn't i don't even try anymore so if somebody's figured out a workaround or whatever then maybe it's changed but in my experience it doesn't allow you to do that and then of course you're limited to those data sources the sql database data warehouse and the sql server database if you want to run a stored procedure and you want to output values from it you use the lookup activity so you can use the lookup activity to retrieve a data set and return it to your pipeline in either synapse or data factory this can be any data factory source you can execute stored procedures you can run and execute sql scripts you do get the output parameters and those can be either a scalar value or those can be an array or a list alright so that's going to be the lookup activity and then of course we have the execute pipeline this has a lot of applications just so many different use cases here some workarounds and different things you can do but this allows us to build out that parent child design pattern that so many of us are used to doing right back in ssis back in whatever etl tool of our choice we kind of compartmentalize the code a little bit to make it reusable which is even easier to do in pipelines than it was with on-prem like ssis to make that code more dynamic and reusable across multiple files multiple tables you get the work that needs to be done in the parent you pass the parameters into the child and dynamically change the tables and change the files that you're processing so execute pipeline really really cool activity here and then of course we have the for each activity this allows you to in the pipeline i like to think of the pipeline kind of like a control flow in ssis in the pipeline you can take the 4-h activity and loop over an array or a list processing each item in that list one item at a time and then you can perform a set of activities for each item so grab the first file you know write that information to a database copy the file archive the file update the audit table saying it was done and then go get the next file and do the same set of activities again so very simple you've probably used it in ssis or in sql or net right everybody on this call has probably used some version of a for each in some technology the filter activity in the control flow takes an array so maybe you use the get metadata activity to get a list of all the 
objects in a container then you use the filter activity to filter it down to just the files you want so that will take an array filter it down based on an expression and then it outputs its own array a separate array that can be used in your pipeline now the one thing that we need to be aware of when we're working inside of synapse analytics is kind of creating those different objects that we need creating connection managers slash link services so if you've worked with ssis or on-prem sql server we called it connection managers well in the cloud inside of synapse and data factory those same connection managers are called link services this stores the connection information and this stores the credentials for that information here and then we have integration runtimes i'm not going to dive into that here in this three hour workshop but integration run times essentially are the compute the cpu the i o the memory that is used to move the data to perform the different activities that you do within your pipeline right so that's how you get charged once again just like you saw with serverless on demand if you're not doing a lot of crazy stuff in there it a lot of times it's going to be pennies on the dollar it's super cheap to run these pipelines but but i'm going to show you where you can go into the monitoring and actually look and get an idea of the cost of these pipelines as well so all this integration that we get inside of synapse we're going to take a look at all right we are can use we can use and i would recommend you can use folders within your environment to try to organize it right so i i don't know what the suggestion would be i kind of wish there was a better way to organize it than just folders but in you know on-prem you'd create a solution and you'd have projects within that solution and so you might have three different ssis projects that had different purposes but they were all in the same solution but they were very separate and very distinct here we're using folders nothing wrong with that but that is how you kind of organize it if you're working inside of synapse and so a lot of times when i'm doing training um for a private customer and all the students are working in the same environment i have them create their own folder with their name in production you'd probably have this broken down by like department or um different projects that you're on right so could be a little bit different the way that we set that up but folders are the way that we organize really inside of you know synapse analytics and data factory now link services define the connection information so that data factory or synapse can connect to the data source those link services and those data sets that we create the pipelines the data flows all the objects and artifacts that we create are globally available to other pipelines and data flows right so it's not like ssi i'm going to keep making comparisons to ssis okay i know most of us have that background it's not like integration services on-prem where you could build a data set that was scoped to just that package right you don't scope things to just a pipeline in synapse analytics they're globally available to all of your other objects which is a great thing because it makes them more reusable and we need to think about that how do we parameterize this do we import the schema do we not right we're getting into a little bit more intermediate stuff here but we think about that because a lot of times we start building pipelines building data flows 
building data sets and we hard code everything then we start to realize wait a minute too far down the road probably i could have parameterized all this and i could have used one data set in place of 30. so data factory is very powerful synapse pipeline is very powerful with that kind of reusability and dynamic capability that we get all right so that was kind of the basics of pipelines i think next is data flows so we'll come back to that what i want to do is dive back into azure synapse analytics and let's build a simple pipeline i'm going to build a simple pipeline that loads a table in our dedicated pool right now normally what we would do the step one is we need to make sure that we have connection managers here so i'm going to go over here to my manage hub right here and i'm going to go over to link services this is where you create new connection managers and you'll notice that i currently have two connection managers these connection managers were created for me when i provisioned my asa environment the first one points to my asa my azure synapse analytics workspace so that's the server and then i can put in the name of the database because you can have more than one database so the database is parameterized so that works for all of my databases on this server the other link service that was created is actually pointing to my data lake remember when you provision azure synapse analytics you have to give it a primary data lake so it knows what the data lake is and it goes ahead and it creates that link service for you so we can leverage what's already there however if you were working in here and you wanted to connect to your sql database that was a question we had earlier what i would do is i'd come in and click new up here at the top and i'd create a new connection and then over here we're going to circle back to power bi in just a little bit over here you can look through all the different connections that are available right so if you want to connect to any of these create a connection manager and pull back data from any of these different connections you can you just walk through the process so if i were connecting to you know sql database i would grab azure sql database here click i can click it twice give it a name and then i would come down here and i could enter it manually if i want and i would say hey my database is pragmatic works public oh i'm not typing it oh there are them pragmatic works it's delayed.database.windows.net i could type out the server type out the database name put in my credentials test connection and create that connection now that doesn't connect me to a specific table or a specific view or a specific um stored procedure right it connects me to the database so that is the link service now within a linked service we have data sets so for this example that i'm going to do we already have two link services that have been created for us so the next step is i want to connect to a specific file in my data lake and i want to load that data to a specific table in my dedicated pool so we need to create two data sets now for the first data set we're going to take advantage of the integration that exists within azure synapse analytics always coming back to integration so i'm going to go over to the data hub and from the data hub here i'm going to go over to linked all right and i could just go up here to the top it's still open from before so i'll go over to my file system and let me think for this example here let's keep it with a smaller example we'll do 
internet cells actually so i'm going to go to training data and in my training data here i have an internet cells parquet file i'm going to right click on that and this is going to be my source i'm going to take the data that's sitting out there it's sitting in my data lake and i want to load it into my dedicated pool right so i'm going to do a new sql script actually i'm going to do a new data set so i'm going to do a new integration data set and this just makes it a little bit easier normally i would have to go to this hub i'd have to go to my data hub click on new click on new integration data set and i'd have to walk through the process choose the data store give it a name this makes it just a little bit easier so i'm going to click right there that's going to open up a new window over here and then i'm going to give this data set a name so i'll say azure it's uh the data is in azure it's in my azure data lake store it's in my azure synapse the name of the storage account is azure synapse workshop so i'm gonna put that in there and then the name of the file is internet sales that's the way i've always named all of them no specific reason i think it just helps me out a lot i always put where the data is on-prem azure wherever it might be that i put the provider type the data store type i put the name of it and then i put if it's a data set the table name the file name the folder name whatever that might be i'm going to select my format it's going to be a parquet file and then i could import the the schema if i want or i could not import the schema totally up to me i'm going to go ahead and import the schema from the file that i'm connecting to all right and then i'm going to go ahead and click ok so this makes it really easy just to create that data set without having to go through the normal process and we're done now you can always come back to the data hub and you can go back in here and you can modify a data set you can add additional data sets you can do that anytime you want if i come and look at this data set right here i want to point out a couple of key features all right one i talked about this before you can parameterize these data sets so i'm kind of hard coding it and since i hard coded it i went ahead and imported the schema but if i were making this a dynamic data set that could really grab any file from my my data lake what i would do is i would come over here and i'd create a parameter i'd give it a parameter like file name and then i would map that file name parameter to this box right here all right once again i have youtube videos on that three-hour workshop we don't have time to dive that deep into one specific topic but that's how you would do it and then i would clear the schema because you know each file would have its own schema so i'd clear that out to make this more dynamic and this is what we call late arriving schema the schema shows up at runtime we don't know exactly what it's going to be but data factory in synapse is incredible at being able to work with that whereas you know other tools we've used in the past legacy tools like ssis really were so metadata rigid we couldn't do that not easily so this is pretty cool stuff but we're going to go with the connection here that's good all right number one the next thing we need to do is build a data set for the table we want to load so i'm going to go and this time we are going to walk through the normal steps i'll go up to the top by the way i should have published a long time ago uh i i normally would never 
let this get beyond like one or two so be very careful with this uh because you are working in a web development environment if you you know if this crashes or something happens you're losing all your notebooks all your scripts there's no auto recovery here so make sure you publish and save your work early and often don't do like me all right so i'm going to go up here to the top let me go right here i'm going to click right here on integration data sets and we're going to create a new integration data set and this time i want to connect to synapse analytics specifically to my dedicated pool so we'll go to synapse analytics right there and then i'll click on it again to go to the next screen and then i'll give this a name this is going to be azure again the data source is in azure it's a sql pool so i'll type that out and then the database name is my dedicated p1 my dedicated pool 1 and then the table i'm connecting to is going to be let's just call it my sales table all right i'm going to choose my link service real quick right here so that was the one that was created for me when i provisioned azure synapse analytics and then the table doesn't exist yet so i can't really choose one from the drop-down because i haven't created it so i'm going to let the pipeline create the table for me at runtime so another you know another capability here for making this dynamic you've got 100 files you had to process that all needed to go into 100 different tables can you set that all up to be dynamic and automatically create the tables the answer is yes and of course we've had customers that have had that need and data factory synapse handles it really really well so i'm going to give this a name we'll call this dbo internet sales all right i don't know if we just don't have a lot of questions but i haven't seen the chat over here from matt really popping anything new so either i messed it up or we don't have any questions here that's kind of odd and then we'll call it internet cells that's fine now i don't want to import the schema because i can't really import the columns and import the data types when the table doesn't exist yet so we have to go with none and then we'll go ahead and go back down to the bottom and click ok all right and so now we have created data set number two so we've created two data sets here one that connects to my internet sales file one that connects to the table i want to load data into doesn't yet create or doesn't yet exist and so now i want to create a pipeline that moves that data but before i go any further i'm thinking back right now in my head to when i did a three hour presentation on dax and my model crashed on me right before we went on break and i did not save it and that was a lot of problems so i'm gonna go ahead and publish everything that i've done so far so that if something were to happen i do not lose the work that we've done now there's one more thing we have to do with this data set we're not done with the data set because this data set that connection manager was parameterized we actually have to tell it watch this so on this data set right here azure sql pool dedicated p1 we have to tell it what database we're connecting to so that's the only thing with this one because it's parameterized you could use any database on this server you have to tell it which one so i believe i named mine dedicated p1 i'm going to check though we're going to go back over to workspace yeah that's it all right so we give it a database name since we use the default that was 
provisioned at runtime and then that's it for my data sets now we're going to build a very simple pipeline real quick i'm going to show you how to copy data and how to automatically create a table through the process of copying data and that's going to generate quite a bit of conversation around sql pools so as you can imagine if we want to build a pipeline we're going to go right over to the icon that represents a pipeline which is the integrate hub and from there we're going to create a new pipeline and then i'm going to give this pipeline a name something very simple like load internet sales let me get my face off the screen there we'll call this one load internet sales all right and then we'll remove the properties pane to close that up by the way there's been a really odd bug with synapse where it seems to cut off half of the screen like this and you can't see that half of the screen you can't save you can't publish it is not an awesome experience and i figured out a workaround on that bug if you click on the code view right here you flip to the code and you flip back it fixes it so if you come across that problem that is the workaround because i'm teaching this so often with a bunch of different students it's unfortunate to tell somebody to refresh the web page and start over but i figured out if i click that button specifically and come back it refreshes the view so that's extra that's free no charge all right so moving on it's been happening sporadically it's happened to me a few times i know it's happened to matt who's on the call as well so there does seem to be some oddity there maybe it's with my browser i'm using chrome i don't know but there you go all right so let's build a pipeline we've created a new pipeline and when you create a new pipeline you're going to get a list of all of the activities over here on the left that are available for the pipeline once again if you've been in data factory the look and the feel and the experience are the same i'm going to go in here though and under move and transform this is where i can copy data we're going to pull in the copy data activity and we're going to set that up by simply saying take the data from the source and drop that data into the destination and then derek has a question here so i know that the chat is still working are you able to comment on how azure synapse pipelines compare with informatica pipelines that is one thing that i actually cannot comment on i don't know much about informatica my experience with informatica was probably six or seven years ago when it was on-prem and we were trying to move customers off of that to ssis and we were trying to build software to do that automatically so that's my limitation with it i don't know anything about informatica beyond that so how they compare on performance and cost i have no idea unfortunately good question though remember i'm going to do a q a after this is over so i'll do some research on that derek and see if there are some resources out there that i can talk to in the video afterwards i would love to respond to that but right now at this moment that's not something i can give any kind of context on all right so with the copy activity we're going to go ahead and set this up real quick now if you've never been in here before i'm going to be going probably a little too fast but i will try to explain things as i go
when i click on this activity it's going to give me the properties window at the bottom and i see that i have two required settings one on the source and one on the sink so i have to identify what is my source data set and what is my destination data set where am i reading data from and where am i writing data to so i'm going to click on source and point right here to the azure data lake and my internet sales file that's going to be the first thing now you're going to notice when you click on it lots of options appear it can be a little bit overwhelming but there are a lot of really good options in here so you could say hey just take the file path that's in the data set or i know i'm pointing to a specific file but i changed my mind let's use a wildcard here and look for folders or files that match a specific pattern so you can use that same data set from before and add a wildcard to it you can also add in the name of the file down here with additional columns we won't get into that but there are a lot of extra capabilities in here under the sink the destination i'm going to switch this over to my sql pool specifically i want to load the data into that sales table that doesn't yet exist so i'm going to grab that right there and then i have to choose a couple of things here including what type of copy method i want to use when you're loading a dedicated sql pool if you have some history with massively parallel processing with pdw or back when it was azure sql data warehouse for the longest time it was always polybase because polybase takes advantage of that mpp architecture and is able to parallelize the inserting and loading of data but we now have something called the copy command which i personally prefer it's a lot cleaner a lot easier it handles nulls better it handles columns getting moved in and out a little bit better and it doesn't have the row size limitations that polybase does so generally i prefer the copy command there's a small t-sql sketch of it after this paragraph however in this instance i am going to use polybase and the reason is that one of the limitations of the copy command here is that you cannot use it with create table as select so when you're creating a new table the copy command is not the right way to go that's the one con i can think of off the top of my head if i try to auto create this table that doesn't exist the copy command will not work so i'm going to switch this over to polybase and use the polybase method here and then we're going to click right here on auto create table to create the table in my dedicated pool if it does not exist if it already exists this will not create it it's just going to load it and the information over here probably speaks to exactly that the pre-copy script is pretty cool too so for example i could come in here and say truncate table internet sales we could truncate the table if i wanted or i could actually drop table internet sales and it'll get recreated every time because we already know auto create would do that and that happens before this activity runs it's a pre-copy script so there are different things you can do here that are pretty cool
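for reference, here is a hedged t-sql sketch of what the copy command looks like when you run it yourself against the dedicated pool, alongside the polybase style create table as select pattern that makes polybase the better fit when the table does not exist yet — the storage url and the staging external table name are illustrative

```sql
-- option 1: the copy command, loading an already existing table straight from the lake
-- (the url is illustrative; add a CREDENTIAL clause if you are not relying on aad passthrough)
COPY INTO dbo.InternetSales
FROM 'https://azuresynapseworkshop.dfs.core.windows.net/training/trainingdata/internetsales.parquet'
WITH (FILE_TYPE = 'PARQUET');

-- option 2: the polybase style pattern, creating the table as part of the load;
-- ext.InternetSalesStaging is a hypothetical external table over the same parquet file
CREATE TABLE dbo.InternetSales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.InternetSalesStaging;
```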
and then just because of time i'm not going to dive too much deeper into this for mapping i'm going to let it do auto mapping we're taking advantage of that late arriving schema where we don't have to force the mapping it gives us that flexibility and that dynamic capability and so i'm going to run this there's a couple of ways i can run this let me talk about that real quick and i'm going to add a little bit extra in here i could run this in debug what debug says is i want to run the latest version of this pipeline right here inside of azure synapse analytics not what's been published but the latest version right here if i run it from a trigger and i click right here and say trigger now that's going to run the latest version of this pipeline that's been published to synapse analytics which right now is nothing because we haven't published it at all so it wouldn't run anything i like debug when i'm working here because i can get the output right here i can see it in the pipeline without having to go to the monitoring tab and i can test things out the other benefit of debug i told you i'd add something new in here is that you can use breakpoints with debug so very similar to ssis remember how in ssis you could disable tasks in your control flow well you use breakpoints to do that here watch this if i go to the general category and grab let's say the wait activity because we don't have to set it up and configure it i'm going to grab that and then i'll grab another one so pretend like these are actually doing something but if i have these wait activities in here and i add a precedence constraint so if this runs successfully then run this and then just for the video i'm going to say if this fails run this other thing so if it fails we're going to drag that over so that's on the recording for you and so now if this runs successfully it's going to run this one and if it fails it runs this one let's say that this pipeline has been failing and it's been failing on this activity right here and this activity only takes 30 seconds to run but if it runs successfully the next activity takes like two hours to run and i'm trying to test this out i'm trying to fix this i don't want to run the activity that takes 30 seconds and then have the other activity run for two hours so what i can do is click this nice little red button right here that most people completely ignore and pretend doesn't exist and that is a breakpoint what it means is i can now debug this and it will only run the activities up until that breakpoint and then it stops the pipeline so i can test out this copy activity and make sure it's working then once i know it's working i can deploy it again and run it in production so if i hit debug it's going to run the copy activity but it will not run these so that's another benefit of that debug capability if you want to use it now like i said before as long as i don't have an activity selected so if i click in the background i can immediately see my output and i can wait for this to run all right so let's see here i already did that one bond said any idea if synapse will integrate with the r programming language as well i do not have any idea on that but if you email me bond i can and matt will put my email in the chat
window i can reach out to somebody on the microsoft team and find out for you because i was actually bummed by that as well um so definitely reach out and let me know and i'll i'll find out what i can for you all right so i got an error message that actually was not planned but i like error messages i always like error messages because they show us how to debug right so we're going to click right here on the bottom i know that my pipeline is running we used polybase yes so it should work let's see what happened we're going to click on the error message so we're going to click right here at the bottom on this activity let's find out why it failed oh that is interesting is it a firewall rule so let me zoom out real quick and read through this a little bit quicker so database check the link services connect to make sure cannot connect to azure workshop so yes so this is because i locked down my server and i locked it to only my computer when it's trying to do this load operation it's trying to use essentially virtual machines in the background that don't have permission to connect this is why i did a 30 minute video on managed virtual networks to get around this problem without enabling all your ip addresses but the reason it's failing is because the virtual machine within azure that's trying to do this operation the integration runtime it doesn't have permissions it doesn't have the ip address so i'm going to go back over here and i'm going to go into my resource group and i'm actually just going to turn on all ip addresses real quick so let me go back over to resource groups and then we'll go into the workshop here and i'll go in here and i'm going to turn on that firewall rule all right and so let me do allow all and then we're going to do you know 0.0.0.0 terrible practice but we're in the middle of the demo we're going to make this happen so 255 dot got it all right i'm going to add that in save and i'm just going to allow all ip addresses so this will run so like i said i put in mine and that did not get the job done so that should that's going to take a minute to update apologies for that i did not even think about that when i was applying that firewall rule because normally i set up a virtual network for that let's see still updating all right so that'll run when we come back and remember what it's going to do it's going to automatically create a table on our database and to verify that we're going to come over we don't have to go this is the nice integration again i'm going to keep pointing to this this is the nice integration capability of working inside of synapse analytics i can just come over to the data hub i can look at my dedicated pool right here and refresh it and i'm going to be able to see that the database is there or the table that i'm creating is there when we get done here in just a moment right now it's empty we got to run it again let's see it says it's updated the firewall rule we're going to run this in debug again and then we'll give that just a moment did it say it failed already oh that's a few hopefully that's a few seconds ago that was way too quick on the failure there all right one more time here debug one more time all right i'm gonna save it and i'm gonna reload it real quick let's do this turn that off we're going to clean this up just a little bit you got to have at least one demo that goes bad in every presentation apparently it's a rule there's a unwritten rule out there and then we close this thing up and we're going to open it back up real quick 
That firewall update takes a minute to apply. While we wait, remember what the pipeline is going to do: it will automatically create a table in our dedicated pool. To verify that, we don't have to leave the workspace, and this is the nice integration I keep pointing to with Synapse Analytics: I can go to the data hub, look at my dedicated pool, refresh it, and the table I'm creating will be there when the run finishes. Right now it's empty, so we have to run it again.

Okay, it says the firewall rule is updated, so let's run this in debug again... did it fail already? That was way too quick. One more time: debug. All right, I'm going to save it, reload the page, and clean this up a little. You have to have at least one demo that goes bad in every presentation; apparently it's an unwritten rule. Let me close these tabs and open the pipeline back up, because a failure that fast doesn't even give us a useful error message.

So this is from before; I'll close the other tabs and run it live from Synapse Analytics again. What is going on? Oh, the database name. It's complaining that the database name is null. Remember that in the dataset we had to specify the database name, and unless I reverted it, I thought I had typed it in earlier, so apologies for the little derailment, but it is good to see the error message. This is actually why I don't normally use the built-in linked service that gets created for Synapse: you have to type the database name in as plain text instead of picking it from a dropdown, so it's easy to make a mistake, which is probably what happened here. Let me set it back to dedicated p1. Interesting. Now we can run it again; the error message was actually pretty good, I just missed it.

So now it's running, and I'm glad we hit that error, because it shows how we debug and how we read the error messages, even if it was a bit unexpected. If you've been in any of my classes you can attest that I always create a new linked service rather than use the built-in one, and somehow that database name got removed, but the broader point stands: because you're typing it in as plain text, you can put the wrong one in, so I'm not a big fan of that. Let's refresh; it should be done... it's done. Now, with that integration in Synapse Analytics, we can check our database right now without going to Management Studio or anywhere else: I come right here under the workspace, refresh my database, and there is the new table we just created, InternetSales. Pretty cool, right? This is the integration we keep talking about: you see the results right away in this unified development environment instead of bouncing between all the different tabs like we did when everything in Azure was a separate service, which was a little frustrating to say the least. This cleans that up a lot.
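If you'd rather double-check from a SQL script instead of the data hub, a quick sanity check against the auto-created table might look like the following; dbo.InternetSales is an assumed name standing in for whatever the copy activity generated, so adjust it to what actually shows up under your dedicated pool.

    -- Hypothetical table name; use whatever the copy activity's auto-create step produced.
    SELECT COUNT(*) AS RowsLoaded
    FROM dbo.InternetSales;

    SELECT TOP 10 *
    FROM dbo.InternetSales;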
So that was a very simple pipeline. In this demo we looked at more than just running the copy activity: we set up precedence constraints, green for success and red for failure, which is how you add conditional logic to the orchestration of a pipeline just by clicking and dragging; we looked at breakpoints, even though that run never actually happened because my database name was null, so we fixed that; and we had the firewall rule issue that I had to go fix, although I do not recommend doing what I just did, because it is not a best practice. Go watch the roughly 30-minute video I have on YouTube about virtual networks; it explains why you want to set that up properly. But we're in a class and we have to get this done, so I set it up the quick way.

Keep in mind that if you come down to the General category there are a lot of other activities, and our boot camp goes through practically all of these in depth: how they all interact, ForEach loops, dynamic capabilities, parameterization, expressions, all of that. So if you really want to dive deep into the different capabilities here, reach out to us; we'd love to do that training for your team. (I also just realized I had my webcam on screen instead of the activity list when I scrolled through those, strike three for today, so I'm going to skip the PowerPoint and keep moving.)

One more thing in the pipeline before we move on: under the Synapse category you can pull in a Notebook activity and run the notebook we created earlier, and it doesn't require any awkward authentication. If you've ever done this with Databricks, you have to go generate a token, come back, create a linked service, and jump through a couple of hoops, and then to see whether your notebook ran successfully you have to go back over to the Databricks environment. Here the notebooks are integrated, using the same Apache Spark architecture in the background: you run it, you open monitoring, and you see whether it ran or failed and can dive deeper after the fact. It's another one of those small integration features inside Azure Synapse Analytics.

Now let's transition over to data flows, which we build inside the Develop hub. The slide I'm skipping says that data flows exist for transformation logic: this is where we transform the data, because you're really not going to transform data in a pipeline; a pipeline is for orchestration, setting up the workflow of what runs and when. Everything you build in a data flow gets converted to Scala and run against an Apache Spark cluster that's provisioned for you in the background, so you get that scale-out, massively parallel processing capability without having to spin up or manage a cluster yourself. But you do get billed for that cluster based on usage. It seems to be pretty cost effective in general, but it needs to be part of your consideration. Somebody asked earlier about ETL versus ELT: do I extract, transform, and then load, or do I extract, load the data into my dedicated pool or my Azure SQL database with a pipeline, and transform it there afterwards? With the dedicated pool or the database you're paying a flat rate for the hour whether it's transforming data or not, so there's no real additional cost to transform it there, whereas transforming in a data flow adds the cost of the Spark compute. You have to weigh that. I like building out data flows, but if it's too cost prohibitive, I'm going to build that logic in a stored procedure instead. Something to absolutely consider.
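As a rough illustration of that ELT alternative, here is a minimal sketch of the same kind of product-level aggregation written as a stored procedure with CTAS inside the dedicated pool; the dbo.InternetSales source and the object names are placeholders for this example, not something built in the session.

    -- Hypothetical ELT version of the transformation, running on the dedicated SQL pool's own compute.
    CREATE PROCEDURE dbo.LoadTotalSalesByProduct
    AS
    BEGIN
        IF OBJECT_ID('dbo.TotalSalesByProduct') IS NOT NULL
            DROP TABLE dbo.TotalSalesByProduct;

        -- CTAS is the usual pattern for materializing results in a dedicated pool.
        CREATE TABLE dbo.TotalSalesByProduct
        WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX)
        AS
        SELECT ProductKey,
               SUM(SalesAmount) AS TotalSales,
               COUNT(*)         AS TotalTransactions
        FROM dbo.InternetSales
        GROUP BY ProductKey;
    END;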
Let's create a new data flow. It gives us a nice little pop-up, and I'm going to keep this very simple. The first thing I want to do is turn on Data flow debug and click OK. This spins up a small Spark cluster for me so I can preview data right here in the data flow designer. This was not here when data flows first came out, and I remember the painful experience of trying to learn in here without it, so I love this capability. But I am going to get charged for that debug cluster running in the background. It's small and not very expensive, but don't leave it running: if you've got a hundred developers, they shouldn't all have this on all the time; turn it on when you need it.

The first thing you'll notice is that when you create a data flow you don't get a pane or toolbox listing all the transformations, which is different from a pipeline and different from a data flow back in SQL Server Integration Services. So I start by choosing my source: click in the little dotted box, and down at the bottom give it a name. We'll call this one InternetSales, because we're pulling data from the internet sales data, and I'm going to walk through some of the different transformations and how to use them, because it's pretty cool. I'll choose my dataset next, and since I've already created a dataset that points to the internet sales data, I'm going to reuse it; that saves time and still lets me showcase the capabilities I want to look at.

Now this part is actually kind of important. Allow schema drift says: I know that if I define the schema, the schema might change, columns might get added or removed, and if that happens I don't want this to fail, because we're building this to be flexible, dynamic, and responsive. However, if you want this to be like SSIS, rigid about metadata, that's okay too: come down to the bottom and click Validate schema. You really choose one or the other. I'm going to leave Allow schema drift on; it's fine for what I'm doing.

Next, under Source options, you'll notice that, similar to a copy activity, you can override the default file with a wildcard path. So if you had a bunch of files with the date appended to their names and you only wanted the 2021 files, you could pass in something like 2021* and grab just those. You can also choose a column to store the file name in: if my destination has a file name column, I can bring the original file name along and it will automatically get mapped.
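Those two source options, wildcard paths and a file-name column, are settings in the data flow UI, but if it helps to see the same ideas expressed in code, the serverless SQL pool supports similar behavior through OPENROWSET; this is just an analogous sketch with a made-up storage path, not something the data flow itself generates.

    -- Analogous idea in serverless SQL: wildcard over the 2021 files, plus the source file name per row.
    -- The storage URL is a placeholder.
    SELECT r.filename() AS SourceFile,
           *
    FROM OPENROWSET(
             BULK 'https://yourstorageaccount.dfs.core.windows.net/data/internetsales/2021*.csv',
             FORMAT = 'CSV',
             PARSER_VERSION = '2.0',
             HEADER_ROW = TRUE
         ) AS r;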
Another cool capability: you can archive the files immediately after processing them, right here in the source, without adding an additional activity to your pipeline. Normally if we wanted to archive the data we'd add a copy activity to move it to an archive folder, or a delete activity to remove it, but you can do that as part of the data source, which is pretty cool. The Projection tab shows all of the columns; you can always reset it or import the projection if it isn't there. And that's about it for the data source; there's also a Data preview tab, and I'll highlight that capability on the transforms in just a moment.

So how do we add transforms? We've connected to internet sales, and we're going to walk through four or five transformations, then pivot over to Power BI and talk about that integration, which is pretty cool as well. I click the plus button at the bottom right of the source, and that shows all of the transformations available. You'll see a lot of the usual suspects you're used to from SSIS on-prem: lookups, conditional splits, joins, derived column (one of my favorites, you use it all the time), a select transform, which is an easy way to get rid of columns that have worked their way into your data flow that you don't want, aggregate, surrogate key, pivot, unpivot, and window.

Let's do the window function first. If you have a background in SQL Server, you know window functions. I don't write a lot of SQL anymore, your career changes over time, but I used to love window functions: row number, rank, and the aggregate functions used as window functions. Here's the scenario: I have all of these transactions, and on each transaction we have the product, the customer, and the date. I want to find my total sales for each product, but I don't want to group by product key and lose the detail; I want to keep the original detailed rows and add the aggregate for each row's product alongside them. So I'm going to create a window transformation that does that. Let me show you how easy this is: I bring in the Window transform, set the incoming stream to InternetSales, and call it aggregate by product. The Over clause is really kind of like your GROUP BY clause: get all the transactions for each product key and sum across them. So in the Over settings I'll just grab ProductKey, just like that.
Then I go over to the Window columns tab, and this is where I add my aggregation. I'll call it total sales by product. If you haven't seen this before, we're now going to dive into the expression editor. The list of functions and capabilities inside data flows is very rich, lots of functions, lots of capabilities, and you're about to see that. I click in the box, it gives me the pop-up just below, and I click into the expression builder. I'm going to write a very basic aggregation, just a sum. I can filter the function list over here; when I go to Functions there are a lot of them, it feels like hundreds, but I'll filter down to sum, click it to add it to the expression, and then choose the column I want to sum from the Input schema tab: remove the filter, search for SalesAmount, and there it is. Very simple: sum of sales amount. If you wanted additional aggregations, like a count of rows, you could create those extra columns right from here with Create new, so you don't have to keep closing and reopening the builder; I'm not going to, but you could. I'll hit Save and finish.

Real quick, so we can see the result, I'll go over to Data preview and preview the output. Back in the day when data flows first came out, you'd build your data flow, run it from a pipeline, wait six or seven minutes because it took five or six minutes to provision the cluster, have it fail on something super minor, come back and try to fix it, run it again, wait another six or seven minutes, and it would drive you absolutely insane. Now, with data flow debug, you can test things out as you go and make sure they're working; I love this capability because I remember when it wasn't available. I still have my very detailed, granular rows, but if I scroll all the way to the right (oops, wrong button, let me hit escape; nothing broke) I can see that for product 310, the row I'm on, we had about 586,000 in sales. If we were lucky we'd get a different product in the top 100 rows to compare against, but we're not, so we'll see the comparison in a minute when I show the aggregate function in contrast to this. So that's the window function, just like the aggregate-with-an-over-clause pattern in SQL. I'm going to delete this transform now that we've seen what it does, and we'll check whether the aggregate function gives us that same roughly 586,000 for product 310.
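For comparison with the SQL pattern being described, here is roughly what that window column computes written as T-SQL; the table and column names are placeholders standing in for the demo data.

    -- Window aggregate: every detail row is kept, with the product total repeated on each row.
    SELECT ProductKey,
           CustomerKey,
           OrderDate,
           SalesAmount,
           SUM(SalesAmount) OVER (PARTITION BY ProductKey) AS TotalSalesByProduct
    FROM dbo.InternetSales;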
I'll add a new transform, and this time we'll just use the Aggregate transformation and aggregate the data by product key. Unlike the window function, we don't get all 26 or 27 columns back; we only get the columns that are part of the aggregate, just like writing a SELECT statement in SQL: whatever you group by and whatever you aggregate is all that shows up in the final result set. So we go from 26 columns down to 2. I group by ProductKey, so product key now shows up only once per product, and the default name is fine. Then I switch over to the Aggregates tab and create a total sales column, just like before: click into the expression, open the expression builder, type sum (it doesn't pop up for me this time, so I'll grab SalesAmount from the Input schema tab, same as before), and click Save and finish. Exactly what I did with the window function: sum of sales amount, super easy. Now let's preview the data again. We should get a list of all of our products, one row per product, with the aggregated total, and you'll see that for product 310 it's around 586,000, so the window function was working exactly the way we wanted: it gave us all the detail rows with the aggregate, and then we decided that wasn't really the shape we wanted and brought in the aggregate function instead.

Let's build on this before we try the rank function: I'll add total transactions too. Inside the same transformation I can click the little plus button to add additional columns to my aggregate, so I'll add total transactions, open the expression builder again, and to find the total number of transactions for each product I just need a count over that product's rows. I'll search for count and grab count with any expression; then I have to choose what column to count, and I want to make sure it's a column without null values, because this is not a distinct count, it counts every occurrence. ProductKey has no nulls in this data set, so I'll count that, then click Save and finish. One more quick preview, which sets us up for the next demo: now I can see the product key, the total sales, and how many transactions occurred, which means we can determine the average sales amount per transaction. We'll do that next with a derived column.
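Mapping the aggregate transformation back to SQL in the same way, it behaves like a GROUP BY and collapses to one row per product, in contrast to the window version above; same caveat that the table and column names are placeholders.

    -- GROUP BY aggregate: one row per product; only grouped and aggregated columns survive.
    SELECT ProductKey,
           SUM(SalesAmount)  AS TotalSales,
           COUNT(ProductKey) AS TotalTransactions   -- counts every non-null occurrence, not a distinct count
    FROM dbo.InternetSales
    GROUP BY ProductKey;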
To add the derived column, we do it the same way we added the first transformation: click the plus button and grab the Derived Column transform from the list, then set it up. I'm going to create a new column here, although with a derived column you can also replace an existing value. A very common design pattern in ETL, specifically data warehouse loads, is that if a lookup didn't return a value and the column came back null, you replace it with a hard-coded value like -1 or "Unknown", and you can absolutely use the derived column in that same capacity right here in an ASA data flow. But I'm going to do something a little different and create a new column called average sales per product: the average amount per transaction for each product. The expression is very easy: from the input schema I take total sales and divide it by total transactions, which gives me the average. That's it; Save and finish, and then once again I'll take advantage of that data preview capability and refresh it. We'll give it a chance to load... and over on the right it gives us our average sales per transaction for each product. Pretty cool: we now have that information, and we leveraged a derived column to get it.
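The derived column is just simple arithmetic on the two aggregates, so in SQL terms it's one more expression in the select list; again a sketch with placeholder names rather than the exact demo objects.

    -- Derived-column equivalent: average sales amount per transaction for each product.
    SELECT ProductKey,
           SUM(SalesAmount)  AS TotalSales,
           COUNT(ProductKey) AS TotalTransactions,
           SUM(SalesAmount) / COUNT(ProductKey) AS AverageSalesPerTransaction
    FROM dbo.InternetSales
    GROUP BY ProductKey;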
We did have a question that Matt sent me: can a Synapse pipeline trigger be invoked from an external app, like ASP.NET? In other words, can you invoke a Synapse pipeline from other applications? You can, but it's not quite as intuitive or easy as it was with Data Factory. One of my favorite examples, and I actually have a YouTube video on it, is calling a Data Factory pipeline directly from an Azure Logic App. Logic Apps have over 400 connectors, so you can trigger one from practically anything, like a record being inserted into a database, and because of that you can effectively trigger your pipeline from anything, which is a really nice workaround for Data Factory. But right now you can't connect to Synapse from Logic Apps, and I haven't tried executing a Synapse pipeline from a web app. I would imagine the capability is there, I believe it is, but I know I've struggled with even something as simple as executing it from a Logic App, so I'd have to look into it to be sure; I think I said yes earlier, and I'm pretty sure, but not one hundred percent, because I just haven't tried it the way I did with Data Factory.

All right, so we've aggregated the data, we tried a window function, and we've done a derived column; now I want to rank the data and find our top-selling products. Once again we click the plus button to add a new transformation into our data flow, and I'll add the Rank transform (keeping an eye on the clock here). I'll give the rank a name, product rank, and that becomes a new column that gets added into our data flow. Then it asks what you want to rank by, and I'm going to rank by total sales in descending order: if you have the most sales, you're number one, congratulations. Then we do a data preview, check that we have the new product rank column, and verify it's correct; once again this is taking advantage of that Spark debug cluster that's been provisioned in the background. And here we go: total sales of 586,543 at the top, ordered from high to low. This looks one hundred percent correct, so that's great.
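And the rank transform corresponds to a RANK() over the aggregated totals; one more placeholder-name sketch for comparison.

    -- Rank products by total sales, highest first (rank 1 = best seller).
    SELECT ProductKey,
           SUM(SalesAmount) AS TotalSales,
           RANK() OVER (ORDER BY SUM(SalesAmount) DESC) AS ProductRank
    FROM dbo.InternetSales
    GROUP BY ProductKey;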
We could take this further, but it's getting close on time, so I'm going to transition over to Power BI. The last thing I'll say about this data flow is that if you wanted to run it, you'd go back to a pipeline, add the Data flow activity, point it at the data flow you want to run, and execute it from there. This data flow isn't actually finished, because I'm not going to go through that final example, but all I'd have to do to finish it is add a destination and load it somewhere: to a file, which is super easy to create, or to a database. I just wanted to walk through the different data cleansing transformations and capabilities and how they compare to what you may be used to in other ETL tools.

So let's talk a little about Power BI. First I'm going to turn off data flow debug, since it doesn't look like I'll need it again and I don't want to get charged for something I'm not using, and I'll skip the one-slide PowerPoint. Essentially, and we showed this at the very beginning of the workshop, Power BI integration works in two directions. One, we can connect directly to our Power BI workspace from right here, build new reports, and edit existing reports, which once again brings the data analyst into this space: they can run those on-demand SQL queries, maybe even do some fairly basic but enterprise-grade ETL with pipelines and data flows, because I think it's pretty intuitive, and on top of that build and edit their reports right here in the Synapse workspace. And two, going the other direction, we can connect to Synapse from Power BI Desktop, which we'll do in a few minutes.

Let me show you how the workspace connection is done. I'm going to delete the data flow and go over to the Manage hub, because to connect to Power BI the first thing you have to do is create a linked service, and then you have to publish the changes. When I click New, right at the top in bold it says Connect to Power BI, so we'll make sure we do that. Then I give it a name; I'll base it on one we've done in the past, Dashboard in a Day, and it wants underscores, which is funny, because the auto-generated names right here have dashes but mine can't. Then from my Pragmatic Works tenant I'll choose one of the many workspaces we have out there, Dashboard in a Day, and click Create. That simply creates a connection directly to our Dashboard in a Day workspace. The next thing is to publish; whoops, my bad, my fault, I slipped up, so let me click Publish. We could spend about an hour talking about that little error message, so we're not going to; I'll click Close. We've published it.

Now watch what happens: if you go back to your Develop hub, you'll see Power BI, and when you go into that workspace you'll see all of the Power BI datasets that exist. I can click New right here to create a new Power BI report on a dataset, I can delete them from here, and I can open my Power BI reports, like the DIAD final report, edit them, and change them. This is a super quick demo, but it shows that capability for the data analysts; we're getting the analysts, the engineers, and the scientists all involved and engaged in this one ASA workspace. And from what I've seen, a lot of your more advanced users, your power users, are starting to take advantage of these enterprise features like pipelines and data flows because they're right there: I'm a Power BI person, I'm in here, I need to get this data moved from over here, I don't want to do it all in my Power BI report, is there an easier way? Oh, I can just copy the data right here and set it up on a schedule that runs every day. That is pretty cool, and you can do that. And if you want to create a brand new Power BI report from scratch, you can do that right here from your Synapse Analytics environment too.

Now there's one more thing I want to show you before I open Power BI Desktop and connect to the external table we created earlier and to the dedicated pool, and that's over under Monitoring. We've looked at monitoring quite a bit today, bouncing back and forth, but if I come back over here I can see all of the executions for my pipelines, and you'll notice this little pop-up that says Consumption. That gives you an idea of how many data integration units were used to run that pipeline. Looking at our run (keeping in mind we ran it in debug mode), it says the total for data movement activities was 0.033 and the total number of activity runs was one, because we only had one activity in there. You can take that information, and you'd probably have something a little more robust, go to the pricing calculator, punch those numbers in, and get a really good idea of what the cost would equate to if, say, you ran this a thousand times because you were looping over a thousand files.
This gives you some insight that, quite frankly, we did not used to have inside either Azure Data Factory or Synapse Analytics.

All right, the last thing I want to show you before we start wrapping up is how to connect to your Synapse Analytics workspace from Power BI and take advantage of what we've been building. First of all, how do I find the server name? You get two server names when you connect to ASA: one for SQL on-demand (serverless) and one for your dedicated pool. This is another huge benefit of leveraging SQL on-demand: the dedicated pool might not always be running, it might be paused, but if you're accessing data and building reports from your data lake with that serverless compute, that endpoint is always up and running. So I'll go over to the Manage hub and into SQL pools, and I'll show you how to quickly get the SQL endpoints so you can authenticate and connect from Power BI Desktop. If I want serverless, I click Built-in, and over on the right it shows my workspace SQL endpoint, which I can copy out. That's the server I want to connect to. You also have to remember the database name, which I think is serverless p1 or something like that.

I'll take that over to Power BI: Get Data, More, search for synapse, and grab Azure Synapse Analytics, previously known as SQL Data Warehouse, then click Connect. That gives me the connection screen: I put in the server name, type serverless p1 for my database name, and then I can choose Import or DirectQuery. The table I created earlier was the taxi data, which is about a hundred million rows, and I do not want to attempt to import a hundred million rows, so I'm going to do DirectQuery and connect to the data in my data lake directly. You could write a SQL statement here; I actually haven't tried that with OPENROWSET, so I'm not sure, because I always go through the external tables instead, it makes things a little easier, and you'll see why on the next screen: once I click OK, it's going to show me a list of all of my external tables after I connect. I'll use my Microsoft account and sign in real quick, so I'm using Azure Active Directory here, approve, and connect. And once I'm connected to that server and database I can see all of the external tables that exist, and there it is, the taxidata external table. So once again, creating that external table on top of the data lake files makes my connectivity here so much easier, and I'm able to leverage the compute power of serverless: if you're doing DirectQuery and taking advantage of the serverless pool, that's going to be pretty cool.
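For reference, the two access patterns mentioned here, external tables versus OPENROWSET, look like this when you query the serverless endpoint directly, which is ultimately what that DirectQuery connection runs against; the schema name, file path, and file format below are assumptions, since only the taxidata table name comes from the demo.

    -- Querying the external table the report connects to (schema name assumed).
    SELECT TOP 100 *
    FROM dbo.taxidata;

    -- The same files can be read ad hoc with OPENROWSET; the path and format are placeholders.
    SELECT TOP 100 *
    FROM OPENROWSET(
             BULK 'https://yourstorageaccount.dfs.core.windows.net/datalake/taxi/*.parquet',
             FORMAT = 'PARQUET'
         ) AS rows;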
This screen shows me the data; I select the table and click Load at the bottom (my face is probably in the way of the button). Because we're doing DirectQuery it doesn't actually load the data, but it gives me access to it, and we can start building reports. So this is pretty cool, very easy, very intuitive. We didn't build a Spark database today, but if we had, with tables created there from our notebooks, we could connect to those from Power BI too; all of this is integrated together.

Before we start wrapping up, I also want to show that we can connect to dedicated pools: that massively parallel processing pool we provisioned, which we can pause, resume, scale up, and scale down. There's an incredible video out there, a couple of years old now, on dedicated SQL pools and Power BI and the "better together" story: how you can leverage key capabilities in your dedicated pool, like materialized views and result set caching, together with key Power BI features like aggregation tables, to get really great performance for your analytics. I believe that was presented at Ignite; it would take me a while to find the exact video at this point, but it's worth hunting down.

Connecting to your dedicated pool is the exact same process; the only difference is the server name, because the serverless endpoint has an extra piece built into the name and the dedicated endpoint does not. To grab it, come back over to the Manage hub, go into SQL pools, click on dedicated p1, and copy out the workspace SQL endpoint on the right. Now flip back over to Power BI: once again Get Data, Azure Synapse Analytics, Connect, put in that server name, type the database name, dedicated p1, choose DirectQuery again, click OK, and connect with the same credentials. I'm using Azure Active Directory credentials here, but remember there is SQL Server authentication on the pool as well, so if you haven't set up Azure Active Directory you can definitely take advantage of SQL authentication.
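The dedicated pool features called out in that "better together" story, materialized views and result set caching, are both plain T-SQL; here is a minimal sketch, assuming an InternetSales table in the pool, with placeholder object and database names.

    -- Result set caching is enabled per database; run this from the master database.
    -- [YourDedicatedPool] is a placeholder for the dedicated pool's database name.
    ALTER DATABASE [YourDedicatedPool] SET RESULT_SET_CACHING ON;

    -- A materialized view pre-aggregates the data; the optimizer can use it automatically
    -- for matching queries, which pairs well with Power BI aggregation tables.
    CREATE MATERIALIZED VIEW dbo.mvTotalSalesByProduct
    WITH (DISTRIBUTION = HASH(ProductKey))
    AS
    SELECT ProductKey,
           SUM(SalesAmount) AS TotalSales,
           COUNT_BIG(*)     AS TotalTransactions
    FROM dbo.InternetSales
    GROUP BY ProductKey;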
Justin, nice to see you, always keeping up. Justin asked: what is the advantage of DirectQuery against the ASA dataset versus an Azure SQL dataset? Do you mean Azure Data Lake by that? There's about a 30-second delay here, so I might not pick up your clarification right away. If you're doing DirectQuery through SQL on-demand, the serverless database, you're getting the compute power of the way that serverless, on-demand compute works, so that gives you a bit more computational power, which is a good thing. I assume what you meant was DirectQuery through SQL on-demand versus just connecting directly, because the other point, and maybe this isn't what you were asking, is that we could also connect straight to the data lake, directly to a folder, from Power BI, and that works as well. What I wanted to show here, because probably most people know the direct option, is that we can connect using the serverless compute and get that big-data processing power without having to have dedicated p1 spun up. If I've got that wrong, Justin, I think you have my email, so feel free to send me a note, clarify the question, and we'll talk about it. And what do we have here: InternetSales, the table we created earlier in the class, which I can now bring into Power BI Desktop the same way we did with serverless earlier today.

Surprisingly, I'm actually not going to run until 2:05, which I'm very impressed with myself about, because quite frankly I usually have way too much material; this time I made sure to scale back and not go too deep into a lot of these topics. Once again, I cannot encourage you enough: we have been insanely busy this year with Azure and Azure Synapse training, doing these private boot camps, and we wanted to get this event out there because there are a lot of questions and a lot of confusion around ASA. Hopefully this cleared it up in a lot of ways: it's largely the same technologies you're already working with in Azure, they just work better together now, with that collaborative, unified development environment for all of your people. If you want further training and want to dive deeper into this, please reach out and contact us; we would love to do private training for you. And the sale on the on-demand learning is available until sometime on Sunday, I don't remember the exact time.

I'm going to hang out for a couple of minutes; Matt, if you'll feed me some good questions, we'll go through them, but I know it's been a long two hours and forty-five minutes, and you guys have hung in there, so we appreciate that. I'm also going to do a follow-up Q&A video. I'm on vacation next week, but I'm going to try to get it done anyway, so any questions I wasn't able to get to today, and there are probably a few I have to research, because as we talked about, technology is growing at a rapid pace, I'll cover in that video and we'll rehash some of this. Thank you, Matt; I don't have access to the chat, so if you'll put my email in there for everybody, I'd appreciate it, and feel free to reach out if you have any questions.

Let me look at the questions we have. When calling Synapse from Power BI Desktop, does it count as traffic from Azure to on-prem? No, because you're going to publish this out to the service, and then they communicate with one another directly; that's actually a great question.
Once you publish the report and schedule the refresh, you do not need to set up a data gateway when connecting to Azure from Power BI; you can still set one up for more enhanced security or what have you, but it's not required, which a lot of people find empowering and very freeing.

Erica asked: is there a way to connect Power BI to the Spark pool? I looked into this on the Power BI forums, and there is a way to connect to a table if you create one on your Spark pool: if you use your notebook to create a database and a table, there's absolutely a way to do it. Somebody in the community demonstrated this and talked about the whole shadow table thing, Simon Whiteley, I think, from Advancing Analytics; he has a really good video on it. He has a bunch of videos, so you'll have to dig through them, but if I remember correctly there is a way to do it.

Next question: if we have big data for a dashboard, should we use Databricks versus Synapse? That depends. This is a big, highly debated topic among people who have been working with Databricks for a long time; Databricks has added a lot of cool and pretty awesome features that make their product a little bit better, but if you know anything about Microsoft, they have a very agile approach and update their products very fast as well. At the end of the day, both are using Apache Spark in the background, so you're going to have to make that decision as an organization. If you're starting from scratch and you're using all these other Microsoft technologies and you want it all to integrate together, I would probably recommend starting with Azure Synapse and using the notebooks and Apache Spark there. If you already have workloads out there in Databricks, the migration might be too much work, I haven't tried it, so that might be a situation where you say, for those existing reports we're going to leave them in Databricks, and for the new stuff we might do it from Synapse Analytics. Good question.

I think that's a good point to end the meeting today. I want to thank everybody again for joining us; I always have a great time doing these events and wish I could do them every month, but we have to rotate between different team members. Let us know: is there a topic you really want to hear about that we haven't done? We have a bunch coming up, but if you have one you think would really hit home, let us know, we would love to do that as well. Thank you, everybody, for joining us today; we will see you next time.
Info
Channel: Pragmatic Works
Views: 5,018
Rating: 4.9771428 out of 5
Keywords: Azure, Azure Synapse, Cloud, Learn with the nerds, azure synapse analytics, azure synapse analytics tutorial, intro to azure synapse analytics, data warehousing, Big Data Analytics, Data Science, SQL Pools, Azure Data Lake, ETL, Data flows, Power BI integration, Synapse notebooks, pragmatic works, mitchell pearson, training, microsoft
Id: lLrjaVdBuM0
Length: 169min 10sec (10150 seconds)
Published: Thu Oct 07 2021