RETIRED - Replacement in description - Microsoft Azure Master Class Part 10 - Database

Captions
Hey everyone, welcome to Part 10 of the Azure Master Class, and this is all about databases. Not exactly version 6.5, which is what this t-shirt shows, but the various SQL Server and other database offerings, and there are a lot of them now. As always, if this is useful, please go ahead and hit like, subscribe, comment and share; it takes a lot of time and work to put these videos together.

So, databases. I want to start off by quickly reviewing the types of data we'll commonly see in an organization and how we can think about storing that data, and then actually walk through the different database offerings we'll commonly see in Azure. I do have a quick joke: three database admins walk into a NoSQL bar; they leave after a couple of minutes because there were no tables. Terrible, I know. And I'm not a database admin, I just thought this t-shirt worked pretty well for the video.

So, relational. Here the data is stored against some kind of schema: we predefine the structure of the tables, we have rows and columns, we have a unique key, and we typically normalize the data. What normalize means is we take this big set of data and maybe break it up into separate tables that are each focused on a particular aspect. This helps us de-duplicate, standardize, and remove any slight differences in how the data is stored.

So with a relational database we have a set schema, a predefined structure of the columns and the type of data each holds, so the attributes are in columns, and then we store the data as rows, or records. Each record might be a person, the office they work at, books, shopping items, whatever; we store each one across an individual row. And again, we might have multiple tables holding different aspects of the data; we normalize it to increase the efficiency of how we store it. But when we think about the storage, a record goes across this way, so when we want to read and work with the data, we read a record at a time and see all the attributes of that data.

That's obviously one of the most common types we're ever going to see. There are lots of ways to query this; SQL, and T-SQL, is one of the big ones, and because we have that strong format enforced by the schema, it's very easy to query the data: we know exactly what columns are available to us. And those keys and relationships let us join tables together, which enables very sophisticated queries.

Now, we'll further break this down into online transaction processing, OLTP, which is typically where we're running our day-to-day transactions (we're inserting, updating, deleting; it's running our systems), and online analytical processing, OLAP, which is more about getting results over a big set of data. For OLAP it's actually common to denormalize, to bring things back together, because while normalization is fantastic for the efficiency of storing and working with the data, if I just want a big set of results across a huge amount of data, having all those separate tables actually slows things down. So you might see it denormalized, things brought back together.

So that's what we think of as relational: that structure of columns and rows.
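Just to make that normalization and join idea concrete, here's a tiny, made-up sketch: two normalized tables linked by a key, joined back together at query time. The server, database, credentials, and all the table and column names are placeholders I invented, not anything from the video.

```bash
# Two normalized tables (people and offices) linked by a key,
# then a JOIN pulls the rows back together at query time.
# Server/database/credentials are placeholders.
sqlcmd -S myserver.database.windows.net -d mydb -U myadmin -P 'MyP@ssw0rd!' <<'SQL'
CREATE TABLE Office (OfficeId INT PRIMARY KEY, City NVARCHAR(50));
CREATE TABLE Person (PersonId INT PRIMARY KEY,
                     Name     NVARCHAR(50),
                     OfficeId INT REFERENCES Office(OfficeId));

INSERT INTO Office VALUES (1, N'Dallas'), (2, N'Seattle');
INSERT INTO Person VALUES (1, N'John', 1), (2, N'Abby', 2);

-- the key relationship lets us join the normalized tables back together
SELECT p.Name, o.City
FROM   Person p
JOIN   Office o ON o.OfficeId = p.OfficeId;
GO
SQL
```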
Then we get a few other types: the non-relational, NoSQL options. First, document storage. This could be self-describing, something like a JSON document (which is actually how Azure works under the covers); it could be XML, YAML, there are many others. The point is that the document has tags, some identifiers, that explain the data, and it's stored as a document in the particular database solution. A key point is that different documents in the same collection could have completely different formats; they don't have to have the same tags, there's no set structure, it is self-describing. We commonly think about things like MongoDB and CouchDB using this. So document storage is just some self-describing document: JSON, XML, YAML, and many others I could leverage.

Then we have key-value. Key-value, as the name suggests, stores the data as key-value pairs. The key is how we search for and identify a particular entry; the value is completely opaque to the underlying key-value store, it really knows nothing about it. In that value I can store anything I want, and many things will use key-value to break data down into pieces I can quickly go and find.

Then we have columnar. The goal of this, initially, is that it looks very similar to relational: once again I have records, I have columns, I have entries. So you might ask: what's different? It really comes down to what we're trying to do with the data. Remember, in relational we store the data as records, going across. With columnar we actually store it in columns; that's how it's stored and how it's processed. If I have a workload where I just want to quickly get the sum of all the values of one common column, in a row store I'd have to read in all of the records and pick out the one column I care about. Because columnar stores the data as columns, I can just read that one column from storage and operate on it, skipping everything I don't care about. Also, because a column stores values of the same type, I get really good compression, and I can even nest other structures within that columnar structure. So this is very, very powerful when I want to operate on particular attributes very quickly. That's the goal there.
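As a quick illustration of that columnar idea, SQL Server has columnstore indexes that store a table column by column; the point of this sketch is just that the aggregate only has to touch the one column. All the names are placeholders.

```bash
# With a clustered columnstore index, the table is stored
# column-by-column, so an aggregate over one column only reads
# that column (and it compresses well). Names are placeholders.
sqlcmd -S myserver.database.windows.net -d mydb -U myadmin -P 'MyP@ssw0rd!' <<'SQL'
CREATE TABLE Sales (SaleId INT, Region NVARCHAR(20), Amount DECIMAL(10,2),
                    INDEX cci CLUSTERED COLUMNSTORE);
INSERT INTO Sales VALUES (1, N'West', 10.50), (2, N'East', 7.25);

-- only the Amount column actually has to be scanned for this
SELECT SUM(Amount) FROM Sales;
GO
SQL
```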
And then graph. Graph is all about relationships. With graph we have nodes, where a node is a particular entity, and then we have relationships between them: edges. Nodes could be people (John, Ben, Abby) and the edges would be the relationships between them: daughter, son. In a company the edge could be "works for" or "managed by"; some nodes could be offices, some could be people, with a "works at" edge between them. Why this is really useful is that I can quickly interpret relationships between things. I could query: quickly show me everyone who works for John, and it can just go and find the edges that match. And we have attributes, on both nodes and edges, to give me information about them. So graph is super useful when I care about the relationships between different things. Think Facebook, for example; you can imagine that uses a graph database underneath, because what it really cares about is the nodes (people with Facebook accounts) and the friendship edges: quickly show me everyone who's friends with John, and it just shows all the nodes with that edge. It's super powerful for that.

There's as much depth as I want to go into on that. If you're interested in more detail about what you can do with those structures, I have a DP-900 review video where I spend a lot more time on that sort of stuff.

So now let's start thinking about the databases that are actually available in Azure. Really, I could just install anything I want inside a virtual machine. So if we start thinking about the offerings available, we can start with IaaS: I could just create a virtual machine, Windows or Linux, and install database software inside it. That could be anything you want; it's just a VM running an operating system. Remember, there are different types of storage I can connect to it: I'd obviously have a disk for the OS, but then I could add other disks for the actual database store and the logs, and there's premium, there's ultra, all these options available to me. There are also different SKUs of VM, for example memory optimized, which is probably going to be very useful for databases where I need a lot of memory and maybe not so many CPU cores. Also remember that a virtual machine has temp storage, the local SSD, and that's where I'd put things like tempdb if it was SQL Server. But obviously, with this, I'm doing all of the management for that thing myself.

There's another option if I think about installing it myself: containers. As containers get more and more popular, I could also think about running databases in a container. Now, we often think of containers as having no state, and as we'll see when we talk through the database offerings, it's very common to separate the compute (stateless) from the storage (the state we care about). So I can still run databases in a container; what I just need to make sure of is that at the container level there's some kind of persistent store, so that if the container died, I'd still have the database files and the logs stored persistently. For example, this could be a persistent volume, linked via a persistent volume claim from a particular pod running in something like Kubernetes. So I can absolutely use containers to host databases; that's completely available to me.

But obviously, as we talked about, you're then responsible for everything: the database installation, the patching, the upgrading, backing up the database, making it highly available, thinking about disaster recovery, tuning it (hey, is there an index missing?). There are features to help you with some of that in many databases today, but you're responsible for all of those things.
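To make that persistent-volume point concrete, here's a minimal Kubernetes sketch. The SQL Server on Linux image is the real Microsoft one; everything else (names, sizes, the password) is a placeholder, and a production setup would use a StatefulSet and a secret rather than a bare pod like this.

```bash
# The pod (compute) is stateless; the PersistentVolumeClaim keeps the
# database files if the container dies and gets rescheduled.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sqldata
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 8Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: sql1
spec:
  containers:
  - name: mssql
    image: mcr.microsoft.com/mssql/server:2019-latest
    env:
    - { name: ACCEPT_EULA, value: "Y" }
    - { name: SA_PASSWORD, value: "MyP@ssw0rd!" }
    volumeMounts:
    - name: data
      mountPath: /var/opt/mssql    # database and log files land here
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: sqldata           # the PVC link from the pod
EOF
```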
So let's get a little more specific on the options I have available. That was the "do anything I want in a VM" approach; remember, ultimately I could take Oracle, for example, create a Linux VM in Azure, and just install Oracle into that IaaS VM. Now let's talk about the more Azure-focused offerings, and obviously SQL Server could be one of those; I could absolutely install SQL Server myself into an IaaS VM, and that's actually where we'll start. There are a large number of different offerings for SQL Server.

First, there are SQL virtual machines. You may wonder what's different from the DIY VM, and it really isn't that different; it's still a VM, so you can think of the previous option as the do-it-yourself model. Here we do the same thing again: we install SQL Server, or maybe we don't install it ourselves, because there are images in the marketplace with SQL Server pre-configured, or I can bring my own. What's different is this thing called extensions. There are agents available in Azure that help us do things (we have them on VMs by default to help manage the VM), and there's an extension specifically for SQL virtual machines that brings some additional capabilities to help with what we're doing.

Firstly, when I have this, it will actually show in the portal as a SQL instance, so it's exposed from within there. It automatically gives us SQL-aware backup, so we're actually backing up the databases. It gives us SQL patching: we can patch the OS anyway, that's always available, but now it will also go and patch SQL Server, the software itself. Plus it integrates with Azure Key Vault, and where that comes into use is the many features in SQL Server that have some element of encryption: transparent data encryption, TDE, can now use Azure Key Vault; things like Always Encrypted, where I do column-level encryption and the client holds the key, can use it too. So there are various integration points it handles for us automatically.

Now, today this is all around Windows; there's no extension for SQL Server on Linux yet, even though SQL Server obviously runs on Linux now. (When I think about containers, that model over there, it's actually very common to run SQL Server on Linux.) But today this extension does not exist there; that's just running in a VM. And I'm still responsible for things like deciding whether I need an Always On availability group for resiliency; the extension is just helping me on my way a little bit.

If we jump over to the portal: I can just search for "SQL virtual machines". I don't have any in this environment, but you can see this new option at the top, which is actually very recent. In the past, the SQL extension was only really there if I installed from an Azure Marketplace image. What I can now do is say: even if it's custom, even if I created this VM myself, I want you to make this available to me. So I can say, for this subscription, I accept the terms of the agreement (which obviously we read very, very thoroughly) and hit register. Now, for existing virtual machines and new ones that have SQL Server in them, it will add that extension and all those friendly management features for me.
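For reference, here's roughly what that registration looks like from the CLI rather than the portal: register the resource provider once per subscription, then attach the SQL IaaS Agent extension to an existing VM. The VM and resource group names are placeholders, and treat the exact flags (particularly the management mode one) as my recollection to verify against the current CLI docs.

```bash
# One-time, per subscription: the resource provider behind 'SQL virtual machines'
az provider register --namespace Microsoft.SqlVirtualMachine

# Attach the SQL IaaS Agent extension to an existing VM that already
# has SQL Server installed in it. Names are placeholders.
az sql vm create \
  --name MySqlVm \
  --resource-group MyRg \
  --license-type PAYG \
  --sql-mgmt-type Full
```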
So that's going to help with the overall management; we have that available.

Then we have Azure SQL Managed Instance. I'm moving left to right as we get more and more PaaS-like: obviously, just creating a VM and putting anything in it is IaaS; having SQL Server in a VM is still IaaS, although that extension starts to do some stuff for us. Azure SQL Managed Instance, I would say, is the first one that's really a PaaS-like offering. The previous options were essentially SQL Server in a VM; managed instance is about actually having a true PaaS offering.

With SQL Managed Instance, the first thing to think about is VNet integration: this deploys into my virtual network. That's one of the first things it gives me. I'm not now thinking about some public endpoint and how to neatly link it into my virtual network; it goes into my virtual network. If I think about how it works under the covers, it's actually a dedicated cluster on the internal Service Fabric, and what that also means is it's my own SQL instance, completely isolated. When I talk about the other SQL options later, they're integrating with shared infrastructure; with this, it's actually my own instance.

Now, like many of the things we're going to see, there are two tiers: general purpose and business critical. There are more features I'll talk about in a second, but I want to quickly cover the architecture. For general purpose: again, I've got my virtual network, my VNet, and it takes a particular subnet within it. There's always a gateway layer, gateways where the requests actually come in, with multiple instances of them, and an internal load balancer (ILB) to take the incoming client request and send it to a gateway. Then what I actually get is a virtual machine, and inside that VM are a number of processes, the big one obviously being SQL Server. There are other agents as well, for management purposes, because management comes in from the fabric. I have no access to this VM: I have access to the database functionality, but I can't go onto the virtual machine; I do not have that capability, it's a fully managed thing. What this VM is essentially doing is talking to storage for the database, the logs, etc., so we're separating the storage from the compute. In general purpose it's one VM, and that VM links to the storage; if anything happened to the virtual machine, it would spin up a new one and connect it back to the storage holding the database and logs, but obviously there'd be a period of downtime for that.

Then we have business critical, and business critical actually looks similar at the top: we always have that gateway layer, the same network structure, the same subnet, the same idea of the gateways sitting right there.
But if I drill down into it, what it actually does is have a bunch of virtual machines running SQL Server, and the storage it uses is now local: there are local SSDs on the nodes where they're running. What it uses is an Always On availability group, so it's basically replicating the data between them. There's no longer that external storage; I'm not separating the storage from the compute anymore. It creates the Always On availability group and does all of that replication for me, and now my high availability is much, much better. In general purpose I have a VM; if it goes down, if something happens, a new one has to spin up, so there's some downtime associated with that. With the business critical tier, it's spinning up four of these things using local, super fast SSD storage. So my performance is going to be great because it's local SSD, but obviously I'm paying more money; there are four of these running compared to one over there. I get better performance, resiliency, and availability, but I'm paying a lot more.

Having those four available actually gives us an additional feature: one of them has the read-write endpoint, which is what the gateway accesses, and one other can expose a read-only endpoint, so I can start to scale out some of my activities against the database. If it's a read-write operation, it goes to the read-write endpoint, the primary; the others are essentially secondaries, but one of the secondaries can be marked as a read-only endpoint, which can improve and speed up my overall activity.

If we go and have a quick look at one of these: I'll just go home, create a resource, and search for SQL. What I see is Azure SQL, and if I select create, this is where we see the different offerings: SQL databases, SQL managed instances, and SQL virtual machines. You'll see there's a single instance as my option here, or I could actually use Azure Arc as well. If I hit create, I pick a region where it exists, so we'll go to East US, and we can see by default it's saying general purpose. If I configure this, I just want you to pay attention to the price: say 932 dollars a month, because there's just one VM running that. Notice I can change the number of cores I want and independently change the amount of storage I want; it's not some blended thing, it's independent compute and storage. But let's take that back down. If I change this to business critical, watch the price: it shoots up in terms of how much it's going to cost me. So there's general purpose and there's business critical, and it costs more because it's doing different things in terms of what's actually running underneath; we see the difference in cost based on that.

There's also going to be a delay, because if I think back to what it's doing when I deploy a SQL managed instance, it has to create those virtual machines. So when I hit create, it has to go and spin up the virtual machines and complete that whole process.
In addition to SQL MI, one of the things I can also do is a SQL MI pool. All that's really doing for me is the same networking, the same integration, but it pre-creates either that single VM or that ring of four for the Always On group. Because the resources are pre-created, it can create an MI instance into the pool much, much faster, since the resource is just sitting there waiting, and I can share that resource. I can also now do things like create a two-vCore MI instance, which I can't do ordinarily (that's too small), but now I can essentially carve the pool up into different sized SQL MI instances, so I can go down to two vCores, which normally I couldn't do.

This is always the vCore model, and I'll talk a little more about that later, but the whole point of vCore is that I pay for the number of cores I want (the compute) and, separately, the amount of storage I have; they don't scale together. Now, I cannot change the deployment type: when I think about the other options we'll see later, like SQL single database or elastic pool, this is a managed instance deployment and I can't convert it to, say, SQL single database; I'd have to export and import the data.

So why would I use this thing, what's the whole point of it? The codename for SQL Managed Instance was actually "Cloud Lifter", which may start to give you a hint of what's so great about it, because this was by no means the first offering; it was added after the fact. It was really added because people had SQL Server and wanted to take it to the cloud, but didn't want to just run it in a virtual machine. The whole point is that this is near 100% compatible: if I have SQL Server Enterprise edition running on-prem, I should be able to just take it and run it in SQL Managed Instance. The only thing that's different is that I don't go in and manage the instance myself, but it has all the compatibility. It has SQL Agent (SQL Agent is at the instance level, so I don't get it on the other offerings we're going to talk about, but I have it here); all the DMVs, the dynamic management views; all the extended events; all of the surface area of SQL Server, all the instance-level features, which again I don't have on the other PaaS offerings because there I don't really have my own instance. I can even do native backup and restore to blob storage, so it's very easy to integrate with those things. If I think about absolute compatibility with on-prem, Azure SQL Managed Instance gives me that. That doesn't mean we can't use the other offerings; it depends on what we're doing and whether we're willing to make some minor modifications, but managed instance is there because, hey, I want to take a database I have on premises today and move it.

Now, even though there are these VMs underneath, I still have dynamic scale. What that means is, without any big downtime or anything else, I can change the number of cores, I can change between general purpose and business critical, I can change the capacity, and I can do that up and down. We still have those capabilities. But again, I can't change its type to, say, SQL single database; I'd have to actually move the data.
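If you'd rather script the managed instance creation than click through the portal, here's a rough sketch with the Azure CLI. Every name here is a placeholder, the subnet has to be one delegated to managed instances, and it's worth double-checking the flags against the current docs; expect the deployment itself to run for hours, for the VM-creation reason above.

```bash
# Sketch of creating a general purpose managed instance.
# All names/IDs are placeholders.
az sql mi create \
  --name mymi \
  --resource-group MyRg \
  --location eastus \
  --subnet /subscriptions/<sub-id>/resourceGroups/MyRg/providers/Microsoft.Network/virtualNetworks/MyVnet/subnets/MiSubnet \
  --admin-user miadmin \
  --admin-password 'MyP@ssw0rd!' \
  --edition GeneralPurpose \
  --family Gen5 \
  --capacity 8 \
  --storage 256GB
```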
So that's the SQL Managed Instance option. Again, it runs inside our virtual network, so it has very easy integration with other things, but it's all still managed for me, and all the general SQL features are available. So that's managed instance.

Then we have Azure SQL Database, and within that we also get into Azure SQL Hyperscale. Again, if we jump over to the portal to see these: if we change to SQL databases, we can see there's single database, there's elastic pool, and there's database server. The point here is that for Azure SQL Database, the database server is just a logical construct; it doesn't really exist as such, but it has to be there. I have to create databases under a database server, and the database server has certain firewall configurations and certain account configurations, so I have to have it, but it's on the databases that I provision all of the storage and the performance.

One thing I should have pointed out earlier, actually, is geo-replication. With managed instance there isn't click-of-a-button geo-replication. If I want geo-replication for managed instance, I have my primary managed instance over here, and what I'd do is create an empty managed instance in the other region and then set up the Always On availability group between them. There are auto-failover group features to help me, but there's no single "I want geo-replication" button; I have to go and create that second managed instance where I need it and then configure it to replicate.

So let's go back to our regular Azure SQL Database. We'll start with the idea of just a single database; there are a whole bunch of different SKUs here. A single database is basically a set of provisioned resource, and with that provisioned resource I get choices. You'll often see this DTU versus vCore model. The whole point is that a DTU was a blend: a blend of my CPU, my storage, my IOPS, etc., one blended number. Whereas with vCore it's separate: I separately say the number of vCores I want and the capacity I want, so I have more flexibility; if my workload is skewed towards CPU performance, I can do that here, whereas with DTUs I have to scale everything linearly together. Everything's moving to the vCore model: the newer offerings don't have a DTU option; serverless, hyperscale, and managed instance are all vCore models. So that's where everything is going.

No matter which of these I pick, I do want to point out that I can do dynamic scale; it's actually one of the really cool capabilities across all of these tiers. Once again we have general purpose and business critical; in the DTU model, general purpose is called standard and business critical is called premium, so the naming is slightly different. Standard/premium is the DTU naming, general purpose/business critical is the vCore naming. I'm going to say general purpose and business critical, but if you ever see standard and premium, those are the DTU terms for the same things.
So I can change the tier, I can change the sizing, and I can change the DTUs or vCores I actually want; I can switch between them. If I go and look at a database, let's jump over and quickly search for SQL, you'll notice, firstly, we have a server. If I look at my servers, I have a SQL server, but all it's really used for is that logical container: I have the admin, I can integrate with Azure Active Directory, I have firewall configurations, I have things like the TLS version that's required, and I can configure TDE in terms of whether it uses a service-managed key or a customer-managed key. What we really have inside are the databases, or the elastic pools. If I go and look at an existing database, we see "configure", and this is where you can see: right now I'm using the DTU model, with basic, standard, and premium. Basic is the super cheap tier; it's basically a dollar per DTU, so this costs me five bucks a month. I could change it to standard (notice it's that blended model) or premium, or I can switch to vCore. With vCore we get general purpose, we get business critical, we get provisioned, we get serverless, and I can independently move these things around and switch between them really whenever I want. So there's a great deal of flexibility there.

Okay, switch back over again. So we have the single option, a provisioned resource. We also have the elastic pool. The point, as the name really suggests: once again it's provisioned resource; I create this pool at a certain size, and then within that provisioned resource I add databases. The databases within the pool share it, and that can be important if I have different databases with different requirements at different times; they essentially get an auto-scale type capability because they have a pool of resources they can just use. If they're busy at different times, it's a great way to get good utilization of the underlying resource. Whereas with provisioned resource for a single database, if it's quiet at a certain time, I'm just wasting the money; it sits there. So we have the DTU and vCore models, and likewise that dynamic scale piece, available across all of these. So that's single and elastic pool.
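Since those portal sliders are just driving API calls, here's a rough sketch of that same dynamic scale from the Azure CLI: the same database moved between vCore and DTU models with an update call. All names are placeholders.

```bash
# Move a database to vCore general purpose with 4 cores.
az sql db update \
  --name mydb --server myserver --resource-group MyRg \
  --edition GeneralPurpose --family Gen5 --capacity 4

# Or scale it back down to a Standard (DTU model) S0.
az sql db update \
  --name mydb --server myserver --resource-group MyRg \
  --edition Standard --service-objective S0
```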
There's another option when I do single database. That was provisioned resource; I also have serverless. This is a much, much newer offering; it's vCore only, and it's also general purpose only, there is no business critical (more on the tiers in a second). As the name suggests, with serverless I actually get auto scale: I set a range, a min and max number of vCores. I tell it: this is the minimum, say three, and you can go up to ten, and it moves within that. At the same time, I also still have dynamic scale; remember, dynamic scale is me going and changing the scale, me moving the sliders. What this means is I can still change the min and max whenever I want, but internally it takes care of which value within that range it should be using at any one time, and bills me accordingly. So I have a minimum number, say three, and a maximum number, and what it bills me for is whatever it's currently using within that min-max range: that's the auto scale part. Dynamic scale would be me moving the minimum to two and changing the maximum to seven; I'm changing what that range can be, but it's still auto scaling within it. That's the difference between the two. Dynamic scale we have on all the others as well: I can change the provisioned resource, I can change the pool size, I can change the managed instance. But actually changing what they bill me automatically, that's auto scale, and that's serverless.

It also has auto-pause, so I can basically say: look, if you're not doing anything for an hour, go to sleep, and it will wake up when it needs to do something again. Be aware of what that means, though. What this is doing is, once again, separating the compute from the storage: to be able to pause, it stops the compute part while the storage obviously remains. When it wakes up there will be a bit of a delay; it can recreate the compute pretty quickly, but realize that all of these offerings put tempdb on local SSD, which is super fast (managed instance does this as well). If I stop the compute, that's where the tempdb and the buffer live, so if I pause and then unpause, the negative is a warm-up time: the buffer was emptied out and has to be repopulated, things have to be re-cached, so there will be a performance impact to that. Yes, it can pause, but realize that when it does, there's an impact on resume.

If we jump back over to the portal and go back to adding an Azure SQL resource, let's look at the serverless offering. Remember, it's only under single database: if we look at the details, we have the elastic pool option, we have single database, and under single database we have the hyperscale and serverless offerings. If I hit create, this is where I could select an elastic pool if I wanted, to create a new pool with a certain amount of resource, or I can do a regular compute and storage configuration for the database. I have to be in vCore, and here I can pick serverless. If I pick serverless, notice what it's now asking me for: a maximum number of vCores and a minimum number of vCores. I'm setting that range: hey, I'm willing to go up to ten, and I'm willing to shrink down; in this case 1.25 is the smallest I can do. And I have that auto-pause option, so I can say: if you're not doing anything for an hour, go to sleep; I can set it in minutes or even days. Just realize you're going to get that wake-up penalty: when something tries to talk to it, it will automatically wake up, but there will obviously be a performance hit as it warms up, getting that buffer re-cached and tempdb repopulated.

So we have serverless, single, and elastic, and again I have dynamic scale and can change the type at any time I want between these things, so that's great.
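Here's what that serverless configuration looks like as a CLI sketch: the min/max vCore range plus the auto-pause delay in minutes (-1 disables pausing). All names are placeholders.

```bash
# Sketch of creating a serverless database with a 1.25-10 vCore range
# and a one-hour auto-pause. Names are placeholders.
az sql db create \
  --name mydb --server myserver --resource-group MyRg \
  --edition GeneralPurpose \
  --compute-model Serverless \
  --family Gen5 \
  --min-capacity 1.25 \
  --capacity 10 \
  --auto-pause-delay 60
```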
There's a certain limit to how big these databases can be; I think it's like four terabytes at the biggest. So then there's something called hyperscale, and hyperscale really changes the model. Once again it separates the compute from the storage, because what hyperscale actually does is this: I have compute servers, and I always have one read-write endpoint, I have to have that, but I can optionally add read replicas, up to four of them. These are completely optional, I don't have to have them, though generally, to get the full SLA, I think you want at least two read replicas. And once again there's that gateway layer; I'm not drawing it for every one of these models, but they all have it, to take the initial request and either proxy or redirect it.

So we have the compute, and then, separately, there's this array of page servers, and the page servers are the ones that actually go and talk to the underlying storage. These scale out to as many as are needed to do the job, and they pass the pages up as the compute needs them. As for how this actually works behind the scenes: there's a log landing zone for the transaction logs, so the compute writes the transaction log there; then there's a separate log processing service (it's separating out the processing of the logs, which takes resource, into its own process) that goes and applies the changes to the page servers; and then there's a log store that records are moved into once they've gone through. So it's completely separating out how it does these things, and the storage underneath is just blob.

What this lets me do is 100-terabyte databases. I can still have just one compute node at the top, my SQL process with the read-write endpoint, or I can increase my scale for read operations by adding those read replicas, and improve my failover time. Because realize: if the primary goes down and I don't have any read replicas, it has to spin up a new compute resource to go and talk to the page servers. If I have read replicas and the primary goes down, it can just quickly fail over to one of those, so one of them becomes the read-write node, and I get a much better failover time.

It also does proportional fill. SQL Server can have multiple data files, and if you think about how we write to those, it's kind of a round robin; proportional fill basically just means: I'm going to go between the different page servers and write my data out. It's not sharding, it's not using a partition key, it's just filling them up; there's no concept of that type of processing going on here. But I have up to a 100-terabyte database now, so this is super cool.

However: I can change between all the other tiers, and I can change to hyperscale, but that road is one way. If I change to hyperscale, I cannot go back and change to something else. It even warns you when you create a hyperscale database; let's go and look at that for a second. Over here, if I go back to create database and configure the database, if I select hyperscale I have to check a box. It's telling me: look, you can do this, but you can't move to a different service tier in the future. I have to say: yep, I get it, I understand, I'm not going to complain and moan about it later on; I'm changing to it and I realize I can't change to something else later. So I have to really confirm that when I create the thing.
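The CLI version of that is just a create with the Hyperscale edition; the equivalent of the checkbox is simply that you can't update the edition away from Hyperscale afterwards. Names are placeholders, and I'd check the current docs for the replica-count and other hyperscale-specific flags rather than trust them from memory.

```bash
# Sketch of creating a hyperscale database; this edition change is
# one-way, same as the portal warning. Names are placeholders.
az sql db create \
  --name mybigdb --server myserver --resource-group MyRg \
  --edition Hyperscale \
  --family Gen5 \
  --capacity 4
```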
But if I want the biggest databases, if I want all these read replica capabilities, this is what gives me that. Remember, even with business critical I only have one additional read endpoint; this gives me four. Same with managed instance: remember, of its four replicas, only one can be made a read-only endpoint.

So let's talk about the models for these. We've got all these different types; hyperscale is its own model, but for single and elastic pool (and serverless) I have general purpose and business critical again. What do those actually mean? Let's draw those pictures out; it's going to look super familiar. (One thing I should also have noted about serverless: it's actually per-second billing, which makes sense when you consider it can be constantly fluctuating.)

Okay, so once again we have general purpose (that would be standard in the DTU model) and business critical (premium in the DTU model). With general purpose, once again it's separating the storage from the compute. I have my storage over here, and what I'm putting in it are the database files, the MDF, and the log files, the LDF. By default that's just LRS, locally redundant storage, stored in blob; that's where the state of my database lives. What it then does is create a containerized process that runs SQL Server and connects it to that storage. It also has, as I mentioned before, that local fast SSD where it puts tempdb, and some people use that for things like a buffer pool extension. But remember, this compute part is stateless (I'm drawing it in red just to be different, not because it's bad); there's no state here, because with general purpose there's just this one process running my database. Think of it as living within a physical box that's active for me, and there are other boxes sitting over there with spare capacity, lots of them. If something happened to the box running my database, it would just spin up a new container instance of my SQL process over there and connect it back to the storage. That's all it's going to do, but obviously there'd be some downtime; it's pretty quick, but there is downtime while it does that.

Now, quite separately from this, there is a backup, to read-access GRS storage (RA-GRS), and I think the retention is seven to 35 days, but I'll talk more about that in a second. The point is I get automatic backup with this; I don't have to do anything, it just does it for me.

So that's general purpose. I can also turn on availability zone awareness. If I turn on availability zones, it changes the storage to ZRS, and those spares are spread out, so because the storage is now available across the zones, it could fail over to a compute instance in another zone. It would also impact the gateways: there's always this gateway control ring, with multiple instances.
So if I do the AZ option, once again it breaks those up between zones. The control ring is where the incoming requests come in, and it either redirects them or acts as a proxy; in both of these models, this is where the operations come in before going onwards. So that's general purpose: one instance running my stuff.

Now, business critical, as you can probably guess: there are four of them. Remember, before, we separated the state (the storage) from the stateless compute; we're not doing that with business critical. With business critical, SQL Server runs in each of the four, and the local storage is where it actually puts the database and log files; the MDF and LDF for all of these are local, so it's super fast, one-to-two-millisecond latency, whereas over there, because the storage is remote, it's maybe more like a five-to-ten-millisecond latency. Because it's local, it's going to be much, much lower. So they use local storage and, as you'd probably expect, they create an Always On availability group between them, so it's replicating. One of them is obviously the primary, with the read-write endpoint; another one can have a read-only endpoint; the others don't. The replication is synchronous and majority-based: a majority has to acknowledge the transaction.

Once again there is a backup, taken from the primary, again to RA-GRS storage. I think it's always 35 days for business critical, whereas general purpose could be seven. Beyond that 7-to-35-day retention, I can turn on long-term retention: with long-term retention, backups can be kept for up to 10 years, and it's using something like the Azure Backup mechanism, so I can set a daily target, a weekly target, a monthly target, and even an annual target. It does delta-based storage, and it's still stored in place, where it is, so I can increase that protection. (There's a CLI sketch of a retention policy below.)

And once again, the same way I could add availability zones to the general purpose model, I can add availability zones to this model as well, where essentially it splits the replicas over three AZs to increase my resiliency, and that control ring is also split over those availability zones. So it adds in that resiliency.

If I step back for a second: we have these various models, and what I pick really depends on my requirements. Obviously general purpose is cheaper, but realize there's a performance difference between them, and there's also a failover time difference: with business critical the latency is a lot lower, and the failover is a SQL Always On availability group, so it fails over very, very quickly. The naming kind of gives it away: if it's just a general database and I'm trying to optimize my spend, great, general purpose; if it's really important and needs to be available at all times, we probably want business critical.
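Back on that long-term retention point, here's a rough CLI sketch of setting such a policy, using ISO-8601 durations (4 weeks of weeklies, 12 months of monthlies, 5 years of yearlies, taking week 1 as the yearly copy). All names are placeholders.

```bash
# Sketch of a long-term retention policy for a database.
az sql db ltr-policy set \
  --name mydb --server myserver --resource-group MyRg \
  --weekly-retention P4W \
  --monthly-retention P12M \
  --yearly-retention P5Y \
  --week-of-year 1
```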
Realize that within general purpose we then have different ways to carve things up: single, where I get a chunk of resource dedicated to that database; elastic pool, where I get a chunk of resource I can share between databases, so different databases can be busy at different times; serverless, where I give it a range, it bills me for what I'm using at the time, and I can even pause (realizing there's a bit of a warm-up penalty if I do); and then of course hyperscale, the completely different model, where we separately do the log processing and the page servers over the underlying storage, and I can have up to four read replicas to drive that additional capability.

Now, I can also add read replicas in other regions. For the elastic pool and the single database I can add a geo read replica, and I think I can have four of those in different regions (I think I can even add one in the same region). I pay for them, they are other instances, but I do have the ability to add those replicas. Note that I can't add a geo read replica for hyperscale: that's for the other offerings; for hyperscale today there's no option for that.

Okay, so, SQL, just finishing off: obviously there are different scale capabilities between the offerings; I have those geo-replication options available if it's not hyperscale; there are service level differences in terms of SLAs and the number of replicas I actually need; and there are different feature sets available. I'll put a comparison in the comments, but you can see that SQL MI, for example, has more features available in terms of compatibility, and that might be a big reason why I'd use it. So that's SQL.
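On that geo read replica point, here's what adding one looks like as a CLI sketch: the secondary gets created on a logical server in another region and is readable there. The partner server has to exist already, and all the names are placeholders.

```bash
# Sketch of adding a readable geo replica in another region.
az sql db replica create \
  --name mydb \
  --server myserver --resource-group MyRg \
  --partner-server myserver-westus2 \
  --partner-resource-group MyRgWest
```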
Then there are others: Azure Database for PostgreSQL, MySQL, and MariaDB. Now, absolutely, I could just install these into a regular VM and manage them myself, but there are these completely managed options, built on the community editions (they're open source) with Azure enhancements on top to take on the benefits of Azure in terms of resiliency and replication. These actually give me a higher-availability and cheaper option: if I deployed these in VMs I'd probably need at least two for high availability purposes, whereas here it deploys one instance in many cases, but it's using containerized technology under the hood, so if it fails it can spin up a replacement in a couple of seconds. It doesn't need multiple instances, so it can reduce my cost. There are various offerings here based on the different types of requirements I might have. These are all relational databases, all that same model; which one I pick really depends on preference, the experience I currently have, and maybe some differences in their feature sets.

So let's think about these Azure managed databases. Once again there are different ways I can deploy; just like there was general purpose and business critical, we have a similar kind of capability here. For all of these, it does things like auto-patching, and it will even do minor version upgrades. If it's a major version upgrade, I have to trigger that: I can do it in place (with a little bit of downtime, it's a metadata operation), or I can move the data to a new instance; I decide.

Thinking about the offerings, we start with single server, and this is the one that's really based on that containerized technology: under the hood it's using containers, which are super fast to spin up. Once again we have host boxes that can run stuff, and this is available for all three of the offerings: I can have Postgres, MySQL, and MariaDB (which itself is a fork of MySQL). Postgres I actually see very commonly used to move off Oracle; it's got something like 90% compatibility with PL/SQL for Oracle users, so it's very common to see that. But the point here is that, once again, the data is separate: the data is kept separately, with always a minimum of three copies, and then it spins up the particular database instance in one of those containers, which connects to the data and runs the workload. If it fails, it just spins up a new container and reconnects, so recovery is very fast even though there's only one instance available to me.

One thing I have with this is that same dynamic scale: I can change the resources available up and down, always thinking separately about the storage and the compute (obviously there's a max amount of storage a certain amount of compute can address), and I can change this while it's running. Again, all the state lives in the storage; the compute is stateless.

Additionally, I can add up to five read replicas, and they can be in the same region or a different region; it doesn't have to be a different region, I might want one in the same region so other applications and processes can go and read from it. It also has backup built into the offering, and I can optionally make that backup GRS (you don't have to, but maybe for data sovereignty or resiliency), with up to 35 days of retention. And for Postgres only, today, it can hook into Azure Backup for long-term retention, and I can set those same kinds of policies: keep it daily, weekly, monthly. Again, that's Postgres today; I think MySQL is coming, but it's not there yet. So that's the regular single server: Postgres, MySQL, MariaDB, all available to me.
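Adding one of those read replicas is basically a one-liner from the CLI; here's a sketch for a Postgres single server (the MySQL commands follow the same pattern). Names are placeholders.

```bash
# Sketch of adding a read replica to a Postgres single server.
az postgres server replica create \
  --name mypg-replica \
  --resource-group MyRg \
  --source-server mypg
```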
Then they recently introduced a new thing, in preview at the time of writing, called flexible server. The picture looks super similar: once again we have these various hosts, but where the other one was containerized technology, this is VM based. Essentially I'm now getting a virtual machine that runs my database instance, with spare capacity over there, and once again I've got my separate storage, but this time it's talking to a premium managed disk (and again there are three copies of the data; there are always three copies of any data I run in Azure).

One of the things I can add here, I believe, is availability zones: I can have an instance in a different zone, so it's the same region but a different zone, and it's replicating over to it, with that instance running in its own VM with its own storage. So I can do cross-zone synchronous replication to an instance in another zone, which gives me a zone-redundant high availability option. With flexible server I can turn that on; I can't do that with the single server option, so this adds that HA zone option. And where the other one could have five read replicas, this one can have ten (again, I pay for all of these, and dynamic scale applies to both of these offerings). For flexible server, the support today is Postgres and MySQL; there is no MariaDB support.

Now, why would I use this thing? It really comes down to the flexible part. There are things in the container-based world that, as a developer, I can't do: I can't run certain types of commands, I can't change certain types of configuration. What flexible server does is give me more flexibility, exactly as the name suggests: I can do more within that flexible server. The other cool thing is that, as well as the flexibility, I can also pause it, so I can actually stop paying for it: okay, I'm going to pause this thing, I want to stop paying. And I can use burstable VM types, the B-series, to really optimize my cost. So that's a nice option to use.

Then there is another offering: hyperscale. That name will seem familiar (obviously SQL had hyperscale as well), but they're completely different. Here what we have is a coordinator layer, and all the requests come in via that coordinator. What hyperscale actually does is this: I have worker nodes, each node has its own premium managed disks attached, and it just keeps adding nodes with their own premium managed disks, more and more. It will just keep scaling; what they say is there is no storage limit, it just keeps scaling those things out. This is the Citus technology, the open source technology it uses, and all of the nodes are read-write, because what we're actually doing here is sharding the data. I create a database and I pick how I'm sharding it, where that key is, and it separates the data out over those nodes based on the partition key I'm selecting. So I have essentially infinite scale.

I can also add high availability to this. If I add high availability, it adds a replica for each node, so it's obviously going to double that layer; that's a key point: I can add HA, which adds a hot standby, but it doubles the nodes, so for each of them there's now a replica. I get better resiliency and failover, because remember, each node has its own part of the data: if one fails, that part of the data is gone until it instantiates a new node and connects it to the premium managed disk to bring that part back. So if this is really important data, I probably want that hot standby to give me that resiliency.
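That sharding decision is driven from SQL itself: Citus's create_distributed_table function (a real Citus function) is how you pick the distribution key per table. Here's a sketch; the connection string and all the names are placeholders.

```bash
# Sketch of distributing a table across the Citus worker nodes.
psql "host=mygroup-c.postgres.database.azure.com dbname=citus user=citus sslmode=require" <<'SQL'
CREATE TABLE events (device_id bigint, occurred_at timestamptz, payload jsonb);

-- shard the table across the nodes, keyed on device_id
SELECT create_distributed_table('events', 'device_id');
SQL
```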
so, Cosmos DB. whereas the other databases i've talked about - SQL Server, PostgreSQL, MySQL, MariaDB - were existing databases that have been taken to the cloud, optimized and tweaked, they weren't written for the cloud. one of the pain points you often see is that the cloud gives me all these different regions, and as our applications branch out and maybe become active-active, the database is often the blocker: if there's only one copy that's read-writable, every transaction has to come back to that one instance. if i run compute in west us against a database in east us, there's latency doing the write over there. so there's often this desire: can't there be some globally distributed database? there's always a trade-off - i can't have the latest data with no latency - but Cosmos DB was written for global distribution, and it supports multiple models.

remember the different data types i drew at the start: basically everything we've drawn up to this point was relational, but what if the data isn't relational? we talked about document, key-value, graph - all those different options. with Cosmos DB those models are supported, and there are different APIs to work with the different types of data. for document data there's the SQL (core) API and the MongoDB API. for columnar - where we store the data in that direction - there's Cassandra. for key-value there's the Table API (and etcd). and finally for graph, where we have nodes and edges, there's Gremlin. so if i've got an app already using one of these APIs, i can use it to talk to the matching data model in Cosmos DB.

now, it's built around a partition key. we talked about sharding previously to separate the data - Cosmos DB is built entirely around that idea: if i want huge amounts of storage and i want to be able to efficiently query and work with it, we have to be able to separate the data out. from a structural perspective it's very different from all the others, where we could see a VM or a container and something happening - here we see none of that. with Cosmos DB, the first thing i do when i create a store is pick the partition key. as i write data, each item carries a value for that partition key, and that value is used to shard and therefore distribute. as i write data it creates logical partitions: each unique partition key value gets its own logical partition.
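here's a minimal sketch of that with the azure-cosmos Python SDK; the account URL, key, database, container and partition key path are all hypothetical placeholders:

```python
# minimal sketch: creating a Cosmos DB container with a partition key
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(
    "https://myaccount.documents.azure.com:443/",  # hypothetical account
    credential="<account-key>",
)
db = client.create_database_if_not_exists("appdb")

# the partition key is chosen at container creation and drives sharding;
# every item written must carry a /userId value
container = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/userId"),
    offer_throughput=400,  # provisioned request units per second
)

container.upsert_item({"id": "1", "userId": "alice", "total": 42})
```

every item with userId "alice" lands in the same logical partition, which is exactly the behavior described above.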
there's no limit on how many logical partitions i can have - it's one logical partition per unique key value, as many as that implies. underneath that there are physical partitions - this is how it's actually storing the data and having compute to interact with it. it creates these physical partitions, again lots of them, and maps certain logical partitions into certain physical partitions. there are limits on a physical partition - i think it's 10,000 request units per second that a partition can support, and up to 50 gigabytes of storage.

request units are how you scale Cosmos DB: you say how many request units you want, and there's an autoscale capability where you give a range - i want to be able to go up to this many - rather than provisioning a fixed number. the question is always hard to answer: based on how i'm partitioning my data and how efficient my queries are, it can be very difficult to know the right number of request units, and if i run out it throttles and slows my queries down - that's why the autoscale capability now exists. it uses a hashing algorithm to distribute my logical partitions over the physical partitions, so it's really important that i get the right partition key. if i have a poor partition key - say i picked date - then sure, i might be spread out over lots of physical partitions, but if most of my operations are against the current day, on any particular day everything hits one physical partition, which really isn't very good.

the way these physical partitions break up, each has a leader, followers and a forwarder, because the point is i can have these geographically distributed. i can have a global replica over in another region with its own sets of these - its own leader and its own followers and forwarders - and this forwarder sends to a leader in another geography. so you can see that at any one time there are always four replicas of my data in a region - no matter what happens, there are four replicas - and the forwarder again sends to whatever other regions i have.

so that's the structure of Cosmos DB: i don't think about servers or what type of capacity or criticality - none of that. i pick a number of request units; that's all i actually do, there's nothing else to configure or worry about. but those request units are tricky to get right, so i'm going to do analysis on my queries and think hard about the right partition key. there's also a change feed, and sometimes i'll even duplicate the data so i can have a different partition key when i want to interact with it in different ways - which is very unusual, we wouldn't normally do that in any other kind of database - but because the partition key is so important in how we spend request units, and because the storage is super cheap, i may actually duplicate the data from the change feed under a different partition key to optimize how i want to interact with it.
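to see why the key matters so much, here's a hypothetical continuation of the earlier sketch: a query scoped to one partition key value is served by a single partition, while a filter that ignores the key has to fan out:

```python
# continuing the hypothetical `container` from the sketch above;
# a query scoped to a single partition key value stays on one partition
for item in container.query_items(
    query="SELECT * FROM c WHERE c.userId = 'alice'",
    partition_key="alice",
):
    print(item)

# a filter that ignores the partition key must fan out across every
# physical partition, which burns far more request units
for item in container.query_items(
    query="SELECT * FROM c WHERE c.total > 40",
    enable_cross_partition_query=True,
):
    print(item)
```

if the second shape is your dominant query, that's exactly the case where you might use the change feed to keep a second copy of the data partitioned by a different key.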
then i have configurable consistency - this is the key to Cosmos DB. i basically have a little slider i can set: what's the consistency? it's a trade-off of consistency versus latency - again, you can't have both; i can't always have the most up-to-date data with no latency. there's a speed of light, a speed over fiber - it's impossible. so i pick what i care most about: do i care about very low latency, getting a result quickly even if it's a little bit out of date, or must i have the latest data? so you can pick, and i can go all the way from strong consistency - strong means guaranteed, wherever i am, whatever that read is, i'm going to get the latest data - all the way to eventual, which basically says you'll get it at some point, who knows when.

there's a middle value of session: there's a session token i can share between processes, and it guarantees read-your-own-writes - i'm guaranteed to see the latest read for any write performed within the session. multiple processes can share the same session token; that might be very common within a certain region. then there's bounded staleness, which says you're allowed to be at most this many operations or this much time behind - so you can get some delay, but it's bounded. and there's consistent prefix, which says you'll see writes in the right order - you'll get them when you get them, but never out of order.

these interact with multi-write. when i have those geographical replicas - as many as i want - i can write to any of them, and the consistency model i pick drives when the others get those copies. but i cannot have multi-write if i'm strong. strong says i must always get the latest result no matter where i am, which means there's no point in multi-write because it has to be synchronous replication: strong means there's one writable copy, and every write has to be synchronously pushed out to the other copies so they can always give that guarantee.

this is actually shown in the portal. if i jump over and look at my Cosmos account, i can see my default consistency, and we have eventual, consistent prefix, session, bounded staleness and strong, with little music notes showing what each means. with strong, watch the music notes: they all hit at the same time, no matter where the replicas are - they're always getting the same data. with eventual, you'll get the notes, but they can actually arrive out of order - notice here south central gets a note late - so yes it will get the data, but there's no guarantee on the ordering. with session, it's showing me the session: everyone in the same session is guaranteed to see the same thing, reads and writes in the same order. with bounded staleness, remember i can pick the lag in terms of number of operations or time: here they're arriving in the same order, but they're allowed to be out by that certain number of operations or seconds. with consistent prefix, if writes were performed in an order, they'll never be seen out of order, but again there's no real guarantee of when they arrive.

so i get to pick how up-to-date i need to be when i use Cosmos DB. and when i want applications that are distributed regionally and always able to read and write against a local copy, Cosmos DB is really what i'm going to be thinking about - it's really the only solution here. there were some third-party products that give me some of this ability, maybe through multi-write type features and certain types of synchronization, but this was built for the cloud; it just natively gives me this.
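if i'm setting this from code rather than the portal slider, here's a minimal sketch, again assuming the azure-cosmos SDK with placeholder account details; note that a client can only relax the account's default consistency, not strengthen it:

```python
# hypothetical sketch: requesting session consistency explicitly
# when constructing the client
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://myaccount.documents.azure.com:443/",  # hypothetical account
    credential="<account-key>",
    consistency_level="Session",  # Strong, BoundedStaleness, Session,
                                  # ConsistentPrefix or Eventual
)
```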
okay, quickly on encryption. these all support encryption at rest, including bring-your-own-key: everything i've talked about - Cosmos DB, PostgreSQL, MySQL, MariaDB and SQL - can all use a customer-managed key. for things like SQL there's transparent data encryption (TDE), which is enabled by default, and as you saw me do, at the server level i pick whether it's customer-managed or service-managed; at the database level i can turn it off, though i don't know why you would. there's also always encrypted for SQL. always encrypted happens at the client side: the client encrypts the data, and it can use either a certificate in my local store or Azure Key Vault to hold the key material - so i'll need rights to that key vault. then there's data masking: data masking is not encryption, it's a function that says don't show the value unless the user has the right to unmask the data.

let me quickly show these. what i'm showing here is a notebook in Azure Data Studio, and i can see i'm running Microsoft SQL Azure RTM. it's evergreen - one of the things i should have pointed out is that a feature of Azure SQL Database is it's evergreen: there are constantly new features, it's always being updated, and i don't worry about any of that. what i have here is a simple select * from the justice roster table. what i want you to see is that the mother's name is just encrypted cipher text - but what i really want you to focus on is the fact that for both bruce and clark it's the same value. this is deterministic encryption rather than randomized, which means i can still do things like equality searches on that column.

now if i jump over to another system and run the same query, i can see they have the same name in the clear. the reason is that if i look at my connection and go to my options, i have always encrypted turned on, so it goes and gets the key from Azure Key Vault and decrypts the data for me - and notice it's the same name, which is why we saw the same cipher text before.

additionally, i have another user, aquaman, and if aquaman looks at the data, the last name is masked out, because on this i applied a data mask. if i grant aquaman the UNMASK permission and run the query again, now he can see the names - this isn't encrypting anything differently, it's just a function on the column that masks the data. if i remove that permission again, he can't see the last names - no one really wants to trust aquaman.

i can configure all of this from in here: i can say encrypt columns and drive that encryption - pick the columns to encrypt, pick whether the key lives in a local certificate or in, for example, Azure Key Vault, and pick deterministic (the same value always produces the same cipher text, so it stays searchable) or randomized (which limits some of the operations i can perform on it). so that's another type of encryption i can do. and then there's data masking, which i can drive from in here, but also from the portal: if i go back and look at my database - the server will link to the databases - from there i can see the dynamic data masking blade.
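if i'd rather script the masking than click through, here's a hedged sketch driving the T-SQL through pyodbc; the server, database, table and user names are illustrative, but ADD MASKED and GRANT/REVOKE UNMASK are the real T-SQL constructs:

```python
# hypothetical sketch: dynamic data masking and the UNMASK permission
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=heroes;Uid=dbadmin;Pwd=<password>;Encrypt=yes;"
)
cur = conn.cursor()

# mask the last name column with the default full mask
cur.execute("""
    ALTER TABLE dbo.JusticeRoster
    ALTER COLUMN LastName ADD MASKED WITH (FUNCTION = 'default()');
""")

# aquaman sees the mask until granted the UNMASK permission...
cur.execute("GRANT UNMASK TO aquaman;")
# ...and taking it away re-masks the data on his next query
cur.execute("REVOKE UNMASK FROM aquaman;")
conn.commit()
```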
we've already masked that last name - i drove it from the tooling, but i could do it through the portal as well. also super powerful is data discovery and classification. this is an additional feature, but it will actually go and find data and suggest: hey, i think you should classify these columns as these types - confidential, confidential-GDPR - and i can have my own classifications too. based on those classifications i might then do enhanced auditing, or go and add things like always encrypted or data masking. so it's super important to have those encryption options available to me. there's also TLS configuration for encryption in transit: i can pick a minimum TLS version, and i can even drive that with Azure Policy to require a minimum TLS version to be set.

finally, i just want to touch on data flow. organizations commonly think about extract-transform-load (ETL), extract-load-transform (ELT) and other types of data flow: from the source of the data to the sink where it's going to end up, and what i want to do with it along the way. think about data lakes. where this really comes from is that i can have a whole set of different sources of data. these could be structured sources - databases, which could be on-prem via gateways or in the cloud - i might have semi-structured data, maybe that parquet format, which is columnar, and maybe unstructured data.

what happened in the past is we would transform the data first: we'd get it and we'd have to transform it, because storage was expensive, so i needed to transform it into a state i could do something with. but that only answers the questions i know of today - for future questions, i've lost the original data. so what's super common now is a data lake, and this is ADLS Gen2. it sits on top of blob, but it has a hierarchical namespace, which means i now have true folders: i can move things once i finish processing them. there are no folders in blob - blob is an object store, it's flat; i can fake folders by putting the path in the name, but then i can't rename anything, i have to copy and delete, which is very slow and slows down my processing. so my data lake has a hierarchical file system, and i have POSIX-style ACLs, so i can give people access only to what they should have access to.

essentially i bring all of this into my data lake, so i now have the original version: i'm doing the extract and then a load first, keeping the original format - and it could be anything: parquet files, json documents, xml, text files, images, literally anything i want. i also get the benefits of blob, like tiering: i might have requirements to keep data for a really, really long time, so i can tier it off to cool or even archive. once i get it into the data lake, remember there are still things i want to do: with extract-load-transform, the next step is the transformation - it could be adding joins, it could be sorting, mapping, changing formats, all these different things i need to do next.
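as an aside on that hierarchical namespace point, here's a hedged sketch with the azure-storage-file-datalake SDK showing what it buys you - real directories and atomic renames; the account, filesystem and paths are hypothetical:

```python
# hypothetical sketch: ADLS Gen2 directories vs flat blob storage
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mylake.dfs.core.windows.net",  # hypothetical account
    credential="<account-key>",
)
fs = service.get_file_system_client("raw")

# a true directory, not a name prefix faked on a flat object store
landing = fs.get_directory_client("landing/2020/11/10")
landing.create_directory()

# an atomic rename; on flat blob storage this would be copy + delete
landing.rename_directory(new_name=f"{fs.file_system_name}/processed/2020/11/10")
```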
this is where things like HDInsight come in. HDInsight is really different types of clusters. it could be Hadoop for MapReduce, which breaks large data processing into two phases: first it's mapped - we create key-value pairs from all our massive amounts of data - then it gets shuffled around, and then it's reduced by summarizing based on the keys from the mapping phase. we might have Spark for massive transformations - the difference from MapReduce and Hadoop is that Hadoop is all disk based, while Spark is memory based. there's interactive query, where i can query things directly on the lake; there's Hive, there's HBase, there are all these different engines, and HDInsight supports all of them. we also have Databricks - Databricks was mostly created by the people who created Spark; Spark is an open-source solution, and Azure Databricks is a complete managed offering of it. so i can use that to transform all of my data.

then i'm probably going to put it in some kind of data warehouse - that's where the denormalization happens, into a format i want to do something with. i may have a data mart, which is typically a subset of the warehouse focused on one particular part of the business. then i want to do analytics, and this is where we have Azure Analysis Services. the nice thing here is that these data warehouses all have a certain structure, and as the business user wanting to use the data, i have no clue what that structure is. so i can create semantic models, and all a semantic model says is: here are all these different tables; let's maybe rename certain columns, add the links between them, add other types of calculation. now, as the business user interacting with the model, i don't care about the format of the raw data - it's exposed to me in a very business-friendly model. it does the analysis as well - there's an engine there to perform analysis at scale via those models. and then i'd probably visualize it, with something like Power BI. Power BI can work from those models, it can work against a data warehouse, it can even work directly against Databricks or against the lake itself. we have all these different layers that really make this thing up.

now, when i think about this process, obviously there are lots of moving parts. from that extract-transform-load all the way through, we need a control flow to actually go and get the data, call things to transform it, then write it - and this is Azure Data Factory. we traditionally think of Azure Data Factory as the control flow: it calls those things, it doesn't normally manipulate the data itself, but it goes and gets it, copies it to places, calls other services. however, Azure Data Factory now also has data flow capabilities through a UI: there's a graphical interface where i can drag and drop different steps, and behind the scenes it's using Databricks. i create this visual flow of the data - the mappings, the joins, the sortings - and it compiles that down to Scala behind the scenes and passes it to Databricks to actually do the work; Scala is kind of the native language of Databricks, though there's other support as well. so i can now think of Azure Data Factory as a complete data integration solution: purely through data factory, from source to sink, i can control what's being done, but then also map out the data mappings and do the transformations to get the data to its final format.
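for a feel of the kind of transform Spark or Databricks runs behind those data flows, here's a hedged PySpark sketch - read raw parquet from the lake, aggregate, write a curated copy back; the paths and column names are hypothetical:

```python
# hypothetical sketch: a Spark rollup from raw lake data to a curated zone
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-rollup").getOrCreate()

raw = spark.read.parquet("abfss://raw@mylake.dfs.core.windows.net/sales/")

# the denormalized rollup a downstream warehouse or report might want
rollup = (
    raw.groupBy("storeId", F.to_date("soldAt").alias("day"))
       .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)

rollup.write.mode("overwrite").parquet(
    "abfss://curated@mylake.dfs.core.windows.net/sales_daily/"
)
```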
so there's Azure Data Factory. now obviously there's this analytics and visualization part as well, and what you may have heard of - something that does all of this stuff together - is Synapse, Azure Synapse. it takes the data warehouse, and where in the past we had to plumb all of this together very separately - i had to think about the security, the networking, all the different components to make this work - what Synapse has done is basically bring it all together: it uses Data Factory underneath for the pipelines, it uses the SQL data warehouse engine, it uses those analytics capabilities, but it brings it all under this one workspace, which is Synapse, and it does the plumbing behind the scenes for us. additionally, it has some serverless on-demand capability: one of the nice things is i can run a query against the data lake directly using on-demand compute - i don't have to use my provisioned compute for those things. so i can think about it like this: Data Factory is the control flow, but it now also has data flow via Databricks, making it a complete data integration solution; Synapse uses Data Factory and adds in those other components, but it's all about getting my data from the source to the sink, and then there are lots of different services for the analysis - Azure Synapse Analytics and Analysis Services bring those things together.

so that was obviously a huge amount of stuff, and as always, questions below. we really do think about this complete story and picture: all the different types of data models we might have. the SQL offerings - hey, i can run it in a VM; managed instance gives great compatibility with on-premises and runs inside our virtual network; evergreen, automatically upgraded for us. then we move into the actual PaaS offerings, where we can surface private endpoints into our network - some of them use VNet integration for that communication - so i can still privately utilize these different tiers of service, chosen based on performance and resiliency requirements. if i'm using the open-source databases, today i can still get a fully managed offering with those different options - maybe a hot standby in another availability zone, read replicas in the same or different regions, and infinite scale with the hyperscale offering. i should have pointed out that the hyperscale offering is Postgres only today - when i say hyperscale here, that's PostgreSQL. but these are all relational databases, and with those we're focused on that one read-writer. then Cosmos DB: many, many different data models supported, many different APIs, and i pick the consistency i want - so i could absolutely have multiple writers across all the different regions, picking that trade-off between consistency and latency. and then, thinking about the complete picture, the data flow - again not databases, but how the data flows through this model - we have that source to sink: if i need to get data and copy it somewhere, Data Factory; to get data, transform it and put it somewhere, Data Factory; and Synapse sits on top of all of it to bring it all together. so again, i know we covered a lot. i hope this was useful, and until next time, stay safe.
Info
Channel: John Savill's Technical Training
Views: 38,037
Keywords: azure, azure cloud, azure sql database, sql server, azure sql managed instance, azure sql mi, postgresql, mysql, mariadb, cosmos db, data factory, databricks, synapse, sql analytics, azure database, database
Id: Af8s5uaMLgY
Length: 107min 27sec (6447 seconds)
Published: Tue Nov 10 2020