DP-900 Data Fundamentals Study Cram v2

Captions
Hi everyone, welcome to this DP-900 v2 study cram. I did a version a couple of years ago and I felt it was time to update it. As always, if this is useful, a like and subscribe is appreciated. Now, the DP-900 is one of those fundamental certifications. If we quickly jump over here, we can see the basic information: yep, this is the Azure Data Fundamentals certification. It's very broad; you don't need to know any particular thing in super detail. But if we look through this page, it's going to show us the study guide. The study guide gives you the key information to help you prepare for the exam: what are the core skills I want to be able to tick off and say, yes, I know this. So if we scroll through, it's going to show the overall skills measured, the functional groups and the exact skills. I want to be able to go through this entire list and say, yep, I know these. Then it's also got some sample questions, so it's nice to go through those to get an idea of the types of things I should be able to answer. And then, so I'm familiar and not panicking about the environment, there's an exam sandbox just so I can get used to the graphical interface and the styles of questions I might see. If I keep going down the page, it shows how I can actually go and schedule the exam, but also free online training, and I would definitely, definitely recommend you go through that to get a good idea, help learn the information, and make sure you are actually successful. My goal for this study cram is really, in hopefully an hour or two, to refresh some of those core concepts. Maybe watch it just before you take the exam to increase your chance of being successful. Again, you don't need to know the information super deeply. It's a 60-minute exam. The number of questions may vary; I think I had about 45 when I did it and finished very quickly. There are no case studies, there are no labs. It's very much multiple choice: hey, what do I want to use for this particular thing? Don't stress out if you don't know an answer. Eliminate the obviously wrong ones and then take an educated guess. Azure isn't trying to trick you, so pick what seems most logical based on what you do know and have a best attempt at it. At the end of the day, if you don't pass the first time, it will show you where you were weakest; you'll go back and you'll get it the next time. So with this, let's jump into it. Now, when we think about Azure services, one of the things I want to quickly review is how we interact with Azure. I as a human being am sitting here at my machine, and there are different ways to interact with Azure. I can use the portal. The portal is very good because it's intuitive: I can start to learn about what's available to me, I might go and do some investigations. I shouldn't really be creating things in a production environment with it, though; there we want to be using templates. There are things like PowerShell and the CLI; these are great for doing automation, creating scripts to do things. And then, yes, we have things like templates. Templates have the benefit that they are declarative: I state what my desired end state is, and it means I don't have to tell it how to do something, I just say what I actually want to happen. There are many types of templates: there's the Azure Resource Manager JSON template, you'll hear about things like Bicep, and then things like Terraform, which is third party but again a declarative technology.
But these are all files that state what I want the end state to be. And the benefit of these is I can put them into Git repos, for example GitHub or Azure DevOps repos. I can use pipelines to deploy them, I can do change control on them, so we really want to focus on this. But no matter what I do, when I want to actually interact with that cloud service that is Azure, everything goes through something called the Azure Resource Manager. That is the RESTful API through which I talk when I want to do things in Azure. And when I think about Azure and that Azure Resource Manager, there are really two levels to our interactions with many, many different things. For example, there's the idea of a control plane. The control plane is when I'm interacting with the Azure resource itself, so I can think about management: I want to create a resource, delete a resource, maybe modify a resource. And then, completely separately, what we have is a data plane. If it's a blob, if it's a storage account, if it's data in a table in a database, we have a data plane. And the really important thing here is that at the control plane we have things like role-based access control. That's at the resource level; it gives me permissions to perform things on the resource. However, that does not mean it gives me permission at the data plane. Just because I have access to the resource doesn't mean I have access to the data. Now, some services do support role-based access control at the data plane as well. So there could be data plane role-based access control, or the service might have its own key or shared access signature, it might have its own ACLs, access control lists, as ways to define who can do what. But the key point is to realize that just because I have some permission at the control plane does not mean I have permission at the data plane, and we can see this very quickly. If I jump over to the portal and look at something simple like a storage account, and I can pick any storage account I want, and look at the access control, what we can see is there are many roles available. Now, most of these are focused on the ARM control plane. If I look, there are actions, and those actions are against the resource provider, at the control plane. But none of these would actually give me permission on the data in the storage account itself; I could not interact with a blob or a queue or a table. But storage accounts are one of the services that do have data plane RBAC. So look at a data role like Blob Data Contributor, Blob Data Owner, Blob Data Reader; there are also ones for queue and for table. If I look at one of these, this one doesn't happen to have control plane actions, but it has data actions, so it grants permissions at the data plane. So if you see questions about, hey, you have this control plane role, what data does this let you see, typically the answer is none: I need something at the data plane to be able to actually interact with that data. That's really the important key part of that. Now, when we think of the types of service we can have in Azure, there's this whole idea that there are different layers in Azure and different responsibilities. For example, there's obviously hardware: the physical fabric, storage, compute clusters, networking. Then there's a hypervisor, which is based on Hyper-V in Azure, and all of the different services then build on top of that.
Now, what you also then get is things like an operating system, a runtime, middleware, maybe an application that actually provides business value, and then data. And so the different types of service will do different things for you, and you have different responsibilities. So one of the things we always think about is infrastructure as a service. If I'm using infrastructure as a service, you're responsible for this stuff: picking the OS, managing the OS, so things like backup and patching, and, hey, I go and install the database inside that IaaS VM. I'm then responsible for installing the database application, I'm responsible for tuning it, I'm responsible for minor updates, for major updates, for upgrades. I'm doing all of those things. Whereas if I think about a PaaS offering, platform as a service, the line of responsibility now becomes: I really only care about writing my app and worrying about my data. Azure is now taking care of those parts; I don't see an operating system or runtime or middleware, that's done for me. Now, both of these apply when we think about data services in Azure; we have a choice. For example, as I mentioned, I could absolutely install a database in an IaaS virtual machine. So with my database product, hey, I've got my database that I want to install, and I just do an installation. But then I'm thinking about things like, well, when do I back it up? Updates, tuning, maintenance, all of those things are my responsibility, both for the operating system and the database itself. As opposed to the other world, where we have database as a service: the database is managed by Azure, it's installed by Azure. There are many examples of this. There's Azure SQL Database, Azure SQL Managed Instance, there are all the managed databases like Postgres, MySQL, MariaDB, etc., and then there's Cosmos DB. The key point is that when I use those PaaS databases, databases as a service, I'm not worrying about an operating system, I'm not worrying about patching the database or the minutiae around the backup. They might even be auto-tuned; there might be other services built on top of that, security built on top of that. Those are really just done for me. And so when I think about the benefits of these kinds of scenarios, when I have database as a service, we really do think about things like, well, they're evergreen. By evergreen, I mean they're constantly getting new functionality, they're constantly getting those updates; it's just performed for us. I'm going to have features like auto backup, and maybe there are some configurations I can do around retention, maybe I can manually add in extra backups, but it's just going to be done for me. They're often going to have things like native high availability, and maybe I have options to add asynchronous read replicas in other regions, for example. It's going to vary by service, but those are the basic ideas behind them. They might have things like auto scale; I have a lot of flexibility in the scalability of those solutions. I have less management to do, and that's really a key point. When we move from this world of IaaS to PaaS, it's about a shift of responsibility. In that IaaS world, I'm responsible for pretty much everything. There are things to help me, but it's my responsibility. As soon as I start to use the database offerings in Azure, I'm responsible for a lot less.
A lot of it is done for me, and I can really focus on the key, maybe design, elements of the database and using it for my application. Now, a key point: when I have these sorts of solutions, for the most part there is no OS access; it's doing those things for me. Some of them might let me stop and start the compute, for example, to save money. So those are the different types of service, and most of what we'll talk about here are these database-as-a-service offerings. Again, you can install databases into an IaaS VM, and there are even some features added if I have things like SQL Server installed in an IaaS VM. But most of the time our goal when we use the cloud is to only be responsible for the things that help us differentiate and give us business value. I don't want to be managing an OS if I don't need to. So as much as possible we try and move up through the layers: hey, I'd rather be using a database as a service and just not care about patching an operating system or worrying about securing the operating system. Now, whether it's an IaaS or a PaaS database, realize I still need to be able to access it in some manner. So once again, I can think about, well, OK, I'm sitting on a certain machine, and that machine is on a certain network. Great. And then what I'm trying to get to is that database. So there's some database service, and whether that database service is a PaaS offering or happens to be running in some IaaS virtual machine, whatever it is, there's a set of processes that offer me access to that database and that connect to some storage. I can't see that storage, especially with the database-as-a-service offerings, but I want to be able to get to the database. So how do I do that? Step one is I have to be able to find it. It's going to have a name, a DNS name. So firstly there's the idea of DNS resolution: I have to be able to resolve the name to an IP address. And we want the name because often when we talk to these things it's going to be encrypted TLS communications, which means the certificate has the name of the database, not the IP, so I have to be able to connect to it using the name of the database. So if I'm not, for example, on the virtual network it's tied into in some way, if I'm on a remote network, hey, I still have to have DNS resolution to it. And often these have a public endpoint: some IP address that's accessible via the Internet. So once I resolve the IP address, well, then I have to be able to get to it. I have to have a network path to talk to that IP address. Which means if there are firewall configurations, for example, I have to make sure the IP address I'm coming from, maybe it's my direct IP, or more likely there's some network address translation on my network that is visible, is allowed; I have to enable that connectivity. But also, we may use things like an Azure virtual network. An Azure virtual network is a set of IP ranges; it's a construct that lives within a subscription in a certain region. Maybe we don't like this public endpoint. I don't want a public endpoint that anyone could connect to. So instead we have this capability to create something called a private endpoint that points to a specific instance of a database service. So if I'm sitting on this VNet, or anything connected to the VNet, I can then talk via this address, and I could essentially block off the public endpoint if I wanted to. Now, if I'm sitting on a remote network, obviously I have to have network connectivity
to get to that virtual network. That could be something like a site-to-site VPN between maybe my office location and that virtual network, it could be point-to-site if it's just me at home, it could be something like ExpressRoute private peering. There has to be a network path to get to the VNet that has that private endpoint. And then of course, once all of that is done, I still need permission. Just because I can get to it doesn't mean I'm allowed to interact with it. So I have to be able to resolve the name, I have to have a path to either the public endpoint or a private endpoint if I've enabled that, and then I have to have some way to authenticate and be authorized. Authentication is proving, hey, I am who I say I am. Authorization is: I am who I say I am, so what am I allowed to do? So I have to have some method to get permissions. And again, maybe that's integrating with Azure AD, maybe it's using some shared access signature, maybe it's a key. There are many different things I can do, but I have to have that in place to use any of the data services. So there's no magic involved in this: I have to be able to resolve it, get to it, and then have permission to it. Those are just the bare fundamentals we have to understand for the interactions, whether that's an IaaS VM with a database installed or just a database as a service; I have to be able to do those things. OK, so that's some of the basics around Azure. But then what do I do? What do we actually have? What are the types of data that I may actually want to interact with? So we have this idea of data, and it would be nice to think of it as just one type of data. And what is data? I can think of data as a collection of information, a collection of facts. That could be numbers, it could be log entries, it could be descriptions, it could be observations about things. Those facts may follow a very, very strict structure. For example, if a fact is a date, a date has a very fixed structure. If it's an address, that has a fairly fixed structure, but even that may vary based on country: hey, some have a post code, some have a ZIP code, some have a state, some don't have a state, some have a county. So there might be some flexibility. Sometimes data is semi-structured: there are certain attributes that are common, but they might be used in different ways. Some people have one email address, some people have ten email addresses. Some people have additional certifications or qualifications. So we need some flexibility in how we store the data. Think of a grocery store: hey, an orange has a very different set of attributes to a box of cereal. So there might be very flexible structures; it might be self-describing. Or there's no structure at all: it's an image, it's a media file, it's a Word document. So we have these different types of data, and based on the type of data we're going to want a different type of service to actually manage it. So let's start off with structured, and this will be the most familiar, I guess, when we think about databases. Structured is one of the types of data, and even within structured there's some flexibility, but the idea of structured is that there's some kind of fixed schema I'm defining. Hey, there are entities, there are different types of entity: a person, an office, items for sale, and then there are different attributes those entities have.
So as a human being I have certain attributes, maybe a first name, a last name, a date of birth, an address, a phone number. A building maybe has a GPS coordinate, whatever it is, but there are different attributes for the different types of entity. So we have a schema, which is the definition of what the attributes are for each of the types of entity we may have. So we have a very formatted, fixed schema for each of the types. If I have a well-defined structure to my data, I can normalize it. Normalize means, hey, I'm going to maybe split my data into those different tables and reorganize it so I remove wasted space and I'm not duplicating things. It helps me ensure my data integrity and makes the data usable through standard types of interactions. So if I think about structured for a second, there are different ways I can think about this, but we'll start with the idea of a relational database. So I'm going to have a few different things here. My whole point here is that it has that schema: I have well-defined attributes for the different types of entities, and I'm going to normalize my data. I don't want to have mass duplication. For example, I might have the idea of my company and the people who work at a certain office. For every person, I don't want to list the office address. It would make far more sense to split out people and offices and then just reference the office from the separate table the people are in, so I'm not duplicating data. It's more efficient in storage, it's going to be easier to interact with, and I'm going to have better data integrity, as I have that strict type requirement for the different types of data. Now, when I have a relational database, I'm organizing it into tables. So think about the idea that we create a table and a table has a specific schema. What we have in a table is columns, and I can really think of a column as a particular attribute of this entity. And then within the table we have multiple rows, or records: individual entities of this particular type. This is really the key point when I think about this: the schema defines what those columns, those attributes, are. Each type of entity would have its own table, and that gives me a standard way to interact. The columns are the properties, the rows are the entities. So I might have a first name column, a last name column, a date of birth column, an office ID I work in, which would then correspond to another table. Now, when I think about this table and I have these entities, if I think about office ID for a second, well, I want to have some ID that I reference. So very often one of these columns will be the key. Maybe there's an employee ID, so the first column might be employee ID, then first name, last name, date of birth. Then maybe there's another table with job history, and it would have the jobs done, linking back to the employee ID. So I'm not duplicating the employee information, I'm normalizing it: I'm only storing the pertinent information and then referencing other tables where that's more useful. That's really the core part. And if I think about a typical relational database and I have these rows, think of it as that's the way we actually store it on the disk. So I can really think, OK, I'm going to store it as, hey, employee ID 01, and maybe it's first name Clark, last name Kent, and then date of birth, etc. So this is how we are storing it: as rows.
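As a rough sketch of the tables and keys just described (a sketch only; the table and column names here are hypothetical, not from the video), the normalized employee and office idea might look like this in SQL:

-- Hypothetical sketch of the normalized schema described above.
CREATE TABLE Office (
    OfficeId   INT          PRIMARY KEY,   -- key that other tables can reference
    City       VARCHAR(50),
    PostalCode VARCHAR(10)
);

CREATE TABLE Employee (
    EmployeeId  INT          PRIMARY KEY,  -- uniquely identifies each row
    FirstName   VARCHAR(50),
    LastName    VARCHAR(50),
    DateOfBirth DATE,
    OfficeId    INT,
    -- Foreign key: point at the office row instead of duplicating its address on every person.
    CONSTRAINT FK_Employee_Office FOREIGN KEY (OfficeId) REFERENCES Office (OfficeId)
);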
So a row store keeps the data that way, row after row; there are many, many other rows, but that's how it's laid out on the disk. When we interact with it, the most useful way to interact is often where we care about the entities; we want to interact with the entities like this. Now, there's another type: I can also think about columnar. People say it different ways, and it might look the same; there are still rows and columns. But it is stored as columns. So the underlying storage is actually storing, hey, Clark, Bruce, Diana, and then, OK, the next column is Kent, Wayne, and so on; you get the idea. Parquet files work this way: they store the data as columns. And why might I want this? It depends on how I'm interacting with it. If most of my interactions really care about summarizing particular attributes, well, this is really useful, because I can just read in the column and sum it or average it if these were sales numbers. So there are benefits depending on how I intend to interact with the data. If I want to report on and act on columns, really, that's what I care about, then storing it this way may actually be more useful to me. So that's why we have these different types available to us. So that's the traditional, very structured data, and we're kind of used to that idea in databases. Now I could also think about, let's use a different color, semi-structured. With semi-structured, there's either a loose schema or no schema. It might be considered self-describing. This could be a JSON file, that's a very common example of this; it could be an XML file, but it could even be something like a CSV or a tab-separated file, something delimited by a certain character. It can really be any format where there's flexibility and the data describes itself. We commonly call these documents, and the key attribute is that they are in some way self-describing. Now, they will very commonly still have a key; there is some key we still have, to be able to identify records. I always want that. Now, that key may also be used as, or feed into, a partition key. If you think about it, as we get bigger and bigger I can't just store stuff as one big file. I might have to separate it into fragments, which we call shards. So I want to be able to shard my data; I want to be able to distribute the data, maybe over different files, maybe over different database servers, so they each host a portion of the complete database. So I want to know, well, how do I know when to shard it and which shard the data goes into? The partition key helps me do that distribution. The partition key should give me an even distribution, and that even distribution is key. I wouldn't want to pick a partition key that was uneven: I don't want 90% of my data going into one partition while the others have 3% and 2%; I'd get terrible performance. So it's really important we pick a partition key that's going to give us a very good distribution of our data. Sometimes we have to create a synthetic partition key: I can't just base it on some raw value, I have to maybe do a combination of values or do other things to get something that's really good, to get that even distribution. That's what's used to shard, separate out, and distribute the data, and that sharding will go over logical partitions.
So it's going to split into different logical groupings of data that then, fundamentally, will be separated over physical partitions, i.e. different files, different database servers. But the key part here is there is no fixed schema. If we quickly go and look at a file for a second, here I can see, hey, this is people. I have a particular instance of a person, so they're in the curly braces, and in this instance it's Bruce Wayne. And I can see it's self-describing: it's telling me the attributes. It has powers, and it has one power, money. But then a different person, Clark Kent, actually has a different attribute: he also has pets, and an array of powers. He has lots of powers, Superman has everything, but it's different data available to me. There's no fixed schema; it can really include anything I want, and here I'm using those square brackets to indicate an array, so I can have multiple values for that one particular attribute. So that's what this is doing, and we're going to come back to these, but just think: fixed, very defined schema, hey, that's structured, a relational database; semi-structured, a JSON or XML file, could be a CSV, a tab-separated file, could be space-separated, there are different formats for that. Then we have the idea of unstructured. There's nothing structured around this at all; we'll go with orange for this one, so I can break it off now and say unstructured. It's just bits on a disk. Yes, it's a file: it could be a document, it could be a picture, it could be a video. It could be a blob, a binary large object. And there are different services around this, but very commonly what we're going to think about in an Azure world here is blob, these binary large objects. There are different types, but within there I can store anything: I might store images, I might store video, I might store a .doc file, I could store Excel spreadsheets, PDFs, you name it. I can store anything I want in it; that's really the point of that. And there are other types of data. There are types of data where, for example, the data describes relationships between things. For example, I could think about graph. Graph is all about the idea that I have relationships I want to be able to express. Think about social media, think about Facebook: so-and-so is a friend of so-and-so, so-and-so follows this person. Think about the people idea again, so I could have the idea of, OK, we have these different types of nodes, and nodes are entities. One node could be of type person: this is John. There might be another person, another node, Julie. There could be another node, but this time it's an office, and that could be Dallas. And then there's the idea of relationships, an edge. So I could think of, well, John has a relationship, this is an edge, of type works at, or this edge is works for. Julie also works at. So it's very good for actually going and working out what the relationships are between the different types of entity I have in my environment. Now, there are others as well. Let me do a different color, you run out of colors pretty quickly here. I have the idea of something like key-value. And it's funny, sometimes you look at key-value and it looks like a database. The whole point of key-value is there is a key.
And then you have some name and a value. So the key could be, hey, ID 1, and then it's first name equals Clark, last name equals Kent. So it's just these key-value pairs. But if you looked at it, you might organize them so it looks like a table, yet every single entity can have completely different key-value pairs in there; there's really not any type of strict relationship. Azure Storage tables are a good example of that. And there are others. You might hear, for example, about time series, and the name kind of says it: hey, there's something that's really organized around time. So there's a time and then there's some data. Maybe it's telemetry coming in, maybe it's logging coming in, but hey, it's all focused around organizing it by that time. Now, if we were to pivot and think about, well, OK, what are some of the types of storage and service we actually have in Azure around those different data types? Realize, if we start with unstructured, there are different ways we want to interact with the storage. Typically here it's going to be that unstructured or semi-structured data; if it's structured, it's probably in a database service, although there are some exceptions to that. We talked about JSON already; we have that set format around the data. We talked about delimited files, tab or space separated, where there's some record, some entity, on each line. XML used to be very, very popular. But if it's unstructured or even semi-structured, I just want to store it somewhere; I want some service. And the core one we really think about is an Azure storage account. So I have the idea of a storage account, and I think of blob, for example. Now, out of that storage account there are many different services that we can actually leverage. The storage account exists in a specific region: I create it, it lives within a subscription within a region, but it offers many, many different types of service. And blob is really the first one. A blob is just, hey, a bunch of ones and zeros. Now, there are different types of blob. A very common one is block. A block blob is just made up of a sequence of blocks; I can have up to 50,000 blocks, and the blocks can be variable size, and that gives me a massive maximum file size, something like 190 tebibytes. Then I have things like a page blob. A page blob is very good for random read-write access; they're 512-byte pages. Disks in Azure sit on top of page blobs behind the scenes. I can fetch single pages at a time, which makes it really powerful for that random interaction. Then there's append: I can't modify or delete blocks within the file, but I can add to the end of it, so think logging, it would be fantastic for that. And then there are services that almost sit on top of that. So for example, we have block blobs, and sitting on top of block blobs is another Azure service, Azure Data Lake Storage Gen2, and we're going to talk about data lakes a lot more. That adds things like POSIX-style ACLs and a true hierarchy, and it has a different API I can interact with. But when I think about all of the different types of files I might want to store in Azure, very, very commonly I think about block blobs and that Azure Data Lake Storage Gen2. So if it's that unstructured data, I just want to store it somewhere. Then if I think about that semi-structured data, again that could be JSON, XML, comma-separated values, tab-separated values, and it could also be some of those database-oriented files in that columnar format.
For example, there are things like Avro, there's Parquet which I mentioned already, and there are things like ORC. These are, in a way, designed to be read by certain types of database engines, but they structure the data so it can be interacted with efficiently, and very commonly I'll put those on a data lake. When I think about JSON, XML and CSV, great, I can read those as a human being, but they're not that optimal for machines to read, whereas things like Avro, Parquet and ORC are really designed for machines to read and leverage. Now realize there are other types of service I may want. Sure, blob is one of them, but maybe I want to operate over a file-based protocol, and I want certain services as part of that. So blob is one service I get from a storage account; I can also have files. Azure Files lets me have SMB 2.1 and 3 and NFS shares. So now I can go and interact with it using those regular, standard protocols. This could be a lift and shift, it could be a hybrid solution, some born-in-the-cloud server; I need some shared storage area that I want to talk to using a file-based protocol, SMB or NFS, that I'm used to. I might even have a scenario where, hey, in Azure I create an Azure file share, an SMB file share, but maybe I also still have file shares on premises running on Windows file servers, so these are all SMB. And what I can do is synchronize with that cloud-based endpoint, that cloud file share, and I can do tiering: hey, data I'm not using, just store it in the Azure file share. That solution is called Azure File Sync. So I create a sync group containing my on-prem Windows-based SMB file servers and a single Azure file share, and it will synchronize between them via that share. Hey, data that's not used a lot, don't store it down here on-prem, just keep it up there, but pull it down on demand if someone tries to access it. So I'm going to optimize the space I need to use. There's also table: I mentioned key-value, and there's a table service. So there's blob, files, table which is key-value, and there's also queue. I think of queue as this idea of, hey, a message: I want to write the message to the queue and then something pulls it out. Now, we say first in, first out, but it's not a guaranteed first in, first out. It typically will be, but there could be exceptions. If I need a guaranteed first-in-first-out service, there are things like Service Bus queues that have message sessions to actually guarantee that. Now, think about resiliency for my data and think about this storage account. From a resiliency perspective, there are always, always a minimum of three copies of my data. What differs is, well, how is that data actually stored? I have different resiliency options for my storage account. So if we think about a region for a second, we always think about a region as that two-millisecond latency envelope. So I have one region, my primary region, for my storage account. And actually, one of the things that gets exposed is the idea of availability zones. Availability zones are groups of data centers that have independent power, cooling and communications. That idea of AZs is going to come up again and again and again. Each subscription will see three availability zones, so it could be an AZ 1, an AZ 2 and an AZ 3. Many regions also have a paired region; not all do, some of the newer regions like Qatar Central do not have a paired region. And that paired region might also have multiple AZs,
or in this case just a set of data centers. So what's happening here is I have different redundancy options for my data (and sometimes you forget how to spell very simple words when you're trying to draw on a board and talk at the same time). So, different redundancy options. There's LRS, locally redundant storage: three copies in a particular data center. So with LRS, my three copies of my data would be kind of scattered within a certain data center. Then there's ZRS, zone-redundant storage: there are still three copies, but they're spread over the three AZs, so now my three copies are here, here and here. Then I have GRS, geo-redundant storage: three copies in a data center and then, asynchronously, another three in a data center in the paired region. So I might have the three here, and then it's async. It's always asynchronous, because there's a big distance; if it were synchronous, the performance would be terrible. There are hundreds of miles between the regions, so I can't synchronously replicate. Synchronous means that before I send the commit back to the app making the write, I make sure it's been committed to the disk. With the pair, those hundreds of miles would mean tens and tens of milliseconds of latency, which doesn't seem a lot, but to a computer doing transactions that's huge, so it's asynchronous: we send it as quickly as we can, but we commit back to that calling app as soon as we commit it to our primary, and then, as we can, we send the three copies to that paired region. We do synchronous within the region, because, hey, those copies are close together. Then there are combinations. We have things like GZRS: three copies over the AZs and then three in the paired region. So now I have my three copies spread over the zones here, but the three in the other, paired region are not spread over AZs, they're just in one set of data centers. For some of these there's an RA option: read-access GRS, read-access GZRS. What the RA option, which is optional, lets me do is get read access on the paired region; it works for blob, queue and table, but it doesn't work for files, and that's kind of an important point. So for blob, queue and table I can get read access on that paired region; I can't write to it, there's only one primary, but I could read from it. So if there was certain analysis I wanted to do, or a certain app design that was mostly reads, hey, I could read from the pair; that could be a good pattern to use. There are also performance differences available. When I think about performance for Azure storage accounts, there are actually two types of account. Firstly there's the idea of a premium storage account; that's just premium, and I have premium block blob, page blob and files, for high performance and lower latency. And then there's the idea of just a standard storage account. But within standard I have tiers: I have the idea of a hot tier, the idea of a cool tier, and then the idea of an archive tier. And as you can imagine, an important thing is that archive is offline. If I move something to archive, I can't directly read it; I have to move it back into hot or cool before I can actually read it. Now, for the storage capacity, premium is the most expensive, then hot, then cool, then archive. But transaction costs are the opposite: very, very low, sometimes free, at the premium end, and transaction costs go up and up as I move down the tiers. So I pay more for the capacity but less for the interactions with it.
That's one of the key points around it. So if I think dollars for the capacity, i.e. the cost of the amount of data I'm storing, that goes up as I move up the tiers, but the transaction costs go down. So capacity gets more expensive as I go up, and the transaction costs go down as I move up through them, and we can see that. If we quickly jump over and look at the pricing for a second, here, if I look at the storage pricing, what I can see is, well, hey, look: if I look firstly at the capacity, I can see for premium it's $0.15 per GB, then for hot it's 1.8 cents per GB, then it's just one cent for cool, and then for archive it's a tiny fraction of a cent. So I pay less for the capacity as I move down, but then for the transactions, the operations, premium is cheaper than hot; and if we compare, premium is quite a lot cheaper than hot, which is cheaper than cool, and if you look at the read operations, archive is way higher because it has to rehydrate and actually bring that data back into cool or hot. So those are the attributes around those tiers. I also get lower latency with premium, a great benefit, but I pay more for the storage and less for the interactions. So that's one of the key points. Now, within this standard storage account, one of the useful things I can actually do is lifecycle management. So I can enable lifecycle management, and what that lets me do is, hey, look, if a file has not been modified or accessed for 30 days, let's move it to cool. If it's then not been accessed or modified for 90 days, hey, let's move it to archive. Hey, it's not been accessed for seven years, let's just delete it. So it helps me maintain all of that data. Now, when I think about these tiers, they only apply to block blobs, not append or page, so there are limitations on where I can use them. Archive is really useful when, hey, I just have some huge amount of data I need to store but I don't need immediate access to; I can wait hours, when I make that request, for it to be brought back into one of the other tiers. So that's really the key point around those. OK, so that's unstructured and typical semi-structured data: storage accounts are a great feature there. And again, the thing we're really focused on a lot is this idea of a data lake, this ADLS Gen2, which sits on top of block blobs but is still part of an overall storage account. So let's pivot. We talked about, hey, that unstructured and semi-structured data in a data lake, in a storage account. What about the structured options? What are the offerings there? These are the more advanced data services we'll actually go and interact with. Now, when I think about databases, there are actually two key types, because we interact with databases for different purposes. I can think about, well, I have my line-of-business application, and what I actually want is to do transactional processing: someone buying something, hey, I need to log that order. So I can think of OLTP; this is all about online transactional processing. Very commonly, for example, my line-of-business app is going to interact with an OLTP-type database. What characterizes this is a higher volume of transactions, lots and lots of transactions, but they're pretty small. So a high volume of small transactions. I want fast access when I'm querying, when I'm doing some update or read; I want very, very fast access. And this database is going to be normalized.
Remember, the whole point here is I'm removing duplication: I'm separating out the data into tables specific to the type of entity and referencing things between them. So I'm going to normalize my data; we split it into small, very well-defined tables. There's a huge, huge focus on this idea of ACID. So: atomic. Every transaction is a single unit that either completely succeeds or fails. For example, if someone is buying something, well, there's putting in the order, and then there's maybe subtracting the cost of that new order from their balance, and those are two different parts of the transaction. I can't have one succeed and the other fail; they both need to succeed or they both have to fail. Then I think about consistency: keeping the database in a valid state. Isolation: transactions, if they're happening concurrently, aren't going to interfere with each other, say if someone's buying two things at the same time. Durability: if I commit a transaction, it will remain committed even if there's some interruption; once I tell the app this is committed, it's committed. There are safeguards in place, be it logging or other features, that make sure once the app has been told by the database it's committed, it stays committed. So this is what I'm using with my line-of-business app. But there are other types of things we need to do: I want to run analytics, I want to run reporting. So then we have the idea of OLAP, an online analytical processing database. This is typically a large data volume, lots and lots of data. My interactions are typically looking at historical data, and it's mainly read. Now, obviously there are some writes going in to get the data in there, but it's mainly read-only. Think of an example like a data warehouse. Now, what's interesting is this is obviously different, because my focus here is, hey, I want to be able to do this analysis. One of the things we might see in a data warehouse is that we denormalize the data, because I care about the performance of running these big analytical queries against it; fundamentally my goal for this is analytics. So sometimes we will combine tables back together, we may introduce duplication of data, because it will make the analytics we want to run against it that much more efficient. So we care about different things. I want to capture raw data for insights, and with the data warehouse I have to get the data ingested, maybe transformed, into it. We'll commonly hear about things like data lakes and data warehouses together. Now, when I think of these services: for OLTP we think about SQL Server, I might think about Postgres, I think about MySQL, I think about things like MariaDB. For the data warehouse, I think about things like Synapse. So there are different solutions depending on what we're really trying to do. Now, before I go any further, there are different people that interact with the data we're talking about here. So let's just, for a quick second, think about who the actors involved in all of this are. Who is doing this stuff? An obvious one is the database admin, and this is really the first person we often think about. So I have the idea of the database admin, the DBA. I can think about the DBA being responsible really for the design, the implementation and the maintenance of the database. They're looking at it, they're tuning it. They create the tables. They say, hey, a new index is probably a good idea to add here.
They're doing updates of the database, minor and major updates. They're looking after the database; they have all the operational and maintenance responsibilities. Now, they don't work in isolation. The database admins are experts in databases. They'll work with the app teams to understand the app requirements so they can correctly design the database: the schemas, the tables, the relationships between them. What's the performance requirement? What's the HA, the high availability? What's the disaster recovery, does it have to fail over somewhere else? What's the security? All of those things. They will understand the requirements from the app teams to design the database; that's their goal. But then we have things like, well, what exactly is the data workload? Because the database is one part of maybe a flow of the data, and someone has to understand what that flow of data is. So then we have the idea of a data engineer. Their goal is to understand the data workload. They have to understand: where's the data coming from? How is it emitted? How can I read it in? Where do I need to store it? What do I need to do to it, do I need to transform it? What types of interactions are required? Where does it actually need to get stored? So think about the idea of a pipeline that's going to bring the data in, maybe store it, maybe modify it, transform it, and then store it in places. The data engineers are going to work all of that out: how do we ingest it, what do we do with it, where do we put it, do I have to clean it, all of those things. What are the network considerations, what are the security considerations? They're going to design this, so they design the data pipeline: how do I get it, what do I do to it, where do I put it? The data engineer enables that by understanding the data workload. And then, once we have the data somewhere, generally we don't care about the data itself, we care about what insights we can learn from the data. So someone has to work out what the relationships within the data are, someone has to create models to actually let us do analysis on the data. So then we have the data analyst. They do things like explore the data: they will go and look at the data, they will find relationships in the data, and they will build models, models that the business can then use to get insight. So they're going to expose insight from the data. So there are these three key roles that light up data. Someone has to maintain and design the databases, especially for the online transactional processing, the line-of-business apps. Someone else thinks about the complete lineage of the data, getting it from somewhere, doing something to it, making it usable, then storing it in other places: hey, the data engineer has to design those pipelines. And then, once we have data in place, how do I get useful insight that might change the behavior of my organization? Well, those are the data analysts. We have all these different roles to make sure everything we get is useful and we get the most out of our data. So that really is the key point about all of those different things we're actually going to do. Now let's go back and start drilling into some of the specific types of service for a second. I referenced this already: we had this idea of a table. If I think about a relational database, SQL Server, Postgres, MySQL, MariaDB, I talked about the idea of a primary key, which uniquely identifies a particular entity.
No two rows can have the same key. So hey, ID 1, ID 2; again, it could be the employee ID, for example. But one of the useful things when we normalize data is that I want relationships between tables. So we have this idea that, hey, there could be another table over here which has its own schema. Remember, it has its own entities which have their own attributes, and it has its own key. Imagine these were offices, remember? So one of the things I might very commonly do is, over here, reference maybe the office they work at: a foreign key. It's referencing a key in another table, so it could be office 1, office 1, office 1; there's a record called office 1 and these are referencing it. So we have this idea of a very strict schema that defines those, and we can look at it. So if we jump over here for a second, let me close those down quickly and go and look at one of our databases. What we can see, through the query editor (I'm going to authenticate using Azure AD), is that I can go and look at the various tables I have. If I look, for example, at the roster, I can see the schema. I have four attributes defined: first name, last name, code name, mother name. I was lazy, they're all varchar, but this is the strict schema for this particular table, so that's the data I can have. And then there's the schema of a different table: just an item and an owner. Now I can go and look at the data. So here, again using this nice web interface, I'm selecting the top 1000 entries and I can see those values. It's adhering to the schema: Bruce Wayne, Clark Kent, Diana, Barry. And if the mother name looks weird, it's because it's using Always Encrypted. It's encrypted at the client side, and it's using deterministic encryption, so the same source string will always end up with the same encrypted value. Both Bruce's and Clark's mothers are Martha, so the encrypted value that comes out of the algorithm is the same. This is better than using randomized encryption in terms of indexing and the ability to do server-side operations, but it's not as strong as randomized for the encryption. But this shows, hey, we have the tables, we have them here in the database, and I can interact with them in different ways. So let's think about this for a second. We have the SQL ways to interact, this Transact-SQL, which has some enhancements on top of standard SQL. There are different languages we use when we interact with databases, and the DBAs, the database administrators, will commonly leverage these, but other people might use subsets of them when they actually want to go and view or modify the data. So there are different ways to interact. If we think about the languages for a second, there's the idea, down here, of a data definition language, the DDL. The data definition language is all about things like, hey, I want to create, I want to alter, I want to drop, I want to rename, and I'm doing that on tables. Maybe I'm also creating a stored procedure. A stored procedure is a set of operations that I can then just execute whenever I want; it saves me having to type them all in, and I can pass parameters. Maybe it's views. A view, as the name suggests, is where I can run a select statement against one or more tables and specify particular columns I want to show, so it gives me a unique view, but it's not storing the data.
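To make that a little more concrete, here is a rough DDL sketch against the demo tables shown earlier (justice.roster and justice.assets appear in the video; the exact column identifiers, the object names I'm creating, and the 'Batman' value are my assumptions):

-- DDL sketch: a view and a stored procedure over the demo tables.
-- Run each CREATE as its own batch (hence the GO separators).

-- A view exposing just a couple of columns from the roster table:
CREATE VIEW justice.RosterNames
AS
SELECT FirstName, LastName
FROM justice.roster;
GO

-- A stored procedure that wraps a query and accepts a parameter:
CREATE PROCEDURE justice.GetAssetsForOwner
    @Owner VARCHAR(50)
AS
BEGIN
    SELECT Item
    FROM justice.assets
    WHERE Owner = @Owner;
END;
GO

-- Then I can just execute it whenever I want, passing the parameter:
EXEC justice.GetAssetsForOwner @Owner = 'Batman';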
A view, again, is just a way to see data in a different set of attributes and combinations than how it's actually stored in the database. So the database admins will use the DDL to create the tables, alter them, drop them, and so on. Then there's also the data control language, the DCL. This is about permissions: I'm granting a permission, I'm denying a permission, I'm revoking a permission I gave you already. So this is all about modifying permissions; again, the database administrator would be doing this. And then I might think about the data manipulation language, the DML. Now I'm interacting with the table and doing things to it: I'm inserting new rows, I'm updating a row, I'm deleting a row. And a subset of this is actually the data query language, which is where I'm doing a select, and there are many other things I can do there. When I think about the structure of what these are doing: hey, maybe it's an insert into a certain table name, just to give you an example, and then I tell it, hey, I'm inserting into this column and this column, naming them, and then I tell it the values, Clark, etcetera. Or I might say select from a certain table, and then I have conditions, hey, where such-and-such, and I could order things. It might actually be useful just to go and see this quickly. So here, in this example right here, what I actually did was select the top 1000 from justice, which is the schema, and then roster, which is the table. But I could take out the top 1000 and just say give me everything, so I just get all the records; there are the same records we had right there. But maybe I want to change how it's presented to me, i.e. how it's ordered, so I can add different things to this. For example, I might say, hey, order by last name, and now you can see it's ordered the records: now Allen is first, then Kent, Prince and Wayne. Now, I could also think about the data I have in other tables. In assets I have items based on the owner, and realize that the owner matches the code name from the roster table. So that could be what I join on: where that owner equals the code name. That would be a way I could have a join between assets and roster; I could almost think of it as that foreign key. So now what we'll do is a new query, and I'm going to say, hey, look, select. Now, R here is just a reference to the roster table, which we're going to set up: from the roster I want last name, and from my assets I want item. Oh, and I'm going to get this from the roster table, so from roster as, now, I shouldn't have pushed tab there, I just want that as R, which, remember, is going to be the symbol for roster, and I'm going to join it with my assets table, and I want to do that as A. So again, those letters are just representations of the full table names. Then I'm joining it on: well, the roster code name needs to equal the assets owner; that's how it knows they match. So sure enough, now I get, hey, last name and the item: I did a join on those two different tables. So that's really just showing how I can do that through very, very simple things within the language. And once again, R just means the roster and A just means the assets table; that's all these are doing within this query. It's not going into huge detail, but this is very much the basic structure, and then, hey, what am I actually joining them on?
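Written out, those statement shapes look roughly like this (the schema, table and column names follow what the demo showed, but the exact casing and the inserted values are my assumptions):

-- DML: insert a new row, naming the columns and then the values.
INSERT INTO justice.roster (FirstName, LastName, CodeName)
VALUES ('Clark', 'Kent', 'Superman');

-- DQL: all rows from the roster, ordered by last name.
SELECT *
FROM justice.roster
ORDER BY LastName;

-- The join from the demo: match each asset's owner to the hero's code name.
SELECT r.LastName, a.Item
FROM justice.roster AS r
JOIN justice.assets AS a
    ON r.CodeName = a.Owner;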
It would be quite taxing, if I had a lot of data, to have to go and look at every single row. So one of the things you will very often do is add indexes; we create the indexes, and we can add additional ones. It doesn't have to just be the key; I can add indexes on things that, hey, maybe I query a lot, to improve the overall performance of the system. I talked about views already, which are just virtual tables based on some select statement; it's not storing the data again, it just gives me a way to interact. And then there are stored procedures: hey, I have a set of commands I might want to run periodically, and I can just exec it, I can pass parameters, I have a lot of flexibility with that. I mentioned over here some examples of the different database types, so one of the things you are going to want to know is some additional detail on the offerings, because even within, say, SQL there are actually different SKUs, different sets of capabilities, so it's important to have a pretty good idea of what those options are. So we start with Azure SQL Database. If I think about my Azure SQL Database, the key thing here is that this is fully managed. I say, hey, I want a new Azure SQL database, make it so. Now, there are different SKUs available to me for this, different service tiers. There's a general purpose tier, also called standard, and the reason there are two names is that there are actually different ways to purchase and bundle the compute, storage and IO capabilities of the databases. There's a blended model called DTU, but I can also buy by vCores and have the storage separate, and we'll talk about that. There's also business critical, also called premium, and then there's hyperscale. And as you can imagine, with general purpose I have a database server offering my database, and the storage, the data and logs, is separate. What happens is, if that server fails, another instance takes over my data and logs and starts offering my database again. With business critical, or premium, I actually have something like an Always On ring of servers that replicate to different copies of the data and log, so if one fails another can take over much, much quicker, because there are multiple database engines. Obviously I pay more. There are zone-redundant offerings where those servers, the takeover ones and the gateways that give me the initial entry, are spread over different availability zones, so we get resiliency from any particular set of data center failures. Within these, there's also the idea of a single database, where when I create it and give it the resources, just that one database gets all of the resource; or I can do something called an elastic pool, where I create multiple databases within that elastic pool that share the resource. Where that would be useful is if I had databases that are busy at different times: hey, when one database is busy, it can use more of the resource in the pool, and when it's quieter, another one that needs more resource can use it. So if I can balance multiple databases that peak at different times, that can be a really, really useful option. And there's even a serverless offering; that's only for general purpose on the vCore model, but that just auto-scales based on the load. So there's also, just for general purpose, a serverless offering available to me.
Hyperscale is designed for massive performance and massive capacity. It uses separate page servers, multiple page servers, to actually interact with the stored data, and then there's a primary compute node that takes in the requests, distributes them over the page servers to gather the results, aggregates them together and passes them back to the requester. So there are many different offerings based on exactly what my requirements are. If we quickly look at the documentation, it shows the structure. Here I can see that with Standard / General Purpose there's a control ring for the initial entry point that the app talks to, and then I just have a primary replica, which is separate from its data and log files. If that primary failed, there are spares that could connect to the data and log files and offer my service, but it has to recognize, hey, there's a failure, and then fail over. Compare that to Business Critical, where there are multiple replicas: there's a primary replica and then secondary replicas, each of them has its own data and logs, and because it's in an Always On availability group, if there is a failure it's going to pick up very quickly. And just like General Purpose, I can spread those over availability zones. That's very different from Hyperscale: with Hyperscale there are these completely separate page servers, each with a portion of the data, that serve it up when I run queries through the primary. And we can see the options if I try and create a database. If I go to configure, right now the default is General Purpose and it's using the vCore model, with the data size separate. With vCore I have General Purpose, Hyperscale and Business Critical, and then there are the DTU options, so I can separate those things out. I can change the vCores, change the data size, change the hardware configuration, and I have that serverless option because it's General Purpose; if I pick anything else, I don't have that serverless offering anymore. On the DTU side, again, that's the older, more blended set of performance based around the IO, the compute and the storage, and there are different SKUs available for it. But most people now will use vCore: I can specify the exact hardware configuration and the number of cores I want, so I have a lot of flexibility in what I want to do. Now, one other element: we have these SQL databases, but they are just database instances. What about networking? What about auditing? What about logins? What happens is these databases actually sit under a logical server. It's not a real physical thing that physically hosts the databases, but it's the server that has those other attributes. So if we go and look at the server instead of just the database, notice the database is missing some of those configuration items, but if we select the server itself, now I see different options: things around networking, some of the security elements, the identity, the auditing. So from a structure perspective, hey, I have this Azure SQL database, but from a logical perspective it lives under a logical database server, and it's on that database server where I can configure network rules, I can configure things like the firewall.
I can configure the auditing and the logins and all of those various things. Now, something that's going to appear in a lot of these different services, and I should mention it now as I alluded to it earlier on: when I think about Azure SQL Database, by default it's a public endpoint. Now I can use the firewall to lock it down, but it's a public endpoint. If I don't want that, this is where private endpoints come in. Some services integrate into your virtual network, like SQL MI, which we're about to talk about. But if you have the idea of, hey, I have my Azure SQL database and by default it is a public endpoint, and I don't like the idea of just locking that down, I don't trust it: there are things I can do with service endpoints to make certain subnets known and only allow access from a certain subnet within a certain vnet; a service endpoint would let me do that. But I can also create a private endpoint, which points to a particular instance, hey, this is SQL01. If I wanted to, I could then disable the public endpoint, and with a private endpoint any network that's connected to that vnet can use it as well. I do need DNS: there's going to be a special DNS name, a Private Link DNS zone, they have to be able to resolve. But then any network, be it on prem or other vnets, could go via this private endpoint, whereas a service endpoint, which I enable for the whole Azure SQL service on a particular subnet, only works for things in that subnet. So private endpoints are very, very powerful, and we see them used more and more these days. OK, so Azure SQL Database is one of the SQL offerings that we see. The other very big one is Azure SQL MI, managed instance (trying to get my spacing right here). The huge point about Azure SQL MI is it's near 100% compatible. If you were just running SQL Server in your own OS, there are many different aspects to SQL: there are agent jobs, there are different capabilities. Well, in Azure SQL Database some of those things are not available to you; it's a multi-tenant service, it's very controlled, I can't do everything I could do on a regular SQL Server, and there are tables in the documentation that go through the differences. Azure SQL MI deploys into your vnet, so we don't need a private endpoint because its target is an IP in my vnet. But it's still managed for you: it's still fully managed and updated, it just lives in your vnet and it has a much higher compatibility with regular SQL Server. Now, one of the things I can do with Azure SQL MI is deploy it via Azure Arc. Azure Arc brings certain Azure control plane operations to on premises, for server operating systems and for Kubernetes, and once it's managing Kubernetes, it can bring certain data services down on top of that as containers, as pods, on that managed Kubernetes. One of them is SQL MI; Postgres Hyperscale was another. So we can actually bring Azure data services to your Kubernetes environment, be it on prem or even another cloud. So if I need near 100% compatibility with my on-prem SQL Server, Azure SQL MI is probably going to be the solution. Again, it's still automatically managed and maintained, it just has better compatibility, and I can have multiple databases in a single Azure SQL MI. Now, the other option is, well, I can have SQL as IaaS, infrastructure as a service, in an Azure VM.
I can just install SQL Server in a VM, but even then there are benefits: there's an agent, the SQL IaaS Agent extension, and it can run in a lightweight mode or run full, and this brings management capabilities even though it's just regular SQL in an IaaS VM. Here we can see the different features, and it depends on the mode: portal management, hey, I get that with lightweight and full; automated backup, that's full only; automated patching, full only; Key Vault integration, full only; but things like the licensing and some of the flexible version changes I get with lightweight as well. You can go through and see where I do and don't get each feature, and again this is all about lightweight versus full mode. Really, in lightweight the extension is available but it doesn't enable the full agent inside the operating system, whereas with full mode that full agent is deployed and running inside the operating system. Let me just find the part in the document that goes through the difference. Here we go, management mode: so lightweight, hey, we've got the extension but it does not install the agent, and then in full mode we actually install the full agent to give the full functionality. So we have those different options, and I can install SQL just in an IaaS VM and still get benefits from that. Now obviously, when I think about a VM, I can install anything in a VM as long as it's supported by the database vendor. I could put Oracle in a virtual machine, for example, if I needed an Oracle database; I just install it in an IaaS VM. But there obviously are many other types of database available, and you're not on your own: I don't have to just run them in an IaaS virtual machine, there's a whole set of "Azure Database for ..." offerings. One of them is Azure Database for PostgreSQL. One of the nice things about Postgres is it has fantastic compatibility with Oracle: there's something like 90% compatibility between the SQL of Oracle and Postgres. It's built on the community edition, which is something you're going to see in common for all of these Azure Database for offerings, but again, a great competitor to Oracle. There are different offerings for Postgres because it has evolved; they're changing the way they do this now. All of them automate the patching and the backups, but we have this idea of Single Server. This was the original offering, based around their own special containerization technology. It gave you a great SLA, 99.99%, even though it's a single instance, because if it failed it could switch over super quickly to another instance that was just sitting around ready. There were always three copies of the data, and compute and data were separated, but it didn't support things like availability zones. There was the ability to add read replicas in other regions. But what they've now introduced is Flexible Server. This is VM based, and that VM basis gives me more flexibility: I can do things like the burstable VMs, the B series; I can stop and start so I can optimize my spend; I can optionally have HA, so I can have a standby server it will automatically fail over to; and I get far more configuration, there are hundreds of attributes of my database I can now change. So I get much, much better flexibility that I couldn't get with Single Server.
There's PgBouncer, a broker to handle the connections coming in. I can have a custom maintenance window, and it does automatic minor updates. There's also a Hyperscale offering. Hyperscale uses the Citus extension, a standard extension for Postgres. Remember I talked about sharding? This does that sharding: it splits the data over multiple shards over multiple servers, and I do this through a distributed table, a table that is then sharded over the multiple servers. I can also have, for certain important data that I'm commonly comparing against, a reference table, which is copied to every node. There are actually two tiers of this. There's a Basic tier, which is a single server, one node, so both the coordinator and the worker are just one box; this is designed to get me started if I'm experimenting and want to keep my cost down. And then there's the Standard SKU, where I have one coordinator and two or more workers. And I can upgrade, so I could start off with Basic, say OK, I get it, and then switch over to Standard if I wanted to. So I have those capabilities: hey, I want a Postgres database, I can still have async replicas, but this takes away most of the management for me, it's provided by the service. So that's Azure Database for PostgreSQL. There's also Azure Database for MySQL, and once again with MySQL I have Single Server and Flexible Server. And then there's also MariaDB, Azure Database for MariaDB, and that is Single Server only. So we have these different offerings. MySQL is commonly used as part of the LAMP stack, that whole Linux, Apache, MySQL and PHP combination; hey, I can get a managed MySQL as part of that. MariaDB is also cited for good Oracle compatibility, and again these are built on the community editions. And just like all of those, there are other things: there's, for example, an Azure Managed Instance for Apache Cassandra. This is an automated deployment with automated scaling into my virtual network, and it can be part of your existing Cassandra ring in a hybrid configuration. So we do have these different options available to us. Now, another big offering. All of this was based around relational databases, but we also had that idea of semi-structured data: documents, JSON, XML. All of the databases we've talked about so far were born somewhere else; they were born on prem, and then we brought them to the cloud and enhancements were made to make them better in the cloud. When I think about semi-structured, the key service here is Cosmos DB. Cosmos DB was born in the cloud, and what that means is it was actually designed from the start to support multiple regions and, optionally, multi-region write. That is very difficult to do in a traditional relational database. If I want to write my application so it can accept writes in multiple regions, that's really hard to do: typically only one of the regions would be writable and I'd have to write my app around that. Maybe I can read from the async replica but write to the other one? Well, then where do I read from if I've just updated? There's a whole set of considerations around it. And why this is so tricky is that distance thing: if there are hundreds of miles between regions, remember synchronous replication? Say I make a write in one region.
If I want to wait for it to be synchronized over to the other region, to make sure that no matter where I'm reading from I get the latest information, I'd have to wait for that synchronization, get the commit acknowledgement back, and only then can I tell the app, yes, you're good to go. Now imagine I have lots of regions all around the world; performance would be terrible. So I really need some flexibility, I need different consistency models to say, look, what am I prepared to give up? Because I cannot have everything. Whenever there's distance involved, I cannot have guaranteed consistency, getting the same data no matter where I read from, without giving up some performance, because latency is always going to get introduced. We're used to the idea of the CAP theorem; CAP was all about the trade-offs between partition tolerance, availability and consistency. But in the cloud, when I do have multiple instances, there's PACELC, which builds on that: else, latency or consistency. Which one do I want? I can't have them both. And that's the whole point of Cosmos DB: it lets me pick. It has this whole idea of variable consistency. I pick, and there are really five levels: I can have strong all the way down to eventual, and as the name suggests, eventual means, well, we'll get there eventually. In the middle there's a very common one, session, which guarantees that within a certain session everyone sees the same thing when they read, and there are also midpoints between them. Now, strong guarantees that everyone, no matter where they are, always sees the same data, in which case multi-region write does not make any sense, because the writes would still have to go to the same copy and be synchronized everywhere. So multi-region write and strong consistency don't go together: if I want to guarantee everyone sees the same thing, it doesn't make sense to write in different places, so that combination goes away. But for all of the others you get this idea of, well, what is the consistency I actually want? There are some nice pictures in the docs that show the consistency levels using notes on a piano, going from stronger consistency to weaker consistency. With strong consistency, no matter where I read from, I'm always going to see the same data in the same order. And I can go all the way down to the weakest, which is eventual: depending on which replica I read from, I'll see different data in different orders; there's no guaranteed consistency. A very common one we use, though, is session consistency, and notice we can have sessions shared by different processes. Here East US 2 and West US 2 are sharing session A, so they are guaranteed to see the same data at the same time; maybe particular processes require that, whereas a different session could see completely different data. So that's really the goal: I set the consistency based on what my application is expecting. You pick what consistency you actually need. And it supports multiple APIs too, so when I think about different types of data and interactions, it supports multiple of those. If I think multi-API: well, hey, I might want to use the SQL (Core) API or the MongoDB API when I want to interact with documents, that semi-structured JSON for example.
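As a flavor of what querying documents through the Cosmos DB SQL (Core) API looks like, here is a minimal sketch; the item alias c is standard for this query language, but the property names (category, lastName, item) are hypothetical, not from the video.

-- Cosmos DB SQL API: query JSON documents in a container (c is the item alias)
SELECT c.id, c.lastName, c.item
FROM c
WHERE c.category = 'hero'
ORDER BY c.lastName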
But I can also use the Cassandra API when I have that more column-based data, I can use the Table API when it's just key-values, and I can use the Gremlin API when I want to use graph, remember when I had those nodes and the edges that were the relationships between them? All of those are supported with Cosmos DB. Now, one of the interesting things about Cosmos DB is how the performance is expressed and how I pay for it: it's with these things called request units (RUs), and different operations I perform cost a different number of request units. One of the challenges is we have this idea of a provisioned mode, where I say, hey, I want this many request units, and I pay for that many request units. The challenge is it's very hard to get that number right. If I make it too low, I'll get throttled because I've run out, and I'll get error codes back to say you've run out, you've been throttled; or I make it too high and I'm wasting money. So alongside provisioned, I can also do autoscale, and this really is the common one: I set a min and a max and it modifies the provisioned throughput based on the number of request units it actually needs. Now, the way this works is you pay a little bit more for autoscale if it's not multi-region write; let's look at the pricing quickly. Here, if I look at standard throughput, I pay a certain amount of money based on the request units, but notice there are separate multi-region write costs. If I go to autoscale, there's a 1.5x factor for single-region write. So if it's single-region write, I pay that 1.5 times; but for multi-region write it doesn't have that, it's the same cost. So for multi-region write, if you want autoscale you're just always going to use it, there's no downside. Even compared with provisioned, though: unless I can operate at 66% or greater efficiency, which is really, really hard to do in the real world, it makes more sense to use autoscale, it's going to be cheaper. So even with that 1.5 price factor, it's still cheaper most of the time, because it's really hard to get above 66% utilization of provisioned throughput unless you have a really steady workload and you've done a huge amount of work to estimate it; it's just super hard to do in any real way. There is also a serverless option, which has a smaller scale; I can stop and start it, but there's a maximum amount of data I can have with that, so it might suit a smaller kind of workload. And then I assign this throughput to a database, and within there I can create containers and tables and all of the other constructs. If we go back and think about the different database services for a second: we talked about the idea of OLTP, these are all relational databases, and then we have the semi-structured, the documents. What about the data warehouse side of things? There are also OLAP services, and here I can absolutely think of Azure Synapse Analytics (the whiteboard is starting to misbehave, it's getting busy). This actually tightly integrates with things like Azure Data Lake, that ADLS Gen 2. This service integrates very tightly with the data lake, but what it does is separate out its compute and its storage, so the compute can actually be paused and I stop paying for the compute side. And what this is going to let me do is incorporate many, many different elements; it brings together many different services that existed before.
But it used to be kind of confusing to work out how to use them and how to give them permissions to each other, so Synapse brings them all together. It has pipelines, so when I think about building that data flow, it has that, and it uses the same engine as Data Factory, which we're going to talk about. It has a SQL-based data warehouse, remember, that large store of lots and lots of historical data that we want to run analytics against. It has Apache Spark for various types of data manipulation, so I can think about data preparation, getting data ready for some other type of storage, cleaning it, mapping it; I can have things like extract-transform-load and extract-load-transform, which we're going to talk about; and I can do things like machine learning, all directly within the service. It has a Data Explorer, which gives me a data analytics solution for real-time querying of logs and telemetry using KQL. And of course it integrates very tightly with that data lake, and the whole point of the data lake is, hey, if I store data in there, be it CSV, Parquet, tab-separated, JSON, that can actually be read directly from the SQL pools to do further analysis on it. So it really is super, super powerful. Now, before I go any further, let's just talk about some of the basic tools I may actually leverage in the environment. If I think of the tooling: those different personas we talked about before, the database administrator, the engineer, the analyst, will use different sets of tools. If I focus for a second on the DBA, the database administrator, remember they are focused on the database, the tables, the stored procedures, the views. So if I'm SQL focused, we have SQL Server Management Studio. Now this is Windows only, but it's going to give me the deepest set of management capabilities, very, very deep management of SQL Server. I can interact directly with things like the Query Store, I can manage things like SQL Agent jobs, the high availability; I can really control and interact with every aspect of SQL Server. Then there's also Azure Data Studio. The focus here, well, it doesn't say SQL in the name; the focus is on data services in Azure. So instead of being Windows only, this is multi-platform, and I can interact with different types of database: yes, SQL, but also data warehouses, Postgres, all via a nice graphical interface. So it's visual, it covers multiple types of database, but it's really focused on interactions, on querying the data; it is not focused on the deep management aspects of a particular database. It has extensions, and one of the great things I can do with it is create dashboards with visualizations, so there are many different aspects to what I can do with Data Studio, and you might start to see it used by other roles when, hey, I want to make insights available to other groups within my company. Now, if I think about it from a business perspective and how they might want to interact with data, we have solutions like Power BI. Power BI is all about getting me insight, it's all about different types of visualization, and there are many aspects to Power BI. I could have dashboards.
I can have the various components within that dashboard coming from many different data sources: it could be a data lake, a data warehouse, Excel, other types of files. And if I think about that journey of the data we talked about before, where it gets ingested from somewhere, maybe there are transformations, it's stored somewhere, and then I want to analyze it, think of Power BI as being at the end of that chain. When I've gone through that data wrangling, I've mapped things, I've put them into different locations, Power BI is at the end of the chain to help different people get insight from the data. There's a Power BI Desktop offering that lets me create data visualizations, create data models, create reports, and then I can publish them to the Power BI service, which is a cloud service that can then be consumed by phone applications and through web browsers. So there is a web browser interface; it has some very limited data modeling and reporting, but it's not the same as the desktop tool. I can create models with Power BI: through analytical modeling, I can create relationships between different tables from different data sources via its native model capability. And if you think visualizations, I can really do all of those with Power BI, be it a table of text, a bar or column chart, line charts, pie charts, scatter plots, things on maps, hey, a geographical map showing maybe sales by region, for example. And I can combine all of those into a nice interactive report that can have filtering and values specified, all of it through that Power BI tool. Now, one other thing I want to talk about when I think about, hey, we have data all over the place, is this service (I'm going to draw it with my galaxy pen, I normally save the galaxy pen): Microsoft Purview. This is one of the newer offerings, and Purview is all around the idea that the data can be anywhere. That data could be in Azure data services, it could be in another cloud's data services, it could be in a certain SaaS app, it could be on premises, and it stays in place; I don't have to copy it into Purview. I give Purview line of sight so it's able to get to it, it has a certain credential, but it doesn't have to copy the data. A huge part of what this lets me do is discovery: what data do I have out there across these services, and what type of data is it? Then I can classify it: is it PII, is it about this certain project, is it HIPAA related? It will go through, in place, and classify, and once I classify it, remember, I can then apply governance to it. I can put controls around what I want to happen with that data, which obviously helps me manage risk and meet compliance requirements against the reality of the situation. And when I think about classifying, it can be as simple as expressions, think of it as a certain number of digits, then a dash, then a certain number of digits, then a dash, like finding a Social Security number, but it can also have trainable classifiers. What does that mean? It means, hey, I show it bunches of data, examples of the data, and then through machine learning it can learn: oh, OK, that looks like that other data I was shown before, I'll classify it as such. And again, once we classify it, then I can apply that governance to it.
Now this includes things like data in Teams, data in SharePoint, data in Exchange, data in OneDrive, data in Azure storage accounts, data in AWS S3 buckets, databases. So it's going to go and discover the data, find the sensitive data, apply classifications, and then I can even add controls through things like Purview information protection: I could add watermarks to documents, for example, say hey, classified; I could encrypt it; I could add access restrictions to it. It's also going to give me things like data lineage. When I think about data lineage, that's: where has this data come from, what has it gone through, where has it gone, what's the history of this data? I can see that. I can also look at the behavior related to the data: hey, someone's suddenly doing a mass download, that's a little bit suspicious; I could report on that, I might want to block it. So I have a lot of flexibility around finding the data wherever it is and understanding what that data is, wherever it is. I may not know this data is sitting out on some other service; this is going to help me find it, and that's really the key part of all of this. Now, we talk about data, all these different types of data, and we talk about, hey, we ingest it and we do stuff and we put it somewhere else. Well, how do we get it? How do we get this data? Let's start off with the idea that there are different ways I can actually collect the data (let me make sure I've got enough room, I don't want to run out of space, scroll down a bit). So if I think about data, what are some of the common ways I could get data from something? Well, obviously there's batch. Batch is maybe the most familiar to people: the idea is that data is collected in a group. I have a collection of records, maybe it's the last day of data or the last hour of data or whatever it is; there's a big group of it. So typically the idea is this will be processed on an interval, there's going to be some interval-based processing: every hour we go and fetch the log file, every hour we go and fetch the sales data. Typically, because it's a batch that accumulates and then we grab it, it's going to be a very large volume, so we associate the idea of a large volume of data with batches. And because we're collecting and processing in groups and there's time between each of those iterations, there's going to be latency. This is not real time; there's going to be a delay between that data being created, it progressing through whatever this pipeline, this workflow, is, and me doing some work on it. That's very different from the idea of streaming. If you look at the modern world we're in today, think about the Internet of Things, these tiny fridges and sensors, they're connected to the Internet and they're constantly emitting telemetry, that time-based data, hey, my CPU percentage or my temperature. They're constantly streaming a lot of data, and when you have millions of these things, that's a massive amount of data. That's not in batches; it's constantly being sent to some target. So the idea with streaming is we process the data as it arrives, whereas again with batch, maybe something is writing it and it just sits and writes it to a file that accumulates, lots of files accumulate, and then they get grabbed in batches and something happens to them. So because of this, streaming is real time, or we talk about near real time; it's really, really quick.
So batch had potentially a high latency, whereas streaming is very, very low, almost zero latency. Now, because we're processing the data as it arrives, and maybe the source can't even send the next bit until we process it, I have to really think about scale; this cannot be a bottleneck. So when I do streaming, whatever is receiving this data has to be able to handle the volume. Again, I talked about the example of the Internet of Things; imagine maybe I'm doing tweet analysis, as soon as people tweet I'm grabbing that; maybe it's stock prices, I'm constantly grabbing as the stock price changes and doing some analytics on it. Now, when I think of streaming and I think of services, one of the big ones we have is Azure Stream Analytics. This is designed specifically to be that target as you've got these sources of data coming in, so it's real time. It's a managed processing engine: it captures the stream from the input, I can run queries against it, it can extract bits of data from it and then output to other targets for further analysis. So it's doing both the ingestion, the capture, and the processing in one part. If I think about where the data is coming from, it could absolutely be IoT, so Azure IoT Hub for example, it could be coming in from blobs people are writing, it could be from Event Hubs and other things as well. It's getting them, it could be selecting, it could be aggregating over very, very small time windows potentially, and then, hey, I can send it on: maybe I'm sending things to a data lake, maybe a data warehouse, maybe with that real-time processing I'm actually going to generate alerts, hey, this temperature sensor on this piece of equipment has risen to a certain level, I'm going to alert so we can do safety things, so we can stop things happening. And we can do that because of this very, very low latency, because it's near real time. So that's really the key part around that. Now, this is all great, except the challenge is that very often this data is coming from many, many different things, and it's typically not in the format we want it to be, because maybe I don't have the ability to change the output format of whatever is generating it. But it's not in a good format for me to actually analyze or process: maybe it's not normalized, it's not in a fixed structure, it needs to be cleaned up, there's bad data. I need to clean all that up before I put it somewhere to do analysis against. So I need some set of processes to make this good, because again, the data can come from many, many places. If I think about all of the different things that could happen to this data, remember what we talked about before: I have to ingest it from somewhere, so I have to think about an initial extract. Again, that extraction could be coming from a device, from a form, from a database; it could come from batch, it could come from streaming. There's a huge number of different sources that I could be extracting from. But then I need to get it into the right format to do work against, so I think about, well, I have to transform it. And there are different types of transformations, and we'll talk a bit more about that: it could just be cleaning it up, it could be standardizing formats of dates to a certain format, it might be taking out bad data, it might be combining fields.
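As a tiny illustration of that kind of transformation, here is a hedged SQL sketch that standardizes a date, combines two fields and filters out bad rows; the staging.raw_sales and clean.sales tables and their columns are hypothetical, just to show the shape of the idea.

-- Take raw, messy rows and load a cleaned version into a target table
INSERT INTO clean.sales (SaleDate, CustomerName, Amount)
SELECT
    TRY_CONVERT(date, s.sale_date, 103)    AS SaleDate,      -- standardize dd/mm/yyyy text into a date
    CONCAT(s.first_name, ' ', s.last_name) AS CustomerName,  -- combine two fields into one
    s.amount                               AS Amount
FROM staging.raw_sales AS s
WHERE s.amount IS NOT NULL                                   -- drop bad data
  AND TRY_CONVERT(date, s.sale_date, 103) IS NOT NULL;       -- drop rows whose date can't be parsed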
And then, once I've transformed it, I have to load it into whatever system is actually going to do the work on it. So we have extract, transform, load: ETL. It's a very popular concept: hey, we get the data from somewhere, we get it into the format we need, transforming structures, removing bad data, standardizing certain attributes, filtering things out. And this really comes from the idea that we transform the data before we load and store it, because maybe there were limited amounts of storage: by transforming it first and getting it into the right format, we only store the data we know we want to run the analysis against. But the problem with extract, transform, load is, for one, that whole idea that storage is super expensive doesn't really apply anymore; a data lake is super cheap. But also, once we've transformed it and discarded data we don't care about, I can't come back at a later time and say, well, actually, I want to see those other attributes. We can't; we dropped them. So a very common variation is: now we do an extract, we do a load, and then we do a transform, and then load it into something else. And when I think about that first load, guess what, we put it in a data lake. We keep it in its original unmodified form: we extract it and we store it straight away. This is that ADLS Gen 2, remember, that sits on blob and is super, super cheap. So this is loading it in its raw form: we're not trimming, we're not filtering, we're just taking it and storing it, which is why you commonly see things like Parquet files here, or it could be CSV, it doesn't matter. Then we can transform it and load it onward. But if in the future I say, you know what, I need different data, or I need it in a different format to do a new type of analysis, I can go back: I can load data from that raw store again and transform it in different ways. So extract, load, transform, and then load again, actually gives me a lot more flexibility for things that I don't yet know I need to do, because I can always go back to my data lake, transform the data in different ways and then store it. That's why this is very, very powerful: because we have these concepts of very, very cheap long-term retention, you know what, let's not discard anything, let's store everything in its original format, because I can always come back to it and derive new things. And again, we have things like tiering on blob, that hot, cool and archive, and I could take advantage of lifecycle management, so I can really optimize what I want to do. Now, there are even ways I might be able to directly access the files in the data lake without having something do a new transformation. We'll talk about PolyBase, which essentially lets me take files and expose them as tables in some database service. And you have concepts like schema on read. Ordinarily, what we do in these transformations is take the data and transform it into the fixed format that has the schema, so it's schema on write: as we write the data, we've converted it to the end format, which is very easy for the analysis to work against. But if the analysis wants to go directly against the data lake, there's the concept of schema on read: it applies the schema and converts the data on the fly, as it's read, into a format I can run queries against.
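For a feel of what schema on read looks like in practice, here is a minimal sketch using the kind of query a Synapse serverless SQL pool supports over files sitting in the data lake; the storage account, container and path are made up for illustration, and the exact URL and options depend on your environment.

-- Query Parquet files in the data lake directly, applying the schema as the data is read
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',  -- hypothetical path
    FORMAT = 'PARQUET'
) AS rows;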
Applying the schema as the data is read is more expensive in terms of computation and it's going to be slower, but I can do that if it's a more ad hoc type of interaction, so that concept is available to me. Now, I mentioned these transformations, and realize there are different solutions. When I think about transformation, there are very simple types, mapping: hey, this field goes over to here, and services like Data Factory can do very, very simple mapping; I could write an Azure Function to do very simple mapping. But then there are more complex, analytics-type transformations, and that's where you'll hear about things like HDInsight and things like Databricks. These can do very complex analysis and interactions with the data, and we'll come back to that in a second. But let's take a step back. I drew this idea of extract, transform, load, or extract, load, transform, load; it isn't happening by magic, so what is doing this? How do I get this whole thing happening? Surely something has to drive this whole process, and absolutely, yes, it does. What we need for all of this, all of these interactions, is orchestration. When I think about that data lineage, the data is coming from somewhere, it goes through some process, it gets stored somewhere: something has to drive the things that are happening to the data. So I have to think about a control flow, which is actually calling the things that are going to operate on the data. The control flow itself is not changing the data; it's calling other things. Then there's the data flow itself that will interact with and modify the data. So from an orchestration perspective, we have Azure Data Factory. The point of Azure Data Factory is source to sink: where is it coming from, and where does it end up? That's the goal of Azure Data Factory, and its focus is all about that control plane, the control flow. It is not typically modifying the data. Now, it can do some basic things if you want it to, like those basic mapping capabilities, but its primary goal is calling other things that will perform activities that may or may not change the data. Within Data Factory I create a pipeline, and that pipeline is going to drive calling various things. So it's going to call, hey, activity one, maybe activity one is write it to a data lake; then it's going to call another activity, OK, now it's been stored in its raw format, maybe now I'm going to call Databricks to do some analysis on that data; then it does some other things, and eventually, hey, I'm going to go and store it in a different place, I'm going to go write it. So Azure Data Factory is the orchestration that gives me that complete flow of the data: it's going to talk to the systems to do the extraction, it's going to call things to do transformations, and it's going to go and put the data in different places. And when I think about the interactions, I can have different triggers. What's going to trigger this? It could be a schedule: hey, there's a batch I want to go and grab every hour. It could be events: hey, a blob is written to a storage account, go and grab it. It could be manual. There are many different ways I can trigger these, and the point is I can have many different sources: source one, source two, source three; it could be streams coming in, it could be data from a database, it could be logs.
All of those things. It doesn't matter; I'm going to have this complete flow going through, with this orchestration doing all of that work for me. So again, focus on Data Factory as the control flow. It's the control plane that makes sure activities get called in the order I say, and it has logic, hey, I need loops for example, it's going to do that within here. It can handle hybrid environments: the data can be coming from Azure, from on prem, from other clouds. It has a nice UI where I can construct these pipelines. There's an integration runtime that I can run on a system, so if Data Factory doesn't have direct access, the integration runtime can give it access to data where it is. There are different ways it can interact, but remember, the key point is it's typically not doing anything to the data itself; it calls things that will do things to the data. That's the whole point. And if I think about this, remember, this is the workflow of the data engineer. This is not the database admin, it's not the data analyst; it's the engineer. The engineer is responsible for understanding all the data and understanding, well, how do I get it, and what needs to happen to it to get it into a state where the analyst can then go and do useful things with it. Now, when I think about some of these common activities, again, Data Factory can do basic mapping, but for richer types of interactions to transform and get insight from the data, there are really two key services. A big one is Azure Databricks. This is Apache Spark based: Databricks is an Apache Spark based large-scale data analytics solution, designed to give me analytics about my data, and Azure Databricks is just an automatically deployed and managed Databricks instance. It's using virtual machines, it's using blobs behind the scenes, and it can autoscale. So it's all about Apache Spark, it's fully managed, it gives me things like autoscale, and if I'm one of those data engineers and I want to use Databricks in Azure, hey, I'm going to use Azure Databricks. It has the idea of notebooks that I can leverage, which give me the ability both to analyze the data and to collaborate between different data engineers, data scientists and business users. So that's a very specific service. There's also Azure HDInsight. This is all about Azure-hosted clusters of Apache open source data processing solutions, and there's a whole bunch of these; they're all Apache something. So we have Apache Spark; remember, Databricks was a particular implementation built on top of Apache Spark. Spark is a distributed data processing system that supports lots of different languages and APIs, including Java, Scala, Python and SQL, but it's all about doing that data processing: it's distributed, it splits the work up so it can accelerate how the data is processed. Then there's Hadoop, again a distributed system, but designed to process very, very large volumes of data across clusters, and you can create MapReduce jobs, which I can write in different languages, that process the data in map and reduce phases so it can be split up and handled efficiently. There's Apache HBase, an open source system for large-scale NoSQL data storage and querying. There's Kafka, and there's Storm.
So Kafka is a message broker for data stream processing, and Storm is an open source system for real-time data processing through these various topologies. And typically the whole point, and I've tried to show this idea here with a super simple example: hey, I've got data, and it might be different sources, data source one and data source two. I have to firstly ingest the data, so my first step would be, OK, we need to get the data in. Well, that would be Data Factory; Azure Data Factory would handle that. And remember, Data Factory is actually involved in all of the steps, but now Azure Data Factory would be thinking about, OK, I'm going to load, I'm going to put it in a data lake, so I have my ADLS Gen 2 for example, and I'm loading it, keeping it in its raw format. But then I'm also going to do a transform, because I want to get it into a fixed format, maybe I want to get some analysis against it, so maybe I'm using Databricks; and again Data Factory is that control flow, remember, it's driving these various things. And then maybe, OK, from there I'm going to do a load into Synapse, I want a data warehouse, Synapse, remember. And one of the interesting things is Synapse actually has all of these things kind of built in, so although we talk about Azure Data Factory doing this, Synapse can use Data Factory pipelines to light up and enable these types of functionality for you. And then once it's there, maybe I've got Power BI, which goes against it and does that analysis. Or, in the various interactions, I might run queries that actually go directly against the data lake using PolyBase, which is a way to surface the data from the files directly as virtual tables within the solution, so I can interact with it without having to do other things. That might just be one example of a flow that I may have; there are lots of these components coming together. Now, there is another solution I didn't really talk about, and I should super quickly. If you've used Azure, you've probably seen a Log Analytics workspace: it's the store for all my telemetry and my logs, and I can use the Kusto Query Language, KQL, to run queries against it. Well, that service actually sits on top of something called Azure Data Explorer. So this is all about log and telemetry ingestion, and then using the Kusto Query Language to say a table name and then what I'm looking for; I can write very, very complex queries to analyze the data. So if this was logs or telemetry, maybe I would ingest it directly into Azure Data Explorer; for logs and telemetry I might want to use Azure Data Explorer, it's got this huge scale capability, and I can run these great queries directly against it. So depending on exactly what I want to do, Azure Data Explorer may be a very good solution. Now, one final thing: I talked about Power BI and how the whole point of Power BI is that it's great for doing that analysis. Realize there are many, many types of analysis. I might think about descriptive: what happened? I might think about diagnostic: why did it happen? I might think about predictive: based on history, what will happen? Prescriptive: what should I do? And cognitive, which is all about giving me conclusions: hey, based on existing knowledge you have, draw conclusions about what this all means.
And the whole point about Power BI is it can visualize all of this. I can bring different pieces of data together, I can create models to make this more digestible to the business and actually get true insights from the data. So that's what I wanted to cover. The key point really for the DP-900 is: are you going to be an expert in every one of these things? Absolutely not, nor do you need to be. The point is understanding, hey, look, there are these different services available; understand examples of where I might want to use them. Hey, I just need to store a whole bunch of unstructured data: well, a data lake would be great for that, and that's on top of blob. I need a system with multi-region write capability and flexible consistency for my born-in-the-cloud application: that sounds like Cosmos DB. I want a managed SQL deployment in my virtual network with full, or near full, compatibility with my on-premises SQL Server: well, SQL Managed Instance. Hey, I'm using the LAMP stack, what service would help me do that in a managed way? Well, the LAMP stack, remember, part of that is MySQL, so there's a managed MySQL offering there. So understand what the offerings are, understand the different options that are available and their capabilities, and I would stress: do not panic. At the end of the day, if there's a question you don't know, eliminate the obviously wrong answers. There are always answers that say cheese, and it's like, it's not cheese. Get it down to a set you think it could be and pick the one that seems the most logical to you. Again, things in Azure are not named or structured to make them hard for people to use; they want it to be very intuitive. If you don't pass it the first time, you'll get a score report and it will show you where you were weakest; just go and focus on those weaker points and you'll get it the next time. So, a lot of work goes into creating these, so a like and subscribe definitely is appreciated, but I wish you good luck. I'm sure you're going to crush it, and I'll see you in another video.
Info
Channel: John Savill's Technical Training
Views: 146,422
Keywords: azure, azure cloud, microsoft azure, microsoft, cloud, dp-900
Id: 0gtpasITVnk
Length: 148min 1sec (8881 seconds)
Published: Tue Aug 30 2022