Inside Azure Datacenter Architecture with Mark Russinovich

[MUSIC] Hi, welcome to Inside Azure Datacenter Architecture. My name is Mark Russinovich. I'm Chief Technology Officer and Technical Fellow at Microsoft. This is a talk that I've been giving now for several Ignites, and each time it's a brand-new talk. This one is no different: lots of new demos. In fact, I've got a dozen demos to show you that all highlight some of the latest innovations across all of Azure. Whether you're new to Azure or you've been following Azure for some time, including watching these presentations, there's something here for you.

Here's the agenda that I've got for you. I've got seven sections. I'm going to start by taking a look at our datacenter infrastructure, including the way we design datacenters to go out across the world. Then I'll go into our intelligent infrastructure; intelligence means machine learning, and I'll cover how we're applying it for AIOps across lots of different services inside the infrastructure. I'll take a look at networking, including the physical networking architecture as well as the logical services that run on top of it. I'll talk about our servers, highlighting some of our largest servers as well as some of our smallest and coldest ones. Then I'll talk about Azure Resource Manager, which is the universal control plane for Azure; one of the cool innovations we've got there is making it much easier for you to author Azure Resource Manager templates using a new language. I'll go inside Azure compute and show you how we're leveraging confidential computing to make it easier for you to protect your data while it's in use. Then finally, I'll go into Azure storage and data, showing you some of the innovations in log analytics and the way we can store data extremely efficiently, with extremely high density, for millennia without having to touch it. It's an exciting show, so let's get started by looking at our datacenters.

Our datacenters are actually part of a spectrum of Azure that we call the world's computer. It starts there on the right with our Azure hyperscale regions, which are probably what you think of when you think of Azure, and spans the spectrum all the way down to the left, to Azure Sphere microcontroller units, MCUs, that are just four megabytes in size. You still have connectivity and services that are consistent across this entire platform, which is our goal: to be able to connect, deploy, and operate consistently throughout that entire spectrum, including the points in the middle, whether that's our hardware like Azure Edge Zones, or hardware that's on your premises like Azure Private Edge Zones, Azure Stack Hub, and Azure Stack HCI, which runs completely on hardware that you bring, Azure Stack HCI-certified hardware.

But the hyperscale public Cloud regions are the big focus; they're the center of the Cloud, and we've continued to build them out over time. We now have over 65 Azure regions. We've taken Azure to so many countries in response to demand from countries and businesses that want data close to their enterprises, as well as data on their sovereign ground, protected by their laws. Just in the last year alone, we've introduced nine new regions, including Indonesia, our latest announcement in February of this year; the state of Georgia; Chile, bringing a region to South America; Denmark; Greece; Sweden; Taiwan; Australia; and then Arizona in the United States in September of last year. Within those regions, we've got an architecture that is built on top of availability zones.
We've been working on availability zones for many years. You've heard us introduce them years ago and continue to build out availability zone capabilities. The concept behind availability zones is making it possible for you to have high availability within a region: high availability meaning that you can store data durably and have compute that can survive localized problems inside a physical datacenter, like a flood, a power outage, or an HVAC cooling problem, and still be able to serve compute and data out of the remaining two availability zones, because we promise a minimum of three with every region. With availability zones, like I mentioned, we've been on a long journey, but this is a pivotal year, because we're excited to announce that we're going to have AZs in every country: a minimum of three availability zones in every single country that Azure is in by the end of this calendar year. Going forward, every single new region that we announce will have availability zones in it. Not only that, but when it comes to writing applications that take advantage of the resiliency availability zones offer, we've now promised to have all foundational and mainstream Azure services support AZs by the end of the calendar year as well. Regardless of what service you're using, if it's in a region with availability zones and an availability zone has a problem, that service will continue to operate.

Hyperscale public Cloud regions are one example of our datacenter designs, but we're taking datacenters and Azure regions elsewhere too. If you take a look at our modular datacenter designs, you can see that we've got a form factor that is portable. In fact, that box right there is designed to be pulled on the back of a semi truck so it can go almost anywhere, which makes it great for humanitarian disaster relief, for a forward operations center, or for taking to an edge site as a temporary base of operations. It's designed to be fully remote, so it works in fully disconnected modes as well as partially disconnected. It's got a SATCOM module built into it so that it can communicate via satellite if it doesn't have a wired connection. It's also got a high-availability UPS module built in. The servers are on shock absorbers so it can tolerate the transport as well, and it's got HVAC built right in. It's basically fully self-contained Azure inside of a tractor-trailer.

But all of these regions that we've talked about add up to tens or hundreds of megawatts, even gigawatts, of power consumption. One of the top concerns for us, for our customers, and for governments is taking care of the environment. Azure has a big focus on sustainability efforts that are part of Microsoft's overall commitment to the environment. There are a number of different ways that Azure is contributing to those sustainability efforts. One example is the procurement of renewable energy and powering our datacenters with renewable energy wherever possible. Microsoft has made some of the largest renewable energy deals on the planet, including one that we recently made here in the state of Virginia for solar power, which adds up to a total of 300 megawatts. This project is well underway, with 75 megawatts coming online last October and another 225 megawatts of solar power coming online by the summer. We also announced that the Denmark region is going to run on 100 percent renewable energy.
You're going to see that trend pretty much continue going forward with our new region announcements, but it doesn't stop at renewable energy; it's also about how efficient our datacenters are at consuming energy. There's an industry-standard metric called PUE, power usage effectiveness, which measures datacenter efficiency as total facility power divided by the power that actually reaches the IT equipment. Basically, the higher the number, the more inefficient the datacenter is. A typical IT datacenter has a PUE of between 1.8 and 2. Microsoft's datacenters, just like traditional IT datacenters, were around that number in the late 2000s. We've continued to invest in new datacenter designs that are more efficient, minimizing the number of components and streamlining the power path from the connection to the datacenter all the way to the server, so that we get as close to 1 as possible, meaning perfect efficiency. Our latest datacenter design, called Ballard, has a PUE of 1.18, which is industry leading.

Our sustainability efforts also go beyond simply being friendly to the environment going forward: how do we help improve the environment? As part of that, we issued a request for proposals in July of 2020 to procure one million metric tons of carbon removal in 2021. In response to that RFP, we received proposals for 189 projects in 40 countries, and we've already bought 1.3 million tons of carbon removal. One of the projects we have underway is with a Swiss company called Climeworks that Microsoft has invested in. Together, we're going to permanently remove 1,400 metric tons of carbon this year. In many cases that carbon is going to be put back to useful purposes, including going into synthetic fuel production, being used in greenhouse agriculture, going into carbonated beverages, or even being permanently stored underground in volcanic rock using a mineralization process.

Let's turn our attention now to the way that we operate our infrastructure. We call it the intelligent infrastructure; just like we've got intelligent Cloud and intelligent Edge, we apply machine learning and intelligence to everything that we're doing. This is part of our broader effort to raise the reliability of Azure and continuously improve it. If you're interested in some of the things I'm going to talk about here, you might be interested in the blog post series that I started a couple of years ago, the Microsoft Azure reliability blog series; you can see the first post from that series, which covers some of these topics.

Now, AIOps: the places we apply it include using machine learning to look for signals and anomalies, and for automatic communication. If there's an issue, we'll first detect it using AIOps and then automatically communicate it to just the impacted customers so that they're aware something is happening. That's a great example of where we simply cannot meet the time-to-notify that customers expect if we're waiting on humans to make an assessment and then figure out who they should communicate it to. Instead, we rely on automated systems to perform that as much as possible. We also do root cause analysis: once we do have an issue, we can root-cause it by having AIOps look at all the signals and point us in the right direction. That can help us resolve an issue more quickly, or, even if it's not impacting customers, figure out how we can prevent those issues from surfacing again and potentially impacting them. At the heart of our AIOps is a system we call Brain, as you might expect.
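To make the idea concrete, here is a minimal, purely illustrative Python sketch of the kind of signal processing an AIOps pipeline starts with: flagging anomalies in a telemetry stream with a rolling z-score. This is a toy, not how Brain actually works, and the telemetry values are made up.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(samples, window=10, threshold=3.0):
    """Flag points that deviate sharply from the recent baseline.

    samples: list of (timestamp, value) telemetry readings (hypothetical).
    Returns the (timestamp, value) pairs whose z-score against the
    preceding `window` readings exceeds `threshold`.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = [v for _, v in samples[i - window:i]]
        mu, sigma = mean(baseline), stdev(baseline)
        ts, value = samples[i]
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            anomalies.append((ts, value))
    return anomalies

# Hypothetical breaker power readings (kW): a steady, slightly jittery baseline
# with one sudden dip in the middle.
readings = [(t, 100.0 + t % 3) for t in range(30)]
readings += [(30, 42.0)]
readings += [(t, 100.0 + t % 3) for t in range(31, 40)]
print(rolling_zscore_anomalies(readings))   # -> [(30, 42.0)]
```

In production, of course, the hard part is doing this across millions of correlated signals and deciding which ones are high fidelity, which is what the next part describes.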
Brain puts all these signals together using machine learning algorithms, figuring out which are the highest-fidelity signals and looking at correlations between signals, so that it can perform all of the operations we've been talking about on top of one platform.

We also apply AIOps in another place: efficiently operating our infrastructure and providing a better customer experience. The system that we built here is called Resource Central, and it's one that we've talked about publicly; we've got published academic papers that go into how Resource Central actually works. At the heart of Resource Central is machine learning training, done offline by taking all of the signals from our production services related to power management, spot VM eviction, and more efficiently packing virtual machines together, coming up with trained models, and then pushing those trained models out to the Resource Central service, which acts as part of our allocation system and our infrastructure control system. This is a continuous feedback loop: as the system continues to evolve, as customer behavior patterns change, and as we introduce new services and capabilities, Resource Central just gets smarter and gets applied more broadly.

Another great place that we're applying Resource Central is in minimizing customer impact from failures that we think are imminent. Imminent failures include ones where we're getting signals from the hardware in our servers that it's starting to produce errors. Those errors might be correctable at this point, or there might be a failure of a component that isn't impacting any virtual machines yet but signals that the server is likely moving into a degraded state. We're using Resource Central to take those signals and create models of how a server is going to perform: how likely is it to fail? What's the timeline before it fails? Then we also figure out the best way to respond. In some cases that just means resetting the server; it could be a software-induced failure, and we might need to do what's called a kernel soft reboot, which means resetting just the hypervisor environment while leaving the virtual machines preserved in place through that process. Resource Central, through Project Narya, has been operating in production for about a year now. As a result of applying this intelligence to prediction and to choosing the best way to mitigate impact, we've seen a 26 percent decrease in VM interruptions, meaning cases where some software or hardware problem on the server impacts virtual machines. That's already a huge impact, and we're just getting started with this capability.

We're also applying intelligence to the way we operate our datacenters themselves. If you take a look at a datacenter architecture, besides the physical infrastructure I've been talking about, there are humans involved in the process. There are maintenance activities that have to take place, maintenance activities that, if done improperly, could impact production services. For example, if you take a redundant power feed out for maintenance while the feed it's redundant for is offline because of a problem, you've just caused an outage.
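The predictive-mitigation idea described above boils down to mapping a failure prediction to the least disruptive action that still protects customer VMs. Here is a minimal hypothetical sketch of that decision shape; the thresholds, field names, and actions are invented for illustration and are not the actual Project Narya policy.

```python
from dataclasses import dataclass

@dataclass
class NodePrediction:
    """Hypothetical output of a failure-prediction model for one server."""
    failure_probability: float   # chance of failure within the prediction horizon
    hours_to_failure: float      # predicted time remaining
    software_induced: bool       # e.g. hypervisor fault vs. a failing DIMM or disk

def choose_mitigation(p: NodePrediction) -> str:
    """Pick the least disruptive action that still protects customer VMs."""
    if p.failure_probability < 0.2:
        return "monitor"                   # keep watching, no action yet
    if p.software_induced:
        return "kernel-soft-reboot"        # reset the hypervisor, preserve VMs in place
    if p.hours_to_failure > 24:
        return "block-new-allocations"     # drain gradually via normal VM lifetimes
    return "live-migrate-and-service"      # hardware is going fast: move VMs off now

print(choose_mitigation(NodePrediction(0.7, 6.0, software_induced=False)))
# -> live-migrate-and-service
```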
We're applying intelligence to datacenter operations, again looking at all the signals together, but also keeping track of the performance and availability of the datacenter so that we can make sure maintenance operations don't impact customers, and that other operations, like increasing the datacenter infrastructure footprint, don't impact customers either, because they're done at opportune times. To do this, we look holistically at the datacenter and understand impacts: where is this breaker? What is it going to impact? If we perform maintenance on it, what other correlated components do we want to make sure are up and running while we do that maintenance? We're modeling our datacenters using Azure Digital Twins.

Azure Digital Twins is one of our newest services in the IoT category. Digital twins allow you to create a full model of a physical environment as well as a virtual one, modeling things like the devices themselves and then logical abstractions on top of them, all in graph form, where the graph is reactive. What that means is that you can pump data into those digital twins from the live environment, and you can perform simulations: what if I did this, what would it impact? Because the graph represents the connections between those different resources. You can also invoke external business processes from Azure Digital Twins, like calling out to a Logic Apps workflow or an Azure Function: if some digital twin modeling a particular component moves into a particular state, go trigger this workflow, and that workflow could get datacenter operations involved to take a look at a problem, or kick off an automated workflow that's going to mitigate or prevent some problem from happening.

Let's take a closer look at the digital twins environment that we set up for our datacenters. Here, I'm going to open our datacenter digital twin environment viewer, focused on anomaly views. You can see a timeline at the top that lets us pick a time range during which we want to look for anomalies that might represent real problems. I'm going to select about an hour-long window and slide it over a part of the timeline where I know there were some anomalies. Specifically, two breakers here show as potentially having an outage, and I can also see a number of other devices that have potential anomalies. I can use this data selector to select any of them; I'm going to pick those two outage breakers and look at their digital twins in the graph viewer. I'm going to select one of them, and what that does is automatically connect, using Azure Digital Twins, to Time Series Insights, the data stream of the power coming off that breaker. Sure enough, you can see there was a dip in the power during that one-hour window. In the properties of that digital twin, I can see, for example, that there was a power drop at that point, so the state of that digital twin switched to drop. I can also see that it was connected to this other one. In fact, if I select the other breaker, I see the connection in the digital twin graph between the two; they're pointing at each other, which tells me that they were redundant. But when I select the redundant one, I see that it also had an outage during that time window. That means both of those breakers lost power for some period of time, and that could have had downstream impact that I want to explore.
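The same kind of exploration can be scripted against the Azure Digital Twins APIs. Here is a minimal sketch using the Python SDK (azure-digitaltwins-core); the instance URL, the 'state' property, and the 'drop' value are assumptions standing in for whatever your own twin models define.

```python
from azure.identity import DefaultAzureCredential
from azure.digitaltwins.core import DigitalTwinsClient

# Hypothetical Azure Digital Twins instance URL.
client = DigitalTwinsClient(
    "https://contoso-dc-twins.api.wus2.digitaltwins.azure.net",
    DefaultAzureCredential(),
)

# Find breaker twins whose (model-defined) 'state' property reports a power drop.
query = "SELECT * FROM digitaltwins T WHERE T.state = 'drop'"
for twin in client.query_twins(query):
    twin_id = twin["$dtId"]
    print(f"breaker {twin_id} reported a power drop")

    # Walk its relationships to find what it feeds and what it's redundant with.
    for rel in client.list_relationships(twin_id):
        print(f"  {rel['$relationshipName']} -> {rel['$targetId']}")
```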
I can look beyond just the immediate graph connections, out to four levels deep, to see what else was impacted. You can see a number of other devices, and maybe logical components, that were impacted by that particular outage, and that lets me go explore and understand the impact. If we're looking at this historically in a postmortem, we can understand how the datacenter behaved and figure out ways to prevent the same kinds of problems from potentially impacting customers in the future.

Let's now talk a little bit about Azure networking. Azure networking consists of a bunch of different capabilities: a physical infrastructure, and services on top of that physical infrastructure. What I'm going to do is take a look starting from the servers and the racks, where you can see that we've got, for example, 50-gigabit SmartNICs, meaning FPGA-accelerated network adapters, attached to our servers. Going up, there are 200-gigabit software-defined appliances, our top-of-rack routers, Accelerated Networking services that leverage that FPGA, as well as the container networking service called Swift that I've talked about in previous presentations. Going into our datacenter-scoped hardware, which includes the roughly one million fiber cables that we deploy within one of our hyperscale Cloud regions, there are services like Azure Firewall, Azure DDoS Protection, which protects your websites against DDoS attacks, and our load balancing services.

Then there's our regional network, where we've got something called regional network gateways. Just like availability zones give us redundant, fully isolated datacenters on which to run compute capacity, we also make the network completely redundant, including the physical infrastructure. There are at least two RNGs in every region, so that if one of them fails, we still have connectivity between those availability zones as well as out to the WAN. This RNG architecture, which comes in T-shirt sizes from 28 megawatts of capacity in the region up to 528 megawatts, really gives us a ton of flexibility with minimal network hardware connecting those zones together and out to the WAN. We have to meet our latency boundaries: all of this infrastructure, to stay within the two-millisecond round-trip latency envelope between availability zones, has to be within 100 kilometers or so of each other. That's one of the many dozens of factors that determine where we place availability zones and the RNGs inside an area where we're creating a hyperscale Cloud region.

Then that takes us out to the WAN. Microsoft's WAN is one of the largest in the world. We've got 130,000 kilometers of fiber and subsea cable, and we added over 300 terabits of capacity to the WAN just in 2020. We saw a huge surge of work-from-home and learn-from-home traffic as countries went into lockdown, which caused us to expand our network capacity to support that shift of activity out of traditional enterprises toward the edges, as well as the increasing demand for Cloud services. We also tripled our transatlantic cable capacity; I talked about that in my presentation last summer on how Microsoft reacted to the COVID demands. Then finally, there's last-mile connectivity. We've got over 180 edge sites and continue to grow. What that means is that when your network traffic is aimed at Microsoft or an Azure service, it enters our backbone right at that edge site, onto our dark fiber WAN, so you get the highest possible performance, and consistency of performance, into our network.
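As a quick sanity check on the 100-kilometer figure mentioned above: light in optical fiber travels at roughly two-thirds the speed of light in a vacuum, about 200 kilometers per millisecond, so a simple back-of-the-envelope calculation (approximations, not Azure's actual latency budget) shows why that distance fits inside a two-millisecond round trip.

```python
SPEED_IN_FIBER_KM_PER_MS = 200.0   # ~2/3 of c; a rough rule of thumb
distance_km = 100.0                # approximate maximum separation between zones

one_way_ms = distance_km / SPEED_IN_FIBER_KM_PER_MS
round_trip_ms = 2 * one_way_ms
print(f"propagation round trip: {round_trip_ms:.1f} ms")   # -> 1.0 ms

# That leaves roughly a millisecond of headroom for switching, serialization,
# and the fact that fiber paths are longer than the straight-line distance.
headroom_ms = 2.0 - round_trip_ms
print(f"headroom within the 2 ms envelope: {headroom_ms:.1f} ms")
```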
The closer to you that traffic can enter, the more of its path is on our network and under our control, where we can provide the best quality of service. We've had 100 percent growth in peering capacity, again as part of the expansion we did in response to the COVID demand spikes. Some of the services we've got there include ExpressRoute at 100 gigabits, which lets you connect your own enterprise networks into Azure directly from the edge of your enterprise network, basically entering our dark fiber backbone at 100 gigabits per second of network capacity. We also have our CDN services that you can leverage across all of those peering and edge sites.

Now, one of the ways that the network, not just Microsoft's network but the world's network, is programmed is with something called Border Gateway Protocol. As part of our focus on making our networks more resilient, not just at the physical infrastructure level, we've also been making them more resilient at the logical level. Border Gateway Protocol is the protocol used to advertise the routes from one IP address to another: how do I get this packet over to that server, which could be sitting inside an Azure datacenter? In most cases, the BGP routers along that path advertise the correct path to the destination; as the packet arrives at each hop, it gets routed in the correct direction and eventually makes it to the destination server. But it's possible, with the way BGP has been architected and used up until very recently, for bad actors to misroute packets by advertising false routes, or for an operator to make a mistake and leak a route. Leaking a route is what happens when an operator advertises a broad set of IP addresses to be routed in a certain direction that includes prefixes that shouldn't be part of it, which pulls legitimate traffic away from its destination and blackholes it by accident.

We've been working with a bunch of different companies to improve BGP, and this problem isn't just theoretical, it's a real one. For example, a couple of years ago you might have heard of a cryptocurrency heist in which myetherwallet.com's traffic was redirected because somebody deliberately misadvertised routes associated with Amazon's Route 53 DNS addresses that served myetherwallet.com, causing traffic headed to myetherwallet.com to end up at servers in Russia. If users clicked through the warnings they got in their browsers and authenticated, the attackers got their login credentials to their Ether wallets, were able to steal their Ethereum, and basically absconded with it. Estimates are that they made anywhere between $50,000 and a few hundred thousand dollars off this heist.

Our work with the industry consortium, starting in 2019 when we joined the MANRS project, which is focused on reliably advertising BGP routes using PKI, or route signing, where Microsoft signs its routes with Microsoft signatures and other BGP routers look for those signatures on any Microsoft-advertised routes, has actually gone into effect. We started signing all of our routes in January of 2020, and you can see that going back to January 2020, the hijacks we saw in our network went to zero. We've now got over 149,000 routes signed for Azure services, as well as all of Microsoft's services, giving us the most signed routes of any organization in the world.
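Conceptually, what signed routes buy you is the ability for a router to check an advertisement's origin against a cryptographically published allow-list before accepting it. Here is a minimal self-contained sketch of that route-origin-validation idea; the prefixes and AS numbers are examples, and real validators work against signed RPKI objects rather than a Python dict.

```python
from ipaddress import ip_network

# Hypothetical published route origin authorizations:
# prefix -> (authorized origin AS, maximum advertised prefix length).
ROAS = {
    ip_network("40.64.0.0/10"): (8075, 24),    # e.g. a Microsoft block, AS8075
    ip_network("13.64.0.0/11"): (8075, 24),
}

def validate(prefix: str, origin_as: int) -> str:
    """Classify a BGP advertisement as valid / invalid / unknown."""
    advertised = ip_network(prefix)
    for roa_prefix, (asn, max_len) in ROAS.items():
        if advertised.subnet_of(roa_prefix):
            if origin_as == asn and advertised.prefixlen <= max_len:
                return "valid"
            return "invalid"          # covered by a ROA, but wrong AS or too specific
    return "unknown"                  # no ROA covers it; local policy decides

print(validate("40.90.128.0/18", 8075))   # -> valid
print(validate("40.90.128.0/18", 64512))  # -> invalid (hijack-style announcement)
```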
I've been talking a lot about the WAN and physical infrastructure on the ground, but Azure has actually gone to space with Azure Orbital. Azure Orbital is effectively ground station as a service, where those ground stations are connected to satellites; Azure Orbital is all about how you connect satellites into Azure datacenters. There are two kinds of capabilities that Azure ground station as a service, or Azure Orbital, provides. One of them is Earth observation. With Earth observation, the idea is that you've got satellites taking images of the Earth, of the atmosphere, of the ground, of water, of pollution, and you want to perform analytics and machine learning on that kind of data so that we can understand how the climate and the environment are changing, and what we can do to improve things and prevent catastrophic problems from happening. The best place for that data, therefore, is directly in Azure Data Lake Storage Gen2, where it can then be processed with Azure Synapse Analytics, which I'll talk about a little later. Azure Orbital is focused on that aspect, but it's also focused on communications, specifically IoT communications, where you've got devices in remote locations, maybe even the modular datacenter I talked about earlier, where you don't have a ground-based connection into Azure datacenters and you want to leverage satellite connectivity. For this, we work with a host of satellite partners that register their satellites with Azure and allow customers to leverage those satellites for communication from an edge device, up through the satellites, and down into Azure datacenters to talk to Azure services remotely.

Let's go take a look at that in action. I've got a really cool demo here to show you, where I'm going to show you a network setup that's connected to Azure Orbital, starting with a Virtual Network. Pulling up the Azure portal, you can see I've got a Virtual Network with a bunch of network interfaces on it. I can see that one of those network interfaces has a public IP address, 52.150.50, and we're going to come back to that later. You can also see that I've got a storage account as part of that resource group that's connected to that Virtual Network and blocks access from anything except that subnet. That means that when I try to click on the container, because I'm not accessing it from that subnet in the portal, I get access denied. If I try to access that container in Azure Storage from a browser, I also get access denied, because the IP address the browser is coming from isn't in that subnet, in that Virtual Network. However, I've got a satellite uplink right here that I'm going to connect my phone to, and that's going to let me join that Virtual Network from my phone. When I take my phone out of airplane mode, you can see that it's connecting to the Orbital network that I've got configured here. When I go to Settings, you can see that, sure enough, my network selection is the Orbital network. That Orbital network is connected through that device, and because it's configured to be part of that subnet, you can see that the IP address I show up with publicly is in fact the same IP address we saw attached to that network interface, that 52-dot IP address.
That means that because I'm part of that Virtual Network, coming from that IP address and that network interface, I can, sure enough, access that same Azure Storage account from space, through a satellite, from my phone.

Speaking of virtual networks and subnets, one of the big challenges we've heard from our customers as they build bigger and bigger deployments in Azure is the sprawl of virtual networks. Those virtual networks aren't designed to operate in isolation: they have applications in one virtual network that need to talk to services in another one, or virtual machines in a third one, and up to this point they've had to set up complex peering relationships between all of those virtual networks. You can imagine, if you've got a dozen virtual networks, how many peering relationships you have to create to let them all connect, and you've got to create those connections every time a new virtual network joins and you want to add it: that many new peering relationships each time. To make large-scale management of virtual networks much easier, we're creating centralized network management. At last Ignite, I showed centralized network management and how to set up connectivity relationships between networks very easily: you tag your virtual networks, and with that tag they automatically get joined in hub-and-spoke or peer-to-peer relationships through centralized network management, so you don't have to go manually set up peering relationships. Simply tagging something provides the connectivity for that virtual network that you're looking for. Now I'm going to talk about a new capability we're introducing, related to security management, where you want to apply security policies to a whole collection of virtual networks, and centralized network management lets you do that too. Let's go take a look at that.

Here I'm going to pull up the Azure portal again, and I've got two virtual machines here in Remote Desktop Connection Manager, a Sysinternals tool. You can see that one of those virtual machines is able to access bing.com just fine, and that's because there are no network security rules attached to the virtual network it's part of. When I go into the central network manager, you can see that I've got a network group here called Spoke; that's the one I created last Ignite that allows my virtual networks to connect together in hub and spoke. If I go to the connectivity configuration, that's where I've got that policy set up. But when I go to security, you can see I've got a security configuration where I can create a rule that blocks outbound traffic to the web. I'm going to call it block web, give it a priority, say deny, say outbound, any protocol, destination ports 80 and 443, which are the web ports, and I'm going to select my Spoke network group and save that configuration. Now that I've set up that configuration with that rule in it, I need to go deploy it, so I go to deployment, and you can see that I'm going to deploy that particular security configuration. I target all of the regions that I've got virtual networks in and then apply, and within a few seconds it's applied. Now if I go back to that same virtual machine in one of those regions and try to access Microsoft.com, you can see that it's unable to access it.
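Conceptually, a security rule like the one in this demo is just a match on direction, protocol, and destination ports that gets applied across every virtual network in the targeted network group. Here is a minimal self-contained sketch of that evaluation; the rule fields mirror what the demo configured, but the data structures are invented for illustration and are not the centralized network management API.

```python
from dataclasses import dataclass

@dataclass
class AdminRule:
    name: str
    action: str          # "Deny" or "Allow"
    direction: str       # "Outbound" or "Inbound"
    protocol: str        # "Any", "Tcp", ...
    dest_ports: set[int]
    priority: int

# The rule from the demo: deny outbound web traffic for the whole network group.
block_web = AdminRule("block-web", "Deny", "Outbound", "Any", {80, 443}, priority=100)

# Hypothetical network group: the tagged "Spoke" virtual networks.
spoke_group = ["spoke-eastus-vnet", "spoke-westeurope-vnet", "spoke-seasia-vnet"]

def evaluate(vnet: str, direction: str, protocol: str, dest_port: int,
             rules=(block_web,), group=spoke_group) -> str:
    """Return the effective action for a flow leaving a VM in `vnet`."""
    if vnet not in group:
        return "Allow"                       # the rule doesn't apply outside the group
    for rule in sorted(rules, key=lambda r: r.priority):
        if (rule.direction == direction
                and rule.protocol in ("Any", protocol)
                and dest_port in rule.dest_ports):
            return rule.action
    return "Allow"                           # fall through to per-VNet rules / defaults

print(evaluate("spoke-eastus-vnet", "Outbound", "Tcp", 443))   # -> Deny
print(evaluate("spoke-eastus-vnet", "Outbound", "Tcp", 22))    # -> Allow
```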
If I go to another virtual machine, in a different region, that's part of that same network group, you can see that I'm also not able to access Bing or Microsoft.com, because neither of those was cached on the local machine. So sure enough, I'm now able to perform at-scale management of network connectivity as well as network security across dozens or hundreds of virtual networks, something that's been very onerous up to this point.

Now let's take a look inside our servers and our server infrastructure. One way I find really fun to look at the evolution of Cloud servers over time is to look at our high-memory SKU evolution. Back in 2014, we introduced something that we internally called the Godzilla SKU. We called it the Godzilla SKU because it was the largest virtual machine in the public Cloud at the time. It had 512 gigabytes of RAM, it had 32 cores, and, as you can see, it had nine 800-gigabyte SSDs. Just a monster machine back in its day in 2014. But Cloud hardware has evolved so quickly, driven higher and higher by in-memory databases like SAP HANA and by our customers migrating their SAP workloads and asking for larger sizes. As far as the general evolution of hardware goes, you can see that our general-purpose servers now rival or beat what Godzilla was just six years ago. These are the servers we're deploying at scale, with our DS-series and F-series virtual machines running on top of them. You can see we've got Intel and AMD lines of servers. They've got more RAM now than Godzilla had six years ago, and they've got the same number of cores or more as well.

But when it comes to those SAP HANA workloads, that isn't enough these days. We've been pushed ever higher, and you can see our Beast server, which we introduced in 2017 with four terabytes of RAM. You might think four terabytes is enough RAM for SAP HANA, but we had customers bringing even larger SAP deployments into Azure and wanting larger servers, so we introduced Beast v2 in 2019, with 12 terabytes of RAM. You might think that's enough for SAP HANA workloads, but no, we've been asked for even larger server sizes. Last night I talked about Mega Godzilla Beast. Mega Godzilla Beast has 24 terabytes of RAM and 448 cores. I expect in six years we'll be thinking this thing is relatively small, but right now it seems really big, and here's a picture of the inside of it. It has 192 128-gigabyte DIMMs in it, so it's basically packed with DIMMs; you can see the DIMM slots.

I thought it'd be fun to show you a little demo of what this thing can do. Of course, it can run Notepad really, really fast, but I decided to have some fun with it, and I think you might appreciate this. Here I'm logged in to the Mega Godzilla Beast server. You can see, looking at Task Manager, that it's got 420 of those 448 cores available to the virtual machine and 22 terabytes of that 24 available to it. If we go to the Task Manager CPU view, you can see there are enough pixels there that I thought I could have some fun with it. There were some videos going around last summer of somebody doing things with Task Manager, animations and games, and I thought I'd do it for real.
So I wrote a little program that takes in a bitmap and then, by pinning CPU activity to particular cores in response to the bitmap, I can actually show bitmaps right on Task Manager, and here you can see a scrolling Azure logo. This is actually available on my GitHub if you'd like to take a look at the program, and that was the first thing I did. But I got a little bit obsessed as I was playing with this over the holidays and decided to get some games working on it. One of my favorite games in college was Tetris. Here you can see I've taken a console Tetris game and I'm playing it right on Mega Godzilla Beast in Task Manager by manipulating the CPUs. I'll land another block here. I've posted videos of this on Twitter, but I thought I'd do something new for this Ignite presentation, so I took another one of my favorite games, Breakout, a console version of that, and integrated it as well, and here I'm playing Breakout on Task Manager on Mega Godzilla Beast. I'll hit the ball, and hopefully this takes out a few bricks up there, and sure enough, dead. So, some amazing things you can do with a machine that costs millions of dollars.

One of the other ways we're pushing hardware in our datacenters is with AI and HPC infrastructure, and one of our great partners there is NVIDIA. NVIDIA has a new GPU coming out called the A100 Ampere GPU, and it's the most powerful machine learning GPU ever. You can see that there are a variety of ways you can leverage the A100 in Azure's datacenters on top of our new NDv4 virtual machines. You can use a single GPU, or you can use eight GPUs at a time connected with NVIDIA's NVLink on a single server. But something unique to Azure is that we have a version of the NDv4, the full-server version, that leverages eight 200-gigabit HDR InfiniBand connections for a total of 1.6 terabits of back-end network connectivity between our A100 servers, allowing you to connect many hundreds or thousands of them together and run large-scale distributed machine learning training or HPC jobs on top of them.

Let's take a look at what that InfiniBand back-end network gives you that the front-end network alone doesn't. You can see here I'm connected to two Azure virtual machines. The left one is running A100 GPUs and so is the right one, and each is part of a four-server cluster with A100 GPUs. They're both running the same CPUs: AMD Rome systems with 96 cores, with AMD EPYC processors. If I run the nvidia-smi utility on them, you'll see that they've got eight A100 GPUs attached, each with 40 gigabytes of HBM2 memory. The difference, though, is that on the right one, like I mentioned, we're leveraging the InfiniBand back-end network: it's got the eight 200-gigabit HDR InfiniBand adapters that connect those four virtual machines together. Now, look at the machine learning training run I'm going to execute on these clusters. You can see that the only difference between the two is that I've disabled InfiniBand on the one on the left, because there we're not leveraging the InfiniBand network connections, just the front-end network, to connect the jobs together. We can see that I've got a batch size of eight, a warmup of eight, and 128 iterations, or samples, that I'm going to run through this training job.
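For context on what "disabling InfiniBand" typically means in a setup like this, here is a minimal sketch of a multi-node data-parallel job using PyTorch with the NCCL backend that benchmarks the gradient-style all-reduce the training loop depends on. It is illustrative, not the actual demo script: the environment variables, tensor size, and the use of NCCL_IB_DISABLE to force traffic onto the front-end (TCP) network are assumptions about a typical configuration.

```python
# Launch one copy per GPU across the cluster (e.g. with torchrun), after setting
# NCCL_IB_DISABLE=1 in the environment to force NCCL onto TCP instead of InfiniBand.
import os
import time

import torch
import torch.distributed as dist

def main():
    # torchrun supplies RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT per process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A 1 GiB tensor stands in for a chunk of gradients exchanged every step.
    grads = torch.randn(256 * 1024 * 1024, device="cuda")

    # Warm up, then time the collective that distributed training hammers on.
    for _ in range(5):
        dist.all_reduce(grads)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(20):
        dist.all_reduce(grads)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print(f"avg all-reduce: {(time.time() - start) / 20 * 1000:.1f} ms")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The same script run with and without the InfiniBand transport is the kind of apples-to-apples comparison the demo is making: the model math is identical, and only the interconnect carrying the collective changes.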
This is actually the GPT-2 model from OpenAI that I'm going to be training here. Let's time it on the two systems; the key difference, again, is the InfiniBand connectivity for distributed training efficiency. We can see already, as this gets underway and iterations start completing, that it's taking a little over 300 milliseconds per batch on the right, while on the left you can see 2,000 milliseconds, about two seconds. The right is about seven times faster because it's leveraging that 1.6 terabits of bidirectional connectivity between the servers that the left doesn't have access to; as the left exchanges weights across servers, it runs into that inefficiency.

The A100 really represents a trend we've been seeing, and so do the high-memory SKUs and the core counts on the general-purpose SKUs: a datacenter trend of more and more power consumption per server. How do we cool that increasingly concentrated compute power? How do we pack and leverage the floor space in our datacenters more efficiently? Because if we're air cooling them, which is how we cool these datacenters today, we've got to leave hot-air aisles and cold-air aisles and have huge HVAC systems pumping air into and out of the datacenter if it's not adiabatically cooled: large overheads and lots of floor space wasted just moving air in and out of those servers. We've been investigating how we can achieve better cooling efficiency, and liquid cooling is what we've been focusing on.

There's a type of liquid cooling called cold plate cooling that you might be using; I'm using it in my home system when I play video games, cold-plate cooling the GPUs and CPUs so I can run them at very high clock rates. That's a great way to cool, but it has the downside that every single server has to be custom fitted with the pipes and the cold plates on top of its internal infrastructure, meaning it's not a one-size-fits-all model, and it comes with all the overhead of getting those lines into and out of the servers. So we've been exploring other ways to cool servers beyond cold plate. One of them is single-phase immersion, where we take a liquid and just set the server main board right inside it. Liquid, especially the newer types of liquids, is extremely efficient at cooling those high-performance servers. But what we've locked in on as likely where we're heading in our datacenters, and the most promising approach, is two-phase immersion. In fact, we've made a ton of progress down this path, and I want to give you a sense of where we're heading by showing you some of the prototype work we've got underway.

[MUSIC] Start the test, ramp up the CPU utilization to 100 percent on all the cores, and we get this nice bubbly effect. As the fluid boils, it evaporates, and when the vapor hits the condensers, it's cooled back down and becomes fluid again. Welcome to the liquid cooling lab. Liquid cooling is about bringing the liquid closer to the chip, either by circulating water through a cold plate on the chip or by dunking all of the IT into a dielectric immersion medium. The reason liquid cooling is important right now is that the demand for higher-performance chips, higher speed or core count, continues to increase.
This has been resulting in higher-power chips and higher heat fluxes on the chip, which can sometimes be challenging to air cool, so we require liquid cooling to do that job for us. Liquid cooling affects the whole ecosystem. When you look at the datacenter, the server, and the sustainability promise that Microsoft is making, liquid cooling can help us get there faster. With liquid cooling we can have higher-density racks of IT, or tanks, which can lead to smaller datacenter footprints and lower datacenter energy consumption, both from the mechanical cooling perspective and from the server perspective, because we can reduce or remove the fans from the servers and reduce the leakage current from the chips. And because liquid cooling uses closed loops of warm water, it may be counterintuitive, but we actually eliminate the use of water: we don't need evaporative cooling anymore, because we're always going to operate loops that are hotter than the ambient temperature. It is pretty awesome. It's amazing to work with state-of-the-art hardware and state-of-the-art cooling. We're seeing that the trend of chip power, whether it's a compute chip or an AI chip, is only going up, and with time, liquid cooling is going to become more and more important.

Liquid cooling probably seems pretty exotic, but it's nothing compared to how exotic the next frontier of computing is: quantum computing, something we've been working on for a couple of decades now. Microsoft's approach to quantum computing saw a key milestone this year with the general availability of Azure Quantum. Azure Quantum really represents the full top-to-bottom approach that we're taking. At the very top, we're creating programming languages like Q# that let developers write programs that take advantage of quantum computers. We're creating integrations with Visual Studio Code to make it easy to develop and run those programs. We're creating katas that let you learn quantum programming, and simulators that let you simulate your algorithms both on your local machine and in Azure. We're also working with partners across the quantum computing industry, including Toshiba, for example, which has its own quantum optimization offering that you can sign up for through Azure Quantum. We've also been working on our own quantum-inspired optimization, trying to bring the innovations and unique computational capabilities of quantum computing into classical computing today. Then finally, at the quantum hardware level, hardware partners like IonQ and QCI now have their hardware available through Azure Quantum, where you can run quantum programs directly on that quantum hardware.

But we're also working on our own quantum computer. One of the amazing challenges of quantum computation is that for the qubits, the bits that store the information on which you do the quantum processing, to be stable, they've got to be at extremely low temperatures. How low? Well, colder-than-space low. You can see on this chart that a quantum computer runs at just a few millikelvin. The closer you can get quantum control to the quantum plane, that quantum domain of just a few millikelvin, the more efficient you can be, because every degree of temperature difference means you're dissipating heat and potentially perturbing the quantum computation as you get data into and out of the quantum computer.
We've been focused on materials science and engineering to get quantum control as close to the quantum plane as possible, and we've had some amazing breakthroughs in solving one of the problems of quantum control, which arises when you're not close to the quantum plane. As you can see in this diagram, this is a picture of a cryogenic refrigerator where, at the very bottom, at those few millikelvin, the 54-qubit quantum computer is located, and you can see all those wires coming from room-temperature computers that are controlling those qubits. Hundreds of them to support just 54 qubits: a ton of complexity, a ton of heat, a ton of power. By focusing on how we can create computation down at that quantum plane, we can eliminate all of that complexity by putting the computation right inside the fridge, next to the quantum computer.

This cryo-CMOS project is one we call Gooseberry. You can see how it fits into the overall architecture in this diagram: the quantum plane at the bottom, and the cryo-CMOS control computer, which operates as close to it as possible, controlling those qubits and reading information out of them, then passing that information, data in and readings out, to computers running at classical temperatures. Here's a picture of that Gooseberry processor. You can see it's right next to the quantum computer, which uses qubits based on our topological qubit technology, which we think is the most promising technology for scalable quantum computation. In fact, we think it's the only viable path to large-scale quantum computation, where you're talking about qubits in the millions. With other qubit technologies, you'd need to store and manage qubits across a quantum computer the size of a conference room; here, we can run millions of qubits on a small wafer at the very bottom of that quantum refrigerator. You can see Gooseberry sitting right there, with wires connecting directly into that quantum computer to control those qubits using quantum dot technology. What's the difference between running outside the fridge at room temperature and running inside the fridge next to the quantum computer? It's this: you can see just three wires coming out of the Gooseberry processor into the real world to get data and readings into and out of it. That is a major breakthrough. In fact, we're also working on CMOS technologies where you can take existing CMOS designs and run them at a few degrees kelvin. Gooseberry uses a special type of CMOS technology to be able to operate at just 100 millikelvin, but we're working on other technologies as well. Huge advances there in support of our top-to-bottom quantum stack.

Now let's talk about Azure Resource Manager, which is the universal control plane for Azure. It's not ARM as in the processor; it's ARM as in Azure Resource Manager. Azure Resource Manager, as the universal control plane, provides a bunch of uniform capabilities across all Azure services. It's accessible through the Azure CLI, the Azure portal, PowerShell, the SDKs, and a REST API. ARM provides consistent RBAC, role-based access control; consistent monitoring; consistent policy; and consistent gestures and representations of objects across all Azure resources. The way it does that is through something called the Resource Provider Contract.
The Resource Provider Contract allows any service to plug in to Azure's control plane and provide capabilities that are accessible through all of those different means, and to do it in a uniform way. Every single Azure service, all 200-plus of them, plugs in to Azure Resource Manager, which lets you apply policy uniformly, monitor control plane access uniformly, and perform security insights and monitoring on top of it. If we take a look at ARM's architecture, one of the benefits of ARM is that it provides a global view of Azure regardless of where your resources are. Whether they're in one region, in North Central US, or in Europe or Asia Pacific, you can go to the Azure portal or the Azure CLI and see all of those resources. That's because Azure Resource Manager has an active topology that spans all of Azure's public regions. When you create a resource in one region, it becomes visible from all other regions. It uses the same architecture that we advertise to customers for building these kinds of globally scaled services on top of Azure: it uses Azure Traffic Manager on top of load balancers inside datacenters, and it leverages Cosmos DB for replication of state across all of those regions.

Now, the way you've interacted with ARM has been through the ARM JSON schema, either through Azure Resource Manager templates, which are one of the very powerful capabilities of Azure Resource Manager, or directly through the REST API. What I'm about to tell you about is a project called Bicep. Project Bicep is the result of us talking to customers who have found it challenging to work with ARM JSON for a variety of reasons. One is that it's extremely verbose. Another is that it's very difficult to express what you want and to understand what the JSON is doing, because it's obtuse and indirect about what it's trying to accomplish. We've done things like introduce IntelliSense and provide Quickstart galleries, but we felt that didn't go far enough toward making it easy to author declarative templates to deploy Azure resources. So we took a step back completely, talked to a bunch of customers, and asked, "Would you like us to make it possible to author ARM JSON templates from an existing language like Python or PowerShell?" What we found is that no, the right answer was really a domain-specific language focused on configuration as code: declarative definition of resources like ARM, but much more succinct and much more like programming, to make it easier for people to develop these templates. That is Project Bicep, and working in conjunction with programming language expert Anders Hejlsberg, who created C#, TypeScript, and Turbo Pascal, we've got this language up and running and ready for you to use.

The way it works is that you author your templates in the Bicep language, and you can then transpile the Bicep into an ARM template, so you can get your ARM JSON if you want. If you've already got tooling built around ARM JSON templates, you can leverage that, and then of course turn around and hand the ARM template to ARM through the Azure command line to deploy your resources. We're also excited to announce that with version 0.3, you can give Bicep files directly to ARM and it understands them natively; there's no reason to transpile in the middle if you just want to focus completely on Bicep.
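To make that workflow concrete, here is a small sketch that drives the same transpile-then-deploy loop from Python via the Azure CLI. The file names, location, and parameter are placeholders, and since ARM understands Bicep natively, the explicit build step is optional.

```python
import subprocess

def run(cmd: list[str]) -> None:
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Optional: transpile Bicep to ARM JSON if your existing tooling expects JSON.
run(["az", "bicep", "build", "--file", "main.bicep"])   # emits main.json alongside it

# Deploy at subscription scope (a main.bicep like the demo's creates a resource group);
# because ARM understands Bicep natively, the .bicep file can be passed directly.
run([
    "az", "deployment", "sub", "create",
    "--location", "westus2",
    "--template-file", "main.bicep",
    "--parameters", "name=contoso-demo",
])
```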
We also have the ability to take your ARM JSON templates and transpile them back into Bicep, which lets you take your existing investments and get started on Bicep really quickly, still transpiling back to ARM JSON to update those existing templates and workflows and then deploying to ARM.

Let's go take a look at Bicep in action. Here what I'm going to do is open Visual Studio Code. I've got two Bicep files. One is website.bicep, where you can see parameters declared at the top. You can already see it's much more succinct than ARM JSON for things like declaring parameters, which include the name and the location where we want to deploy this resource. This is an App Service server farm; you can see it's parameterized with the name that we pass in as a parameter. It's running Linux, and you can see that we're creating a website on that server farm, and here's an example of a reference from one resource to another. Now, when I go to create the main deployment, you can see that I've got IntelliSense nicely set up here in Visual Studio Code for the types of target scopes. Here I'm going to target the whole subscription, because what I'm actually doing is deploying a resource group. Resource rg is the declaration, and rg is the name I'm going to give this resource group here in the template, and you can see IntelliSense on the APIs. I'm going to pick the latest version of the resource group API, paste in the name of the resource group and the location where I want it to deploy, and then use a very cool new feature of Bicep, modules, where I can reference other pieces of Bicep. Here I'm going to IntelliSense-complete with a Bicep file, that website.bicep that's right in that directory. I'll paste in the populated parameters, the site deployment, the scope, and the Azure Container Registry name that I want to use, and then I transpile. You can see I've got about 150 or so lines of ARM JSON that came out of about 80 lines of Bicep; I've already cut the verbosity by close to 50 percent, and Bicep also has so many other conveniences, like modules, built in. Here I'm going to deploy right to Azure using the Azure command line, the Azure CLI, and going into the Azure portal, you can see the site deployment underway. If I click on Overview and Refresh, you'll see that, sure enough, I've got my Container Registry created, my App Service farm created, and my website deployed onto it, all using Bicep, with the convenience of Bicep being natively understood by Azure Resource Manager.

You've developed your applications and deployed them using Bicep; now you want to make sure they work resiliently. Chaos engineering is something that Netflix popularized: they had something called Chaos Monkey that would continuously mess with their deployment of Netflix to make sure it stayed operable in the face of the everyday failures you see whenever you operate at scale. We've been using chaos engineering inside of Azure, and now we want to bring chaos engineering to you and make it available to use on the Azure platform. With Azure Chaos Studio, you take your existing Azure application, which includes your service code and your service infrastructure, the websites and VMs that your code deploys onto, and you can run what are called experiments against it from Azure Chaos Studio.
Azure Chaos Studio supports ARM JSON and Bicep for defining your experiments, which you then run against your existing resource groups or subscriptions. It has multiple ways to inject faults into your application. One of them is using an agent: if you've got the Azure agent running inside your website or your virtual machines, it can automatically connect to the Chaos resource provider and inject faults right into your virtual machine, like inducing high CPU, consuming memory, consuming disk, or killing processes. It also supports service-based chaos, which uses the ARM APIs and data plane APIs of various resource providers to induce chaos on those specific services. For example, you can terminate virtual machines or shut them down and see how your site responds. If it's supposed to tolerate, for example, virtual machines going down and stay resilient, does it really do that?

Let's go take a look at Azure Chaos Studio in action. Here I've got a sample app, a drone delivery service application. If I make a drone delivery request and enter some bogus information, I get a tracking ID. If I go back to the website and enter that tracking ID, I see the drone moving from Redmond to Bellevue, where I'm going to pick it up. The back end for this is a Cosmos database with geo-replication across two regions: the write region is East US, and the read region is East US 2. What I want to do is cause a failover of Cosmos DB's read region from one region to another and see if the app still handles it. If I look at my Chaos Studio experiment, you can see the location where it's going to run is East US 2. You can see I've got steps in the experiment and I've got actions, and one of the actions is a continuous action that's going to continuously trigger a failover of the read region. I'll start that experiment now, triggering that read region failover. Going back now to look at Cosmos DB, you can see when I do a refresh that the failover is happening, because it says updating. If I do a refresh again, you can see that, sure enough, my read region has failed over from East US 2 to East US. Can the application continue to work? Well, if I go back to the tracker, let's see if the drone moves, and sure enough, it's moving. The application, as expected, was able to tolerate that Cosmos DB failover, as it was designed to.

Now, let's inject a fault that maybe it's not designed to tolerate. I'll go back to the Chaos Studio experiments and go to the kill-node-process experiment, which uses the agent-based approach, with the agent installed into the virtual machines here, which are running Linux, to inject a fault. Here in the steps, you can see the action is to leverage the Linux agent to kill the dotnet process, because this is a .NET Core app. If I start that experiment and go to Application Insights to look at my containers, I can see that the Fabrikam delivery container, which does the tracking, has been in a faulted state because the .NET Core process was terminated. You can see that I've now lost the ability to track the drone. The application isn't resilient to having that container fault, and that's where I might want to go add extra resiliency. It's a great way to leverage Chaos Studio to make sure your application can tolerate the known kinds of faults it should be designed for, as well as to stress it.
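As a toy illustration of what an agent-based "kill process" fault amounts to on the target VM, here is a minimal self-contained sketch using psutil. It is not the Chaos Studio agent, and the process name is simply the one from the demo.

```python
import psutil

def kill_process_fault(process_name: str = "dotnet") -> int:
    """Terminate every process whose name matches, returning how many were hit.

    A toy stand-in for an agent-based chaos fault: the point is to observe
    whether the application above it recovers (restarts, fails over) on its own.
    """
    killed = 0
    for proc in psutil.process_iter(["name"]):
        try:
            if proc.info["name"] == process_name:
                proc.terminate()       # SIGTERM; a harsher fault would use kill()
                killed += 1
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return killed

if __name__ == "__main__":
    print(f"terminated {kill_process_fault()} matching process(es)")
```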
Now, let's take a look inside of our compute infrastructure. One of the exciting innovations that we've had in Azure is the ability for you to deploy custom extensions into virtual machines and virtual machine scale sets. Our extension infrastructure allows you to do that in a safe, reliable way and to automatically update extensions. For example, a security extension: if you deploy it through extensions into your virtual machine scale sets, it can roll out gracefully across regions and, within a scale set, do a rolling update, checking for health signals as the update progresses, so you get a safe, reliable update of your security extensions, your security agents, across your infrastructure worldwide. We're taking that a step further now with something new called VM Applications. With VM Applications, you can deploy your main payload in that same safe way, using the same infrastructure that's powering virtual machine extensions. In the same way that you can store disk images in our Shared Image Gallery, now you can store applications in that shared gallery. That shared gallery can be a private gallery for just your enterprise or just a particular workload, or you can share it with the community across your enterprise or even publicly. That allows others to take your applications and deploy them into their virtual machines or virtual machine scale sets. Not only that, but the deployment can be based off of health signals and do rolling updates just the same way extensions can. One of the very cool features is that while you can leverage the guest agent inside your virtual machine to pull directly from the shared gallery into the virtual machine over the network, you can also leverage the on-host update mechanism, which comes through an endpoint locally on the server, to get that code into the virtual machine, meaning that you don't need an agent at all. Further, you don't need any network connectivity. If you've got a situation where you want to completely isolate those virtual machines off the network, you can still leverage VM Apps to get code into them. We're also innovating in the space of programming runtimes. How do you write the code that you want to deploy into [inaudible] virtual machines? Today, enterprise developers are being asked to take on more complexity than ever before. Many enterprise developers have been largely focused on business problems and gotten used to framing them in the context of website-plus-database architectures. Now they're being asked to create microservice-based architectures. They need to do that because they need to break up those monoliths, containerize them, scale them independently, and update them independently so that they can be more agile. They're also being asked, in many cases, to make sure that those applications are portable between their on-premises environment and the public cloud like Azure, even across clouds. That means learning SDKs for lots of different services. If they want to store some state in their application, now they've got to learn Cosmos DB for the public cloud, maybe, and Redis Cache or MongoDB or Cassandra for an on-premises deployment of that application. That means lots of boilerplate SDK code and lots of learning of things that developers don't really want to focus on. They want to focus on their business problem.
We've taken learnings from Azure Functions, where Azure Functions is really serverless: you write your piece of business logic, and the infrastructure takes care of everything, including, with Azure Functions bindings, connecting your code to external services in a very convenient way, plumbing inputs and outputs from your code to those other services so that you don't have to learn those SDKs, incorporate them, update them, or even authenticate to the services, because that's all done in the bindings themselves. Dapr builds on top of that by creating what are called building blocks, building blocks that take care of those mundane tasks that developers are being asked to take on, providing those capabilities through a local HTTP or gRPC endpoint, delivered as sidecar functionality, so that you as a developer don't have to leverage any SDKs at all, including the Dapr SDK; you simply make HTTP calls to talk to Dapr's sidecar, to Dapr's building blocks. Building blocks like pub/sub and state management have the concept of components that plug into those abstractions. So for the developer I mentioned, who wants Cosmos DB in the public cloud and MongoDB on-premises, they leverage the state store component from the state management building block for that particular environment: a Cosmos DB component in the public cloud, a Cassandra or MongoDB component on-premises. They don't have to change a line of code, they don't have to learn those SDKs, and Dapr takes care of everything for them. In fact, it also handles retries for them. Service-to-service invocation, secret management, pub/sub, and state management are just some of the key building block capabilities that Dapr has. To really see the power of Dapr and to really take advantage of it, you might have been waiting to find out: is this thing real? Is it production-ready? Well, I'm excited to announce that just a little while ago, Dapr reached its v1.0 production release. In fact, Dapr is also a completely open-source project. We've got open-source governance, and we're going to contribute it to a foundation. If you're an enterprise that wants to take advantage of the open-source ecosystem, with no lock-in and flexibility, you get all the benefits. Now is the time you can start taking advantage of Dapr. One of our close partners on our Dapr journey is the ZEISS Group. The ZEISS Group here is going to talk a little bit about how they are leveraging Dapr to make their cloud platform more cloud-native. ZEISS, an international technology leader in optics and optoelectronics, is replacing an existing monolithic application architecture with a cloud-native approach to order processing. An early adopter of Dapr, ZEISS uses the global reach of Azure and the integration of Dapr with Azure Kubernetes Service to fulfill orders faster for ZEISS customers. We had a chance to sit down with Kai Walter, Distinguished Technology Advisor for ZEISS Group, and hear how Dapr is helping them with their new cloud-native application development. >> Dapr makes developing a distributed application a commodity for us and helps us focus on the business value. We're building a highly available, globally distributed, and easily adaptable order processing platform for our eyeglass business. Dapr basically handles the service-to-service calls for us. It's the basis for our virtual actor model, where we keep business objects globally and in the regions. We also use it to abstract the platform services for all applications.
Especially the actors on top of a globally distributed, multi-master Cosmos DB helped us implement a scenario which would otherwise be complex or challenging to implement. >> Using Azure Kubernetes Service in combination with Dapr, the launch of the new order processing platform has given ZEISS the scalable and resilient architecture it needed to develop services and get them to market faster. ZEISS customers benefit from faster order fulfillment and timely notifications of progress, something the existing system couldn't do. >> To prove to you just how easy it is to take advantage of Dapr capabilities, let's go take a look at a couple of demos. First, I'm going to Dapr-ize an ASP.NET web app where I want to save some state in a state store without having to pull in an SDK or learn its complexities. The first thing I'm going to do is install the Dapr runtime, to show you that we have no dependencies on Kubernetes or containers; we can just install the Dapr runtime and CLI locally here for my app. Then I'm going to create a new routing endpoint for depositing and withdrawing amounts from an account database that this website is going to manage. You can see the HTTP POST withdraw; I added the HTTP route so that I can test this using curl, but Dapr is going to accept this too. If another microservice calls withdraw on this Dapr app, it would get invoked as well. But the core logic here is saving the updated state for that withdrawal, updating the account balance, and then saving it using Dapr, and that's all we had to do: reference the Dapr state store using the Dapr client. No messing with SDKs, no messing with credentials, no messing with retries. That's all we had to do, and the same thing happens with depositing amounts into that account. What we do have to do, though, is configure a state store. What I've done here is paste in a component for my state store where I'm leveraging Cosmos DB. You can see I've got a Cosmos DB key here, referenced from Key Vault using Dapr's secret management, so I don't have to worry about storing secrets in my code or config. Now I can do a dapr run by pasting that command line in, setting the Dapr port that it's listening on, and giving it a Dapr application ID. You can see it launching here. Now that the site is up and running, I'm going to switch terminals and do some curls to show you the state store being exercised. Look at the local URL on port 5000, the deposit route, passing in the JSON payload, which is the amount, and then withdrawing some money. If I deposit some money a couple of times, you can see I've increased the balance to 24. If I do the same thing with withdraw, you can see I'm decreasing the account balance. All I had to do in that ASP.NET app was just add the state store, and I immediately Dapr-ized that application by taking advantage of one of its building blocks.
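For reference, here is a minimal sketch of what talking to the Dapr state building block over plain HTTP looks like, along the lines of the deposit/withdraw demo just shown. It assumes the sidecar was started with its HTTP port on 3500 and a state store component named statestore; the account key and balance value are illustrative.

```python
import requests

# Assumes `dapr run --dapr-http-port 3500 ...` and a component named "statestore".
DAPR_STATE_URL = "http://localhost:3500/v1.0/state/statestore"

def save_balance(account_id: str, balance: int) -> None:
    # The state API takes a JSON array of key/value pairs.
    resp = requests.post(
        DAPR_STATE_URL, json=[{"key": account_id, "value": {"balance": balance}}]
    )
    resp.raise_for_status()

def get_balance(account_id: str) -> int:
    # GET /v1.0/state/<store>/<key> returns the stored value for that key.
    resp = requests.get(f"{DAPR_STATE_URL}/{account_id}")
    resp.raise_for_status()
    return resp.json()["balance"]

save_balance("account-1", 24)
print(get_balance("account-1"))  # -> 24
```

Because the calls go to the local sidecar, the same code keeps working whether the component behind statestore is Cosmos DB, Redis, or something else.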
Let me show you a more complicated example now. This is a Kubernetes-based app, one that you might be familiar with if you're a .NET programmer, because this is the eShopOnContainers reference application that shows you how to take .NET apps and run them on top of containers, on top of Kubernetes, on AKS. You can see here's my shop front, so I've got the app running. One of the things this app does is publish the orders to a pub/sub service, and it's configurable to support two pub/sub services: Service Bus and RabbitMQ. What I'm going to do is create a Dapr class here that encapsulates the pub/sub using Dapr's pub/sub building block, where I don't have to worry about SDKs or authentication, all those things that I mentioned before; I just leverage the pub/sub abstraction. You can see I've got this code here with the pub/sub name, pubsub. I've got the Dapr event bus, the publish method, the subscribe method. Those are automatically routed to pub/sub topics that I'm now going to configure. Here I can just add a few lines to use Cloud Events and map my endpoint subscriber; this is the final piece of wiring that I've got to do. You can see that the existing code has this RegisterEventBus, which checks the AzureServiceBusEnabled config to see whether it should use Service Bus. You can see all the Service Bus SDK code there, and beneath it is the RabbitMQ code. I'm just going to take all that existing code and delete it, because we don't need it, and add one call here to Dapr. I'm going to delete the RabbitMQ and Service Bus classes because I don't need them anymore. I've actually created negative lines of code, but I've made this code easier to maintain, and I've made it portable, not just to RabbitMQ and Service Bus, but to any pub/sub provider. The final thing I need to do is actually implement the component, and I'm going to paste in the component here for RabbitMQ; you can see there's a pub/sub component for RabbitMQ. Once I've got this, if I deploy the app with it, it's going to use RabbitMQ. But if I deploy it with this config instead, which is different by just a few lines, you can see that it's a Service Bus pub/sub component instead of RabbitMQ, with the Service Bus connection string instead of the RabbitMQ one, and then my app is going to be using Service Bus instead. Now it's not a development-time concern, it's a deployment and operations concern which pub/sub system I want to use. The app is portable between Azure and on-prem, or even between Azure and other cloud providers. Finally, to get Dapr into my Kubernetes app, I need to add some metadata to say that I want the Dapr API port to be port 80, specify my config, enable Dapr, and call this app the Basket API app. Lastly, I install Dapr onto my Kubernetes cluster with dapr init. Now I'm ready to deploy the app. In fact, I've got it up and running here, and now I've got my Dapr swag available to me: my cool Dapr hat, my Dapr stickers, and hoodies. Once I do a checkout, the really cool thing is that I deployed the Zipkin container as part of my Kubernetes install, which means that if I go to the Zipkin UI, I get monitoring of my orders without doing anything extra. If I run this query just to look at everything that's there and look at the Basket API, I can see exactly that order I just placed show up in the telemetry. Dapr supports monitoring with OpenTelemetry-compatible providers out of the box, roughly a dozen of them. I automatically get tracing and insights into my application without the developer having to write a single line of code to get it.
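Similarly, here is a minimal sketch of publishing an order event through the Dapr pub/sub building block over HTTP. The component name pubsub, the topic name orders, the port, and the payload fields are assumptions; the point is that the same call works whether the component behind that name is RabbitMQ or Service Bus.

```python
import requests

# Assumes a sidecar on port 3500 and a pub/sub component named "pubsub".
DAPR_PUBLISH_URL = "http://localhost:3500/v1.0/publish/pubsub/orders"

def publish_order(order_id: str, total: float) -> None:
    # The sidecar hands the event to whichever broker the component points at,
    # RabbitMQ in one config, Azure Service Bus in the other.
    resp = requests.post(DAPR_PUBLISH_URL, json={"orderId": order_id, "total": total})
    resp.raise_for_status()

publish_order("1234", 49.90)
```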
Now, if we take a look at the data protection technologies that are widely used today, you've got data-at-rest protection with bring-your-own-key or system-managed keys, and you've got protection of data on the wire with TLS, SSL, and other encryption technologies. What's been missing is protecting that data while it's loaded into a CPU and RAM, protecting it from the outside world, and that's what we call confidential computing: protecting data while it's in use. Right now, protecting data while it's in use might be the domain of customers with the most sensitive data, data they want protected from malicious admins on the box, from insiders, from compromises of the multi-tenancy infrastructure. They just want to make sure nobody but them can touch that data, nobody but the code they write can touch that data. But with the technologies and the investments we're making with hardware providers, as well as the software we're building in Azure, we want to bring this to the mainstream for everybody to take advantage of, including just as defense in depth for your everyday, low-risk enterprise applications. At the core of confidential computing is the concept of an enclave. That enclave, which you can see on the right, is where the trusted code is going to run, where the code that you're going to entrust with the data is going to execute. When you have the application split into non-confidential and confidential parts, or untrusted and trusted parts, the flow works by first creating the enclave from the untrusted part, then getting an attestation that the enclave is actually running the code it should be running and that the enclave comes from an enclave provider you trust: is the hardware provider creating this box for you one that you trust? Once you've established trust, you can share secrets with the code in the enclave that allow it to go and process the data that you want it to access. Handing it the secrets and saying, go run computation, analytics, machine learning on this data that's stored in Azure Storage or some other storage location, is the next step. We've taken this and integrated it into SQL Server, which you could run on-premises or in an IaaS VM in Azure, and we've now made it available in public preview as Always Encrypted with secure enclaves for Azure SQL Database. It leverages our DC series of SGX hardware with SGX enclaves. The idea here is that the SQL query processor now runs inside of a trusted enclave, an SGX enclave, and you can establish trust with that SQL query processor, know that it's running inside of an enclave, and then release secrets to it that allow it to decrypt encrypted parts of your database and perform rich computation on them. That's something not previously possible with SQL Always Encrypted, where the encryption and decryption happen on the client side. Now, with it on the server side, you get this rich query processing happening as part of the database transactions. Let's go take a look at that. Here I'm going to pull up my Contoso HR database, and you can see that the database server is leveraging the DC series of virtual machine, which enables confidential computing; it's an eight-core Cascade Lake Intel processor. When I run SQL Server Management Studio and do a query without giving the enclave the encryption keys to decrypt, you can see that I get back encrypted versions of the social security number and other sensitive information rather than the plain text. Now, one thing I've got to do is set up an attestation policy that says: only trust this enclave if it's an SGX enclave and it's not in debug mode. If that's the case, I'm going to release my customer-managed key to that SQL enclave to allow it to decrypt those columns.
When I go and configure that database, you can see here that I've got column encryption specified and an attestation protocol set up, so it's going to attest with the attestation service. Here's the attestation service API, so this is where it's going to hand proof to that attestation service that it really is running in an SGX enclave and that it's signed by Microsoft. At that point, my client app transparently, because this is built into the SQL client, goes and contacts SQL Server, gets the attestation, checks it, and is now able to perform those rich computations. If I go and take a look at what I would see on the wire if I were intercepting traffic, even bypassing the TLS, you'd see that the client is encrypting the query parameters, like the min salary and max salary I just had in those ranges, such that only the enclave, and only that particular enclave, because we established ephemeral keys with it, can decrypt those queries and get insight into what I'm actually looking for. That takes data protection to the next level by protecting it while it's in use from all of those various threats. Our final section is looking inside of our storage architecture and data services. When we start with storage, there's a growing number of storage services focused on specific types of storage, from disk storage to file storage, backup services, data transport services for importing data into and exporting it from the Azure public cloud, and hybrid storage solutions, and we're also working on future technologies that I'll get into in a second. In the Azure storage architecture, for the most part, the file and disk services, NFS v3 (which we were excited to announce recently), as well as the SMB and HDFS protocol heads, all sit on top of that same basic stack of components. What that means is that when we add features to this stack of components, all of those APIs take advantage of them, including high throughput, high-capacity storage accounts, and high-capacity object sizes. They also take advantage of the geo-replication capabilities across all these different APIs. You can see here that we create clustered groups. Those clustered groups are storage stamps in different regions, replicating data between them, and what that gives you is a DR capability on top of Azure Storage without you having to go and manage that replication state yourself. What I'm going to talk about now is one of the cool innovations we've got in a particular data service, and that is Azure Data Explorer. Azure Data Explorer is something you might be using on a daily basis, if you're using Azure, to go and look at insights into the way that your services are operating. It's a fantastic service, custom-designed for unstructured log storage and analytics. Being custom-designed, it can take unstructured, non-schematized data, index it with very high performance, store it very efficiently, and then perform queries on it at scale with very low latency and high performance. Now, you can see here the pipeline and how Azure Data Explorer fits in. The engine that you've been using is the Azure Data Explorer V2 engine, and V3, which I'm about to show you, has major enhancements in the way that it indexes, as well as in the way that it does scale-out sharding when you've got a cluster spread out: you can almost linearly scale out, adding servers and getting basically linear performance improvements in the throughput and scale of those queries.
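For reference, here is a small sketch of what querying a cluster like that from Python looks like with the azure-kusto-data client. The cluster URI, database, table, and column names are all placeholders rather than the ones from the demo; the query shape just mirrors the kind of filtered count described next.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster URI and database; auth reuses an existing `az login` session.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westus2.kusto.windows.net"
)
client = KustoClient(kcsb)

# Placeholder table and column names.
query = """
Logs
| where Timestamp between (datetime(2021-03-01) .. datetime(2021-03-02))
| where Level == 'Warning' and Message has 'enabled'
| count
"""

response = client.execute("mydatabase", query)
for row in response.primary_results[0]:
    print(row)  # a single row with the record count
```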
Let's go take a look at that V3 versus V2 comparison. Here I'm going to open two views into Azure Data Explorer, one with the V2 engine on the left and the V3 engine on the right, and you can see that I'm not going to play around with it; I'm going to throw some serious queries at it. The data that I've got is about 404 billion records, about 100 terabytes of data, partially structured. You can see that the databases for both clusters are the same. If I do a top on them and pull the first 10 records, you can see that they're the same records. Here you can see there's some schema, like the client ID and request columns, but the last column here is a message, which is completely unstructured, free-form text, different for every one of these records. That's extremely challenging for structured operational databases, because they're not great at indexing unstructured data the way something purpose-designed for log data is. Now I run this query, which looks at a particular time window across timestamps where the level column has warning in it and the message has the word enabled anywhere in it, and counts the number of records like that. You can see that out of that query, I pulled a few hundred million records that match, in just a few seconds, actually less than a second on the right side for the same roughly 250 million records. If I go take a look at the stats, you can see a total wall-clock time of close to 24 seconds, a CPU time of 18 seconds, and 228 megabytes peak per node on the left, and on the right, with the V3 engine, 144 megabytes. Drastic improvements in the efficiency of that query processor. Now, the last thing I'm going to show you is something that I think is one of the coolest aspects of this whole talk in terms of innovation. This chart from IDC talks about the digital universe, which is all the data generated. If we compare the digital universe, which is the top line, with the amount of data that we can store, the bottom line, you can see that the gap is continuing to widen. That doesn't mean we want to store and process all of that digital universe data, but by not having capacity that matches it closely, there's lots of data that we might want to analyze, understand, or get insights out of that we simply can't store. How do we go solve this problem? We've been focused on finding new storage technologies that can store ever-increasing amounts of data with more durability and for longer periods of time without having to be rewritten. We've had projects with Microsoft Research that look at storing data in glass with Project Silica, which serves as an archival storage replacement, and storing data in holographic storage with Project HSD, which is aimed at another type of hard disk that will overcome the limitations of hard disk storage and throughput capacity as we go forward. But one of the things that we've got as a project between Microsoft Research and the University of Washington is Project Palix, which looks to actually store data inside of DNA molecules. DNA molecules have the benefit that they're extremely durable; they last thousands to millions of years. In fact, I was just reading an article yesterday about woolly mammoth DNA that was discovered and is around a million years old. So, stored in the right conditions, DNA can last almost forever. But the other benefit is that it's extremely dense: it's seven orders of magnitude more dense than the next closest archival storage technology. We could store a zettabyte of data inside of a rack in a datacenter, something that would take many dozens, even hundreds, of datacenters to store today. Just amazingly promising. Now, how do we store data in DNA? Well, we leverage the natural way that DNA encodes information through its nucleotides, the bases A, C, G, and T. By synthesizing molecules that encode bit streams into the DNA, we can store the data, and then we can use standard DNA sequencing technologies to go read those DNA molecules out and get back the original data.
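To make the encoding idea concrete, here is a toy sketch that maps each two-bit pair of a byte stream onto one of the four bases and back. Real DNA storage adds strand addressing, redundancy, and error correction on top of this, none of which is shown here.

```python
# Toy mapping only: two bits per base, no error correction or strand addressing.
ENCODE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
DECODE = {base: bits for bits, base in ENCODE.items()}

def bytes_to_bases(data: bytes) -> str:
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):  # four 2-bit pairs per byte, high bits first
            bases.append(ENCODE[(byte >> shift) & 0b11])
    return "".join(bases)

def bases_to_bytes(strand: str) -> bytes:
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i : i + 4]:
            byte = (byte << 2) | DECODE[base]
        out.append(byte)
    return bytes(out)

payload = b"zip"
strand = bytes_to_bases(payload)
assert bases_to_bytes(strand) == payload
print(strand)  # "CTGGCGGCCTAA"
```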
The system we built with the University of Washington to do this is pictured here. On the left side, you can see where we take the data and synthesize it. Then you can see here where we do the storage prep, which is placing it into the storage devices that then go into those racks I referenced earlier. Then, if we want to read it back out, here's the sequencing part of the bench. Now, this isn't exactly what it would look like in a real datacenter once we get to production, of course, but it allows us to do those end-to-end tests, from synthesis to prepping it for storage to sequencing it back out, to make sure that this works fully. Let's go take a look at this in action. What I'm going to do here is take a few files. You might recognize these: Zero Day, Trojan Horse, and Rogue Code are some really great novels that I coincidentally wrote, cybersecurity thrillers, and we're going to encode those ePub files into a zip file and synthesize DNA out of that. Actually, I already did that yesterday, because it took a while. Let's go take a look at the results of that run by clicking on it, and you can see here the blob path, because this went into Azure Storage. If I take a look at that strands.txt in Azure Storage, I can see all of the DNA sequences that were produced, all the strands that that ePub zip file went into creating. What this actually does is synthesize multiple thousands of copies of each of these, because DNA is so compact, why not get that extra robustness and reliability by having multiple copies? That DNA we synthesized, I actually have a copy of it. It's right here in the tip of this vial. It's too small for us to really see, but it's there; they tell me it is, and I trust them. This is, I think, one of the really cool things about working at Microsoft: working with Microsoft Research, I've now had Zero Day encoded in silica glass, I've had it encoded in holographic storage with Project HSD, and now I've got my books encoded in DNA. I plan on going home tonight and reading them again. Now, if we go back and take a look at what happens on the sequencing side (that was the synthesis side), to make sure that the data really can be read back out, we can look at the decoding details here, at the decoded file, and see if we got back what we wanted and not some weird monster created from that DNA; of course, there's no possibility of that. And sure enough, here we get back the zip file with those three files in it. So end-to-end synthesis, storage, and sequencing of content, arbitrary content, including books. That brings me to the conclusion of this talk. I hope you found it useful and interesting, and that it excites you about Azure, the promise of the cloud, and some of the innovations we've been working on here in Azure. With that, I hope you have a great Ignite.
Info
Channel: Microsoft Azure
Views: 66,311
Rating: 4.9219713 out of 5
Keywords: microsoft ignite 2021, ms ignite 2021, Inside Azure Datacenter Architecture with Mark Russinovich | OD343, Azure, Session, Mark Russinovich, Aaron Crawfis, Azure datacenter architecture, Azure datacenter, Microsoft cloud, Datacenters, supercomputing clusters, DNA synthesis, liquid immersion cooling, datacenter architecture, Azure Resource Manager, Intelligent infrastructure, Azure innovation and Datacenters
Id: 69PrhWQorEM
Length: 90min 53sec (5453 seconds)
Published: Mon Mar 08 2021