[MUSIC] >>Hi, welcome to Inside
Azure Datacenter Architecture. My name is Mark Russinovich. I'm Chief Technology Officer and
Technical Fellow at Microsoft. This is a talk that I've been
giving now for several Ignites. Each time it's a brand new talk. This talk is no different,
lots of new demos. In fact, I've got a dozen
demos to show you that all highlight some of the latest
innovations across all of Azure. Whether you're new to Azure or you've been following
Azure for some time, including watching
these presentations, there's something here for you. In fact, here's the agenda
that I've got for you. I've got seven sections. I'm going to start by taking a look at our datacenter infrastructure, including the way that we design
them to go out across the world. I go in then into our
Intelligent Infrastructure. Intelligence means machine-learning,
how we're applying that for AIOps across lots of different services inside
of the infrastructure. I'll take a look at networking, including physical networking
architecture as well as the logical services
that run on top of it. I'll talk about our servers, highlighting some of
our largest servers, highlighting also some of our
smallest and coldest servers. Then I'll talk about
Azure Resource Manager, which is a universal
control plane to Azure. One of the cool innovations we've
got there is making it much easier for you to author Azure Resource Manager
templates using a new language. I'll go inside of Azure compute. I'll show you how we're leveraging confidential
computing to make it easier for you to protect
your data while it's in use. Then finally, I'll go into
Azure storage and data, showing you some of the
innovations and login analytics, and the way that we can store
data extremely efficiently with extreme high density for millennia without
having to touch it. It's an exciting show, and let's get started by
looking at our datacenters. Our datacenters, actually are
part of a spectrum of Azure, which we call the world's computer , that starts there on the right with our Azure Hyperscale Regions that you'll probably think
of when you think of Azure, that spread the spectrum all
the way down to the left, down to Azure Sphere
microcontroller units, MCUs, that are just
four megabytes in size, where you still have connectivity
and services consistent across this entire platform
which is our goal, connection, and be able to deploy and operate consistently throughout
that entire spectrum, including those in the middle
where it's our hardware like Azure Edge Zones as
well as hardware that's on your premises like Azure Private edge zones and Azure
Stack Hub and Azure Stack HCI, which is completely on hardware that you bring Azure Stack
HCI certified hardware. But this hyperscale
public Cloud regions, that's the big focus, that's the center of the Cloud. We've continued to build
that out over time. We now have over 65 Azure regions. We've taken Azure to
so many countries and in response to demand
from countries and businesses that want data
close to enterprises there as well as data on their sovereign ground
protected by their laws. Just last year alone, we've introduced nine new regions. Including Indonesia, our latest announcement
in February of this year. The state of Georgia,
New region there, Chile, bringing in a region
to South America, Denmark, Greece, Sweden, Taiwan, Australia, and then Arizona in United States
in September of last year. Within those regions, we've
got an architecture that is built on top of
availability zones. We've been working on availability
zones for many years. You've heard us introduce them many years ago and continue to build out availability
zone capabilities. The concept behind availability
zones is making it possible for you to have high
availability within a region. High availability meaning that you
can store data durably and have compute that can survive localized problems inside
of a physical datacenter, like a flood in the datacenter, a power outage in the
datacenter an HVAC, a cooling problem in the datacenter, but still be able to serve, computing data out of those remaining two
availability zones because we promised a minimum
of three with every region. With availability zones, like I mentioned, we've
been on a long journey, but this is a pivotal year for availability zones
because we're excited to announce that we're going to
have an A-Z in every country. A minimum of three
availability zones in every single country that Azure's in by the end
of this calendar year. Going forward, every
single new region that we announce will have
availability zones in it. Not only that, but when it comes
to you writing applications that take advantage of the resiliency that availability zones offer, we've now promised to have all foundational and
mainstream Azure services support AZs by the end of
the calendar year as well. Regardless of what
service you're using, if it's an a region with
availability zones, and availability zone has a problem, that service will
continue to operate. Hyperscale public Cloud regions is one example of our
datacenter designs, but we're taking datacenters and
Azure regions elsewhere too. If you take a look at our
modular datacenter designs, you can see here that we've got
a form factor that is portable. In fact, that box right there, it's designed to be
pulled on the back of a semi truck so it can
go almost anywhere, which makes it great for humanitarian disaster relief
or if you want to forward operation center or taking it to an edge site for temporary
base of operations. This is designed to be fully remote, so it works in fully
disconnected modes as well as partially disconnected. It's got a SATCOM module
built into it so that it can communicate via satellite if it doesn't have
a wired connection. It's got a high availability
module built into it also for UPS. The servers are on shock absorber so it can tolerate the
transport as well, and it's got HVAC
built right into it. It's fully self-contained Azure inside of a
tractor-trailer basically. But all of these regions
that we've talked about, add up to tens, hundreds of megawatts, even
gigawatts of power consumption. One of the top concerns for us, for our customers, for governments is taking
care of the environment. Azure has a big focus on sustainability efforts
that are part of Microsoft's overall commitment
to the environment. There's a number of
different ways that Azure is contributing to our
sustainability efforts. One example is in the procurement of renewable energy and powering our datacenters with renewable
energy wherever possible, Microsoft has made
some of the largest renewable energy deals on the planet. Including one that we
recently made here in the state of Virginia
for solar power, which adds up for a
total of 300 megawatts. This project is well underway with 75 megawatts coming
online in last October, another 225 megawatts of solar
power coming online by the summer. We also announced that
in Denmark region, it's going to be 100
percent renewable energy. You're going to see
that pretty much trend going forward with our
new region announcements, but it doesn't just stop at
renewable energy efforts, it's also how efficient our
datacenters are at consuming energy. There's an industry
standard term called PUE, which measures datacenter efficiency. Basically, the higher the number, the more inefficient it is. A typical IT Datacenter has
a PUE of between 1.8 and 2. Microsoft's datacenters, just like traditional IT datacenters was around that number in the late 2000s. We've continued to invest in new datacenter designs
that it more efficient, minimizing the amount of components, streamlining the power from the connection to the
datacenter all the way to the server so that we get as close to one as possible,
meaning perfect efficiency. Our latest data is datacenter
designs called Ballard, have a datacenter
efficiency or a PUE of 1.18, the industry leading. It also, our energy, our sustainability efforts go into; how do we not just be friendly to
the environment going forward, but how do we help
improve the environment? As part of that, we issued a
request for proposals in July of 2020 to procure one million metric
tons of carbon removal in 2021. In response to that RFP, we received 189 projects
in 40 countries, and we've already bought 1.3
million tons of carbon removal. One of the projects that
we have underway is one with the Swiss company
called Climeworks, that Microsoft is invested in. Together, we're going
to permanently remove 1,400 metric tons of
carbon this year. That carbon is going to be put back for useful
purposes in many cases, including going back into
synthetic fuel production, used in greenhouse agriculture, going into carbonated beverages
or even permanently stored underground in volcanic rock
using a mineralization process. Let's turn our
attention now to taking a look at the way that we
operate our infrastructure. We call it the intelligent
infrastructure, just like we've got intelligent
Cloud and intelligent Edge, we actually apply
machine learning and intelligence to everything
that we're doing. This is part of our broader efforts on raising the reliability of Azure, and continuously improving it. If you're interested in some of the things I'm going to
be talking about here, you might be interested in
the blog post series that I started about a couple of years ago, the Microsoft Azure
Reliability blog series, you can see there's the
first post from that series, which covers some of these topics. Now, AIOps, the places
that we apply it, meaning using machine
learning to look for signals, anomalies,
autocommunication. If there's an issue, we'll first detect it using AIOps and then we'll automatically
communicate it to just the impacted customers
so that they're aware that something's
happening using AIOps, That's a great example
of where we simply cannot provide the
time to notify that customers expect if we're
waiting on humans to make an assessment and then go figure out who they
should communicate it to. Instead, we rely on automated systems to go and try to perform that
as much as possible. We also will do root cause analysis. Once we do have an issue, we can root cause it by having AIOps look at all the signals and
point us in the right direction, that can help us
resolve an issue more quickly or even if it's
not impacting customers to figure out how we can
prevent those issues from surfacing again and
potentially impacting them. At the heart of our AIOps is a system we call brain,
you might expect that. Brain puts all these
signals together, it puts them all together using
machine learning algorithms to figure out where the highest
fidelity signals are, and looking at correlations of signals together so
that it can perform all these operations
that we've been talking about on top of one platform. Now, we also apply
AIOps in another place, one of the other places
that we apply it to is for efficiently operating
our infrastructure and providing a better
customer experience. The system that we built here
is called Resource Central, and it's actually one that
we've publicly talked about. We've got published
academic papers that go into how Resource
Central actually works. At the heart of Resource Central
is machine learning training. Machine learning training that is offline by taking all
of the signals from our production services
that are related to power management for spot price, VM eviction, for more efficiently packing
virtual machines together. Coming up with the train
models and then pushing those train models out to this
Resource Central service, which then acts as part of our allocation system and our
infrastructure control system. This is a continuous feedback loop, as the system continues to evolve, as customer behavior patterns change, as we introduce new
services and capabilities, Resource Central just get smarter
and gets applied more broadly. But another great place that we're
applying Resource Central is in minimizing customer impact towards failures that we think are imminent. Failures that are
imminent would include ones where we're getting signals
from the hardware in our servers, that it's starting to produce errors. Those errors might be
correctable at this point, or there might be a
failure of a component, that component is not impacting any virtual machines
but signals that, that server is going to be moving
likely in a degraded state. We're using Resource Central
to take those signals, create these models of how a
server is going to perform. How likely is it going to fail? What's the timeline before it fails? Then also figuring out the
best ways to respond to that. In some cases, it just
means resetting the server, it could be a software-induced
failure and we might need to do what's
called a Kernel Soft Reboot, which means just resetting the
hypervisor environment but leaving the virtual machines preserved
in place through that process. Resource Central and through Project Narya has been operating now for about
a year in production. As a result of it, applying this intelligence to
prediction and the best way to mitigate the impact is we've seen a 26 percent decrease
in VM interruptions, meaning some software
or hardware problem on the server is causing an
impact to virtual machines, so already a huge impact and we're just getting started
with this capability. We're also applying
intelligence to the way that we operate our
datacenters themselves. If you take a look at a
datacenter architecture, besides the physical infrastructure
that I've been talking about, there's humans involved
in the process. There's maintenance activities
that have to take place, maintenance activities that if done improperly could impact
production services. For example, if you're taking out a power feed that's redundant for maintenance but the one that is redundant for is offline
because of a problem, you've just caused an outage. We're applying intelligence
to datacenter operations, again looking at all the signals together but also keeping track of these performance and availability of the datacenter so
that we can make sure that maintenance operations
don't impact customers, and that other operations, like increasing the datacenter
infrastructure footprint also don't impact customers because they're done
in opportune times. To do this looking holistically
at the datacenter, understanding datacenter
impacts, where's this breaker? What is going to be impacting? If we may perform
maintenance on this, what other correlated components do we want to make sure are up and running when we do that maintenance? We're modeling our datacenters
using Azure Digital Twins. Azure Digital Twins is one of our newest services
in the IoT category, where digital twins
allow you to create a full model of a physical environment as well
as a virtual environment. Modeling things like the devices
themselves and then modeling logical abstractions on top of
them as well in a graph form, where the graph is really reactive. What that means is
that you can pump data into those digital twins
from the live environment, you can perform simulations. What if I did this
what would it impact? Because the graph is representing those connections between
those different resources. You can also invoke external
business process using Azure Digital Twins
like calling out to Logic Apps workflow
and Azure Function. Some digital twin, modeling some particular component
moves into a particular state, go trigger this workflow
and that workflow could be getting datacenter
operations involved to go take a look at a
problem or kicking off an automated workflow
that's going to go mitigate or prevent some
problem from happening. Let's take a closer look at that digital twins environment that
we set up for our datacenters. Here, I'm going to open our datacenter digital twin
environment viewer focused on anomaly views. You can see a timeline
at the top that lets us pick a time range during which we want to look for anomalies that
might represent real problems. I'm going to select
about an hour window, I'm going to slide it over part of the timeline that I know that
there were some anomalies. Specifically two breakers here that show as potentially having an outage, I can also see that there's a number of other devices that also
have potential anomalies. I can use this data selector to select any of them
and I'm going to pick those two outage breakers and then look at their digital
twins in the graph viewer, I'm going to select one of them, and what that does is
automatically connects using Azure Digital Twins to
Time Series Insights, our data stream of the power
coming off of that breaker. You can see sure enough
there was a dip in the power over a period of that
time in that one hour window. I can see here at the properties
of that digital twin, I can see for example the state that there was a
power drop at that point, so the state of that digital
twin switched to drop. I can also see that it was
connected to this other one. In fact, if I select
the other breaker, I see the connection in the digital
twin graph between the two, they're pointing at each other which tells me that they were dundant. But when I select the redundant one, I see that it also has an
outage during that time window. That means both of those breakers lost power for some period of time, and that could have had downstream
impact that I want to explore. I can look beyond just the immediate
graph connections to a four-level steep to see
what else was impacted. You can see a number
of other devices and maybe logical components that were impacted by that
particular outage, and that lets me go explore,
understand the impact. If we're looking at this
historically in a postmortem, we can understand how this datacenter behaved and
figure out ways to prevent the same kind of problems from potentially impacting
customers in the future. Let's now talk a little
bit about Azure Networking. Azure Networking consists of a
bunch of different capabilities, a physical infrastructure, and services on top of that
physical infrastructure. What I'm going to do is take a look from the servers and the racks, where you can see that we've got, for example, 50-gigabit SmartNIC, meaning FPGA accelerated network
adapters attached to our servers. Going up to 200-gigabit
Software Defined Appliances, our Top of Rack routers, Accelerated Networking services
that leverage that FPGA as well as Container
Services called SWIFT that I've talked about in
previous presentations. Going into our DC scoped hardware, which includes roughly
one millions of fiber cable that we deploy within one of our
hyperscale Cloud regions. Services like Azure Firewall,
Azure DDos protection, that gives your websites
protection for DDos attacks, our Load Balancing services. Into our inter-region network, where we've got something called
regional network gateways, Just like for availability
zones where we have redundant, fully isolated datacenters on
which to run compute capacity, we also make the network
completely redundant, including the physical
infrastructure. There's two RNGs at least in every region so that
if one of them fails, we still have connectivity between those availability zones
as well as out to the WAN. This RNG architecture which goes in T-shirt sizes from 28 megawatts
of capacity in the region up to 528 megawatts really gives
us a ton of flexibility with minimal network hardware connecting those zones together
and out to the WAN. We have to meet our
latency boundaries. That means that all of this
infrastructure to meet that two-millisecond
inter-region latency envelope, has to be within a 100
kilometers of each other or so. That's one of the many dozens of factors that determine where we place availability zones and the RNGs inside an area where we're creating
a hyperscale Cloud region. Then that takes us out to the WAN. Microsoft's WAN's one of
the largest in the world. We've got a 130,000 kilometers
of fiber and subsea cable. We've got over 300 terabytes that we've added to
the WAN just in 2020. We saw there's huge surge of
work from home, learn from home. As countries went to lockdown, which caused us to expand our
network capacity to support all that shift of activity out
from traditional enterprises, out to the edges, and the connectivity and increasing
demand for Cloud services as well. We also tripled our
transatlantic cable capacity. I talked about that in my
presentation last summer on how Microsoft reacted
to the COVID demands. Then finally, last-mile connectivity. We've got over a 180 edge
sites and continue to grow. What that means is that when
there's an edge site and your network traffic is aimed at
Microsoft or an Azure service, it enters our backbone right at that edge site enters our
dark fiber WAN network, so you get the highest
performance and consistency of performance
into our network as possible. The closer you can enter, the more that you're in
our network and under our control where we can provide
the best quality of service. We've had a 100 percent growth
in peering capacity again, as part of that expansion that we had in response to the
COVID demand spikes. Some of the services
that we've got there include Express Route 100 gigabit, which allows you to connect your
own enterprise networks into Azure immediately from your edge
of your enterprise network. Basically entering our
dark fiber backbone at a 100 gigabits per second
of network capacity. We also have our CDN services
that you can leverage across all of those
peering and edge sites. Now one of the ways that the network, not just Microsoft's network, but the world's network
is programmed is with something called
Border Gateway Protocol. As part of our focus on making
our networks more resilient, not just at the physical
infrastructure level, we've also been focusing
on making it more resilient at the
logical level as well. Border Gateway Protocol is
the protocol that is used to advertise the routes from one
IP address to another one. How do I get this packet
over to the server, which could be sitting inside
of an Azure Datacenter? In most cases, the BGP
routers along that path, advertise the correct
path to the destination. As the packet arrives in each point, it gets routed in the correct direction and eventually makes it to that destination server. But it's possible with
the way that BGP has been architected and
used up to very recently for bad actors to miss route packets
by advertising false routes, or for an operator to make a
mistake, and leak a route. Leaking a route is what happens
when they advertise a broad set of IP addresses that should be routed
in a certain direction that includes routes that
shouldn't be part of that, and that will route
legitimate traffic from its destination to get
blackholed by accident. We've been working with a bunch of different companies to
focus on improving BGP, and this problem isn't
just theoretical, it's actually a real problem. For example, a couple of years
ago you might have heard of a Cryptocurrency Heist
where myetherwallet.com's traffic was redirected
because somebody deliberately misadvertised routes
associated with DNS Route 53, addresses related to
myetherwallet.com, causing traffic that's going to myetherwallet.com to end
up in servers in Russia. Where those Russian attackers, if the user clicked
through the warnings, they got in their browser
and authenticated, the attacker got their login
credentials to the EtherWallets, able to steal their ethereum, and then basically abscond with it. Estimates are that they
made anywhere between 50,000 and a few $100,000
off of this heist. As part of our working with the
industry consortium back starting in 2019 when we joined
the Mom's project, which is focused on
reliably broadcasting and advertising these BGP routes
using PKI or signature signing, where Microsoft for its
routes would sign them with Microsoft signatures and
other BGP routers would look for Microsoft signatures on
any Microsoft to advertise routes has actually gone into effect. We started signing all over routes
earlier in January of 2020, and you can see that going
back to January 2020, that the hijacks that we saw in
our network went to zero where we've got over a 149,000 routes
now signed for Azure services, well all of Microsoft Services making us have the most signed routes of
any organization in the world. But Azure Datacenter infrastructure, I've been talking a lot about the WAN and physical infrastructure
on the ground, but Azure has actually gone
to space with Azure Orbital. Azure Orbital is really
effectively ground station as a service where those ground stations are
connected to satellites. Azure Orbital's all
about how do you connect satellites into Azure Datacenters. There's two kinds of
capabilities that Azure Ground Station as a service
or Azure Orbital provides. One of them is Earth observation. With Earth observation,
the idea is that you've got satellites that are
taking images of the Earth, of the atmosphere, of the ground, of water, of pollution, and you want to perform analytics and Machine
Learning on that type of data so that we can
understand how climate is changing or how the
environment's changing. What we can do to improve and prevent catastrophic
problems from happening. The best place therefore is to
get that data directly into Azure Data Lake Storage Gen2, and then get that process with Azure Synapse Analytics that I'll
talk about a little bit later. Azure Orbital is
focused on that aspect, but it's also focused
on communications, specifically IoT
communications, where you've got devices that
are in remote locations, maybe even the modular datacenter
I talked about earlier, where you don't have a
ground-based connection into Azure Datacenters and you want to leverage
satellite connectivity. This one is where we
work with a host of satellite partners that then
register their satellites with Azure and allow customers then to leverage those satellites
for communication from an edge device up through
satellites and down into Azure Datacenters to talk
to Azure services remotely. Let's go take a look
at that in action. I've got a really cool
demo here to show you. Where I'm going to actually show
you a network setup here that's connected to Azure Orbital that
starts with a Virtual Network. Pulling up the Azure Portal
here you can see I've got a Virtual Network with a bunch
of network interfaces on it. If I go, I can see that one of those network interfaces
has a public IP address, 52.150.50, you can see we're
going to come back to that later. You can also see that I've
got a storage account as part of that resource
group that's connected to that Virtual Network that blocks access from anything
except for that subnet. That means that when I go try to click on the
container because I'm not accessing that
subnet from the portal, I get access denied. If I go try to access that container in Azure
Storage from a browser, I get access denied because the
IP address the browser is coming from isn't in that subnet,
in that Virtual Network. However, I've got right here, a satellite up-link here that I'm
going to connect my phone to, and that's going to let me join that Virtual Network
here, from my phone. When I take my phone out of "Airplane Mod" you can see that it's connecting now to the Orbital Network
that I've got configured here. When I go to "Settings"
you can see that, sure enough, my network selection
is the Orbital Network. That Orbital Network is
connected through that device, and because it's configured
to be part of that subnet, you can see that IP address that
I show up as publicly is in fact that same IP address we saw
attached to that network interface, that 52 dot IP address. That means that because it's
part of that Virtual Network coming from that IP address,
that network interface, I can, sure enough, access that
same Azure Storage account from space through a
satellite here from my phone. Speaking of virtual
networks and subnets. One of the big challenges that we've heard from our
customers is they get bigger and bigger deployments in Azure is a sprawl of
virtual networks. Where those virtual networks aren't designed to
operate in isolation. They have applications in
one virtual network that need to talk to services
in another one or virtual machines and a third
one and up to this point, they've had to set up complex peering relationships between all of those virtual networks. You can imagine if you've got
a dozen virtual networks, how many peering
relationships you have to create to allow them all to connect together and you've got to create those connections every time a new virtual network joins
and you want to add it. Now I have to setup that many number
of new peering relationships. To make management at large scale
of virtual networks much easier, we're creating centralized
network management. In last Ignite, I showed you
centralize network management, showing you how to set up connectivity relationships
between networks very easily where you can tag
your virtual networks with a tag and with that tag, automatically they get joined
in hub and spoke or peer to peer relationships through Azure,
centralized network management. So you don't have to go manually set up in a peering relationships. It's simply tagging something provides you the connectivity of that virtual network
that you're looking for. Now I'm going to talk about a new capability that we're introducing, which is related to
security management. Where you want to provide
security policies that apply to a collection of virtual networks and centralized network management
let's you do that too. Let's go take a look at that. So here I'm going to pull up the
Azure portal again and I've got two virtual machines here in remote desktop connection manager
and users internals tool. You can see that one of those virtual machine is able to
access bing.com just fine and that's because it's got no
network security group rules attached to the virtual
network it's part of. When I go into the
Central Network Manager, you can see that I've got a
network group here called Spoke, and that's the one that I created last Ignite that allows my virtual networks to connect
together in hub and spoke. If I go to the connectivity
configuration, that's where I've got
that policy set up. But when I go to the security, that's where you can see I've got
a security configuration that I can go and create a rule for that blocks outbound
traffic to the web. So I'm going to call this block web. Say give it a priority, say deny, say outbound, any protocol. Destination port 80 and 443 which
would be web addresses and I'm going to select my
Spoke network group and save that configuration. Now that I've set up
that configuration with that rule in it, I
need to go deploy it, so I go to deployment and you can see that I'm going to deploy
that configuration here. Security, I'm going to deploy
that particular configuration. I target all of the regions
that I've got virtual networks in and then apply and within
a few seconds that's applied. Now if I go back to that
same virtual machine in one of those regions and
try to access Microsoft.com, you can see that it's
unable to access it. If I go to another virtual
machine in a different region that it's part of that same
network security group, you can see that I'm also
not able to access bing or Microsoft.com because neither of those were cached on
the local machine. So sure enough, I'm able to now
perform at scale management of network connectivity as well as network security across dozens
or hundreds of virtual network. Something that's been very
onerous up into this point. Now let's go take a look at inside of our servers and our
server infrastructure. One of the ways that
I find really fun to look at the evolution of
Cloud servers over time, is to look at our high
memory skew evolution. Back in 2014, we introduced something that we internally
called the Godzilla skew. We called it the Godzilla
skew because it was the largest Virtual Machine in
the public Cloud at the time. It had 512 gigabytes of RAM in it. It had 32 cores on it. It had a, you can see 9, 800 gigabyte SSDs on it. So just a monster machine for
back in it's day in 2014. But we've evolved so quickly in the evolution of
hardware in the Cloud, driven higher and higher by
in-memory databases like SAP 4 HANA, by our customers migrating
their SAP workloads, asking for larger workloads. As far as the general
evolution of hardware, you can see that our general
purpose servers now are rivaling or beating what
Godzilla was just six years ago. So these are the servers that
we're deploying at scale, our DS series virtual machines on top of it and our F
series virtual machines. You can see we've got Intel
and AMD lines of servers. They've got more RAM
now than Godzilla had six years ago and they've
got more cores on them, the same or more cores
on them as well. But when it comes to those
SAP 4 HANA workloads, this isn't enough these days. We've been pushed ever
higher and you can see our Beast server that we introduced in 2017 that
had four terabytes of RAM. Might think four terabytes is
enough RAM for SAP 4 HANA, well, we had customers bringing even larger SAP
deployments into Azure, wanting larger servers and
so we introduced BeastV2, and you can see BeastV2
we introduced in 2019. This one has 12 terabytes of ram. You might think that's enough
for SAP 4 HANA workloads, but no, we've been asked to
get even larger server sizes. Last night I talked about
mega Godzilla beast. Mega Godzilla beast here has 24 terabytes of RAM and
it has 448 cores on it. I expect in six years we'll be thinking this thing
is relatively small, but right now it seems really big and here's a
picture inside of it. This has a 192, 128 gigabyte dims in it, so it's basically packed with dims. You can see they're dim slots. But I thought it'd be fun to show you a little demo of what
this thing can do. Of course, it can run
Notepad really, really fast. But I decided to have some fun with it and I think you
might appreciate this. Here I'm logged in to
omega Godzilla B server. You can see looking at
Task Manager that it's got 420 cores of those 448 available to the
virtual machine and it's got 22 terabytes of that
24 available to it. If we go back to the
test manager, CPV, you can see there's enough
pixels there that I thought I could have
some fun with it. There were some videos
going around last summer of somebody doing things
with task manager, animations and games and I
thought I'd do it for real. So I wrote a little program here that takes in a bitmap and then by pinning CPU activity to particular cores
in response to the bitmap, I can actually show bitmaps right on Task Manager and here you can
see a scrolling Azure logo. This is actually available in my GitHub if you'd like
to go take a look at this program and that's
the first thing I did. But I got a little bit obsessed
as I was playing with this over the holidays and decided to
get some games working on it. One of my favorite games
in college was Tetris. Here you can see I've taken a console Tetris game and
I'm here playing it right on mega Godzilla Beast in Task
Manager by manipulating the CPU. Land another block here and then I've posted
videos of this on Twitter, but I thought I'd do something
new for this Ignite presentation. So I took another one of my
favorite games, Breakout, a console version of that and also integrated it and here I'm playing Breakout now on Task Manager
on the mega Godzilla Beast. Say hit the ball and
hopefully this takes out a few bricks up there
and sure enough, dead. So some amazing things you can do with a machine that costs
millions of dollars. Some of the other ways that we're pushing hardware in our
datacenters is with AI HPC infrastructure and one of
our great partners is NVIDIA. NVIDIA has got a new GPU that they're coming out
with called the A100 ampere GPU and this is the most powerful machine
learning GPU ever. You can see that we've
got a variety of different ways you
can leverage the A100 in Azure's datacenter on top of our new NDv4 type virtual machines. You can either use them as
a single GPU or you can use eight GPUs at a
time connected with NVIDIAs NV link protocol
on a single-server. But something very unique to Azure is that we have a
version of the NDv4, the full server version
that leverages 8, 200 gigabit HDRInfiniBand
connections for a total of 1.6 terabits of
back-end network connectivity between our A100 servers to allow you to connect many hundreds
or thousands of them together and run
large-scale distributed machine learning training
or HPC jobs on top of it. Let's go take a look at what that InfiniBand
back-end network gives you, that the front end network, if you're just
leveraging that doesn't. You can see here I'm connected
to two Azure virtual machines. The left one is running an A100
GPU and so is the right one, and they're both connected to four server clusters
with A100 GPU is. They're both running the same CPUs, AMD Rome systems with 96 cores, with AMD EPYC processors. If I run the NVIDIA
SMI utility on them, you'll see that they've got the eight A100 GPUs attached to
them with 40 gigabytes of HDR2, or HBM2 memory connected to them. The difference though is that on
the right one, like I mentioned, this is an InfiniBand
back-end network that we've connected to it or
that we're leveraging. It's got the eight 200 gigabit
InfiniBand HDR2 adapters on top of it that connect those
four virtual machines together. Now, if I look at the machine
learning training run, I'm going to execute
on these clusters. You can see that the only difference between the two is that I've disabled InfiniBand on the one on
the left because we're not leveraging the InfiniBand
network connections, we're just leveraging
the front-end network to connect these jobs together. We can see that I've got
a batch size of eight, I've got to warm up of eight, and I've got 128 iterations or samples that I'm going to be
running through this training job. This is actually the GPT2 model from open AI that I'm actually
going to be training here. Let's go time that
on the two systems, the key difference
again is InfiniBand connecting for distributed
training efficiency. We can see already as this thing gets underway here and iteration start completing on the right
that it's taken me a little over 300
milliseconds per batch. On the right and on the left, here you can see 2,000 milliseconds, about two seconds on the left. That's the right is about seven times faster because
it's leveraging that 1.6 terabits of bidirectional
connectivity between those servers that the
left doesn't have access to. As it's exchanging
weights across them, it runs into that inefficiency. The A100 really represent trend
that had been seeing and so does the high memory skews and the number of cores that we see on the
general-purpose skews, trends in the datacenter of more and more power
consumption per server. How do we cool the increasing demand for concentrated compute power? How do we pack and leveraged the floor space on our
datacenters more efficiently. Because if we're air cooling them, which is the way that we're
cooling these datacenters today, we've got to leave hot air aisles
and cool air aisles and have huge H facts systems
that are pumping air into and out of the datacenter if
it's not adiabatically cooled. Large overheads and
lots of floor space that is just wasted for moving
air in and out of those servers. We've been investigating
now how can we achieve better cooling efficiency and liquid cooling is what
we've been focusing on. There's a type of
liquid cooling called cold plate cooling that
you might be using. I'm using it in my home system
when I play my video games. I'm cold plate cooling the GPUs and CPUs on my system so I can run
them at very high clock rates. That is a great way to cool, but it's got the downside
that every single server has to be custom fitted for the pipes in the cold plates on top of that internal
server infrastructure. Meaning that it's not a
one size fits all model. It comes with all of
that overhead of getting all those cables into
and out of the servers. We've been exploring other ways to cool the servers
beyond cold plate. One of the ways is with
single-phase immersion where we take a liquid and we just set the server main
board right inside the liquid. Liquid is extremely efficient, especially newer types of liquids at cooling those
high-performance servers. But what we've locked on
as likely where we're heading in our
datacenters is the most promising and that is
two-phase immersion. In fact, we've made a ton
of progress down this path. I want to give you an insight
of where we're heading by showing you some of the prototype
work that we've got underway. [MUSIC] Start test, wrap up
the CPU utilization 100 percent on all the course and
we get this nice bubbly effect. As the fluid boils, it evaporates. As it evaporates, when the
vapors hit the condensers, fluid's cooled down and
it becomes fluid again. Welcome to the liquid cooling lab. Liquid cooling is bringing the
liquid closer to the chip, either by circulating water through
a cold plate to the chip or either by dunking all of the IT into an immersion
dielectric medium. The reason liquid cooling
is important right now is because the demand on
higher performance chips, higher speed or core count,
continues to increase. This has been resulting in higher power chips and
higher fluxes on the chip, which could be sometimes
challenging to air cool and we require liquid cooling
to do that job for us. Liquid cooling affects
the whole ecosystem. When you take a look at the
datacenter and the server and the sustainability promise
that Microsoft is making, liquid cooling can help
us get there faster. With liquid cooling we can
have higher density racks in IT or tanks that could lead to
smaller datacenter footprints, lower their center energy consumption from the mechanical
cooling perspective, and also from the server perspective. Because we could reduce or remove
the fans from the servers. We can reduce the leakage current
from the chips and because liquid utilizes closed
loops of warm water, maybe counter-intuitive, but we actually eliminate the use of water. We don't need to use
evaporative cooling anymore because we're always going to operate in loops that are hotter than the ambient temperature.
It is pretty awesome. It is amazing to work with state of the art hardware and
state of the art cooling. We are seeing that the
trend of chip power, whether it's compute chip or
a AI chip is only going up. With time, liquid cooling is going to become more and more important. Liquid cooling probably
seems pretty exotic, but it's nothing
compared to how exotic the next frontier of computing
is quantum computing, something that we've been working
on for a couple decades now. Microsoft's approach
to quantum computing saw a key milestone this year with the general availability
of Azure Quantum. Azure Quantum really represents the full top to bottom
approach that we're taking. At the very top, we're creating
programming languages Q#, that allows developers to write programs to take advantage
of quantum computers. We're creating integrations with
Visual Studio Code to make it easy to develop and
run those programs. We're creating Katas that allow you to learn about quantum
computing programming. We're creating simulators that allow you to simulate your algorithms, both on your local machine
as willing as in Azure. We're also working
with partners across the quantum computing
industry, including Toshiba, for example, that has their own quantum
optimization program that you can sign up for it
using Azure Quantum. We also have been working on our own quantum-inspired
optimization. Trying to bring the innovations
of quantum computing and the unique computational capabilities of quantum computing into
classical computing today. Then finally, at the quantum hardware level or hardware partners like quantum IONQ and QCI now have their hardware available
through Azure Quantum, where you can deploy
quantum computing programs directly onto that quantum hardware. But we're also working on
our own quantum computer. One of the just amazing challenges of quantum computation is the
fact that for those qubits, those bits that store that information that you
do that quantum processing, to be stable, they've got to be
at extremely low temperatures. Now how low? Well,
colder than space low. You can see on this
chart right here that a quantum computer is running
just at a few millikelvin. Because the closer you can get quantum control to the quantum plane, that quantum domain of
just a few millikelvin. The more efficient you can be because every degree of
temperature means that your dissipating heat and
potentially perturbing the quantum computation
as you get data into and out of the quantum computer. We've been focused on
material science engineering to get quantum computation that is close to the quantum
plane as possible. We've had some amazing breakthroughs on solving one of the
problems of quantum control, which is when you're not close
to the quantum complain. Like you can see in this diagram, this is a picture of the cryogenic refrigerator
where at the very bottom of it, at those few millikelvin is where the 54 cubic quantum
computer is located, you can see all those
wires are coming from room temperature computers that
are controlling those qubits. Hundreds of them to
support just 54 qubits, a ton of complexity, a ton
of heat, a ton of power. By focusing on how we can create computation down at
that quantum plane, we can eliminate all of
that complexity by putting the computation right inside the fridge next to
the quantum computer. This project, cryo-CMOS, is a
project that we call gooseberry. You can see here how it fits into the overall architecture
on this diagram, where you can see the
quantum plane at the bottom, you can see the cryo-CMOS
control computer, which will operate as
close to that as possible. Controlling those qubits and
reading information out of them. Then providing that information, data in and readings out to computers running it at
classical temperatures. Here's a picture of that
gooseberry processor. You can see here, it's right
next to the quantum computer, which is using qubits based off of our topological
cubit technology, which we think is the
most promising technology to allow scalable
quantum computation. In fact, we think it is
the only viable path to large-scale quantum computation
where you're talking qubits in the millions of qubits. Where on other types
of qubit technology, you'd need to be able to
store and manage qubits across a quantum computer that's
the size of a conference room. Here, we can run millions of qubits on a small wafer at the very bottom
of that quantum refrigerator. You can see gooseberry sitting right there with wires
connecting directly into that quantum computer to control those qubits using
quantum dot technology. What's the difference between
running outside the fridge at room temperature and running inside the fridge next
to the quantum computer? It's this, you can see just
three wires coming out of the gooseberry processor into the real-world to get data and
readings into and out of it. That is like a major breakthrough. In fact, we're also working on
creating CMOS technologies where you can take existing
CMOS technologies and run them at a few degrees kelvin. Gooseberry reuses a special type of CMOS technology to be able to
operate just at 100 millikelvin. But we're working on other
technologies as well. Huge advances there in support of
our top-to-bottom quantum stack. Now let's talk about
Azure Resource Manager, which is the universal
control plane for Azure. It's not ARM as in the processor, it's ARM as an Azure
Resource Manager. Azure Resource Manager as
the universal control plane, provides a bunch of
uniform capabilities across all Azure services. It's accessible
through the Azure CLI, the Azure portal PowerShell, the SDKs, and a REST API. An ARM provides consistent RBAC,
role-based access control. It provides consistent monitoring, it provides consistent policy, and it provides
consistent gestures and representations of objects
across all of Azure resources. The way that it does that
is through something called the Resource
Provider Contract. RPC allows any service to plug
in to Azure's control plane and provide capabilities that
are accessible through all these different means
and do it in a uniform way. Every single Azure service, all 200 plus plug-in to
Azure Resource Manager, which allows you to get
policy uniformly and monitor the control plane
access is uniformly and perform security insights
and monitoring on top of it. If we take a look at
ARM's architecture, one of the benefits of
ARM is that it provides a global view of Azure regardless
of where your resources are. Whether they're in one region, in North Central US or in
Europe or Asia Pacific, you can go to the Azure portal or the Azure CLI and see
all of those resources. That's because Azure
Resource Manager has an active topology that spans
all of Azure's public resources. When you create a
resource in one region, it becomes visible from
all other regions. It uses an architecture
that is the same as what we advertise to customers to
build on top of Azure, to create these kinds of
globally scaled services. It uses Azure Traffic Manager on top of load balancers
inside of datacenters, leverages Cosmos DB
for replication of state across all of those regions. Now, the way that
you've interacted with ARM has been through
the ARM JSON schema, either through Azure
Resource Manager templates, which is one of the very
powerful capabilities of Azure Resource Manager or
directly through the REST API. What I'm about to tell you is
about a project called Bicep. Project Bicep is the result of us
talking to customers that have found it challenging to leverage the ARM JSON for a variety
of different reasons. One of them is that
it's extremely verbose. Another, that it's very difficult
to represent what you want and understand what the JSON
is doing because it's obtuse and indirect about what
it's trying to accomplish. We've done things like introduce IntelliSense and provided
Quickstart galleries, but we felt that that didn't
go far enough to making it easy to author declarative templates
to deploy Azure resources. We took a step back completely and talked to a bunch
of customers and asked, "Would you like to us to make it
possible for you to implement JSON ARM templates off of an existing language type
like Python or PowerShell? " What we found is that no, really the right answer was
a domain-specific language focused on configuration as code, declarative declaration
of resources like ARM, but much more succinct
and much more programming like to make it easier for people
to develop these templates. That is Project Bicep and
working in conjunction with programming language expert
Anders Hejlsberg who created C#, TypeScript, Turbo Pascal, we've got this language now up and
running and ready for you to use. The way that it works is that
you author now templates or APIs to ARM in the Bicep language, and you can then transpile the Bicep language
into an ARM template. So you can get your
ARM JSON if you want. If you've already got tooling
built around ARM JSON templates, you can leverage that. Of course, then turn around and
give the ARM template through the Azure command line to ARM
to deploy your resources. You can also, now, we're excited to announce
that with version 0.3, you can give the Bicep manifest directly to ARM and
it understands them natively. No reason to transpile
in the middle if you just want to focus
completely on Bicep. We've also had the ability to take your ARM JSON templates and
transpile them back into Bicep and that lets you take your existing investments and now get started on
Bicep really quickly, still transpiling back
to ARM JSON to update those existing templates and
workflows and then deploy to ARM. Let's go take a look
at Bicep in action. Here what I'm going to do
is open Visual Studio Code. I've got two Bicep files. One is here, website.bicep, which you can see parameters
described at the top. You can already see it's
much more succinct than the ARM JSON for doing things
like describing parameters, which include the name, the location where we want to
deploy this resource. This is an App Server farm. You can see it's parameterized with the name that we can
give it as a parameter. It's running Linux, and you can see that we're creating a
website onto that server farm. Here's an example of a reference
from one resource to another. Now, when I go back and
create the main deployment, you can see that I've got
IntelliSense nicely setup here in Visual Studio Code for
the types of target scopes. Here I'm going to target the whole subscription
because what I'm actually doing is deploying a resource group. Resource RG is the name. RG isn't what I'm going to
give this resource group, and here in the template, and you can see IntelliSense on the APIs. I'm going to pick the latest
version of the resource group API, paste in the name of the resource group and location
where I want it to deploy to, and then a very cool
new feature of Bicep is modules where I can reference
other pieces of Bicep. Here I'm going to IntelliSense
complete with a Bicep file, that website.bicep that was
right in that directory. I'm going to paste in here populated
parameters with site deploy, the scope, the Azure Container
Registry name that I want to use, and then I'm transpiling. You can see I've got about 150 or so lines of ARM JSON that came out of
about 80 lines of Biceps. I've already cut this
by close to 50 percent just in terms of verbosity, but it's also got so
many other conveniences like modules built into it. Here I'm going to deploy
right to Azure using the Azure command line, Azure CLI, and going right into
the Azure portal, you can see the site
deployment underway. If I go click on
"Overview" and "Refresh", now you'll see that sure enough, I've got my Container
Registry created, or App Service farm created, and I've got my website deployed
onto it all using Bicep. The convenience is a Bicep natively understood by
Azure Resource Manager. You've developed your application's, deployed them using Bicep, now you want to make sure
that they work resiliently. Chaos Engineering is something
that Netflix popularized, so they had something called
Chaos Monkey that would go and continuously mess with
their deployment of Netflix to make sure that
it stayed operable in the face of the everyday failures that you see whenever you
are operating at scale. We've been using Chaos
Engineering inside of Azure, and now I want to bring
Chaos Engineering and make it available to you to
use on the Azure platform. With Azure Chaos Studio, you take your existing
Azure application, which includes your service code,
your service infrastructure, which includes the websites and
VMs that your code deploys onto, and now you can deploy
what are called experiments against it
from Azure Chaos Studio. Azure Chaos Studio supports
ARM JSON and Bicep, therefore, for you to define your
experiments and then run them against your existing resource
groups or subscriptions. It has multiple ways to inject
faults into your application. One of the ways is using an agent. If you've got the Azure agent running inside of your website
or your virtual machines, can automatically connect to the Chaos resource provider and inject faults right
into your virtual machine, like inducing high CPU or consuming memory or consuming
disk or killing processes. It also supports those
service-based Chaos. Service-based Chaos is
using the ARM APIs and data plane APIs of various
resource providers to go induce chaos on
those specific services. For example, you can go terminate virtual machines or shut them down
and see how your site responds. If it's supposed to
tolerate, for example, virtual machines going down and stay resilient, does it really do that? Let's go take a look at Azure
Chaos Studio in action. Here what I'm going to do is
gotten a sample app here, it's a drone delivery
service application. If I make a drone
delivery request here, and enter some bogus information,
I get a tracking ID. If I go back to the website
and Enter that tracking ID, I see here the drone moving there from Redmond to Bellevue
where I'm going to pick it up. The back-end for this
is Cosmos DB, and it's got geo-replication across two regions: the write region is East US, the read region is East US 2. What I want to do is cause a failover of Cosmos DB's read region from one region to another and see if the app still tolerates it. If I go in to look at my Chaos Studio experiment, you can see the location where
it's going to run is East US 2. You can see I've got steps in the experiment and I've got actions, and one of the actions is a continuous action where I'm going to continuously trigger a failover of the read region.
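Conceptually, an experiment is just data: steps that run in sequence, branches within a step that run in parallel, and continuous or discrete fault actions aimed at a selected target. The sketch below shows that shape only; it is not the literal Chaos Studio ARM schema, and the capability name, duration, and parameters are made up for illustration.

```python
# Illustrative shape of a chaos experiment (not the literal ARM schema): steps run in
# sequence, branches in parallel, and each action targets resources picked by a selector.
experiment = {
    "selectors": [
        {"id": "cosmos-target", "targets": ["<Cosmos DB account resource ID>"]},
    ],
    "steps": [{
        "name": "failover-step",
        "branches": [{
            "name": "failover-branch",
            "actions": [{
                "type": "continuous",                            # keep re-applying the fault
                "capability": "cosmosdb-read-region-failover",   # made-up name
                "duration": "PT10M",
                "selectorId": "cosmos-target",
                "parameters": {"readRegion": "East US 2"},
            }],
        }],
    }],
}
```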
If I go "Start" that experiment now, I'm triggering that read region failover. Now go back and look at Cosmos DB. You can see when I
do a "Refresh" here, that the failover is happening
because it says updating. If I do a "Refresh" again, you can see now that sure enough, my read region has failed over
from East US 2 to East US. Has the application continued to work? Well, if I go back to the tracker, let's see if the drone moves, and sure enough it's moving. The application, as expected, was able to tolerate that Cosmos DB failover as it was designed to. Now, let's go inject a fault that maybe it's not
designed to tolerate. If I go back to the
Chaos Studio experiments here and go to the kill node process experiment, which is going to use agent-based fault injection, which I've got installed into the virtual machines here that are running Linux, to inject a fault. Here in the steps, you can see the action is to leverage the Linux agent to kill the dotnet process, because this is a .NET Core app. If I start that experiment and go to Application Insights
to look at my containers, I can see that the Fabrikam delivery container, which is doing the tracking, has been in a faulted state because the dotnet core process was terminated. You can see now I've lost
the ability to track that. The application isn't resilient
to having that container fault, and that's where I might want
to go and add extra resiliency. It's a great way of leveraging Chaos Studio to go make sure your application can tolerate the kinds of known faults it should be designed for, as well as to stress it in Game Day events where you want to make sure it has maximum reliability, because your business is dependent on that application working. Now, let's take a look inside
of our compute infrastructure. One of the exciting innovations that we've had in
Azure is the ability for you to deploy custom extensions into virtual machines and
virtual machine scale sets. Our extension infrastructure
allows you to do that in a safe, reliable way and to
automatically update extensions. For example, if you deploy a security extension through extensions into your virtual machine scale sets, it can do it in a
graceful roll-out way across regions and
within a scale set, do it in a rolling update way, checking for health signals
as the update is progressing, so you can get a safe reliable
update of your Security extensions, your Security agents across
your infrastructure worldwide. We're taking that a
step further now with something new called VM applications. With VM applications, you can
deploy the main payload in that safe way using that same infrastructure that's powering virtual machine extensions. In the same way that you can store disk images in our shared image gallery, now you can store applications
in that shared gallery. That shared gallery can
be a private gallery for just your enterprise or
just a particular workload, or you can share it
with the community across your enterprise
or even publicly. That allows others to take
your applications and deploy them into their virtual machines
or virtual machine scale sets. Not only that, but the
deployment can be based off of health signals and do the rolling updates just the same
way extensions can. One of the very cool features of it is while you can leverage
the guest agent inside your virtual machine to pull directly from the
shared image gallery into the virtual machine
over the network, you can also leverage the
in-host update mechanism, which comes through an endpoint
locally on the server, to get that code into
that virtual machine, meaning that you don't
need an agent at all. Further, you don't need
any network connectivity. If you've got a situation
where you want to completely isolate those virtual
machines off the network, you can still leverage VM
Apps to get code into them. We're also innovating in the
space of programming runtimes. How do you write the code that you want to deploy into those virtual machines? Today, enterprise
developers are being asked to take on more
complexity than ever before. Many enterprise developers have been largely focused on
business problems and gotten used to framing
that in the context of website-plus-database architectures. Now, they're being asked to create microservice-based architectures. They're needing to do that because they're needing to break
up those monoliths, and to containerize them, scale them independently, update them independently so that
they can be more agile. They're also being asked now in many cases to make sure
that those applications are portable between their
on-premises environment and the public Cloud like Azure, even across Clouds.
That means learning SDKs for lots of
state in their application, now they've got to learn Cosmos DB for public Cloud maybe and Redis cache or MongoDB or Cassandra for an on-premises deployment
of that application. That means lots of boilerplate SDK code and lots of learning of things that the developers don't
really want to focus on. They want to focus on
their business problem. We've taken learnings
from Azure Functions, where Azure Functions
is really serverless: you write your piece of business logic, and the infrastructure takes
care of everything including with Azure
Functions bindings, connecting your code to external services in a
very convenient way. Plumbing inputs and
outputs from your code to those other services where you
don't have to learn those SDKs, incorporate them, update them, or even authenticate to their
services because that's all done in the bindings themselves. Dapr builds on top of that by creating what
are called building blocks. Building blocks that take care of those mundane tasks that
developers are being asked to take on, providing those capabilities through a local HTTP or gRPC endpoint, delivered as sidecar functionality, so that you as a developer don't have to leverage any SDKs at all, including the Dapr SDK; you simply make HTTP calls to talk to Dapr's sidecars, Dapr's building blocks.
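As a minimal sketch of what that looks like, here are plain HTTP calls to a local Dapr sidecar from Python, with no Dapr SDK involved; the sidecar port 3500, the state store component name "statestore", and the app ID "checkout" are common defaults used as placeholders, not fixed values.

```python
# Minimal sketch: talking to the Dapr sidecar over plain HTTP, no SDK required.
# Assumes a sidecar on localhost:3500, a state component named "statestore", and a
# peer service registered with app ID "checkout" (all placeholders).
import requests

DAPR = "http://localhost:3500/v1.0"

# State management building block: save and read a key.
requests.post(f"{DAPR}/state/statestore",
              json=[{"key": "account-17", "value": {"balance": 24}}])
balance = requests.get(f"{DAPR}/state/statestore/account-17").json()

# Service invocation building block: call a method on another microservice by app ID.
resp = requests.post(f"{DAPR}/invoke/checkout/method/withdraw", json={"amount": 10})
print(balance, resp.status_code)
```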
Building blocks like pub/sub
and state management have the concept of components that
can plug into those abstractions. So that developer I mentioned, who wants Cosmos DB in the public Cloud and MongoDB on-premises, leverages the state store component from the state store building block for that particular environment: the Cosmos DB component in the public Cloud, the Cassandra or MongoDB component on-premises. They don't have to change a line of code, they don't have to learn those SDKs, and Dapr takes care of
everything for them. In fact, it also handles
retries for them. Service-to-service invocation, secret management, pub/sub, and state management are just some of the key building block
capabilities that Dapr has. Got to really see the power of Dapr and to really
take advantage of it, you might have been waiting
for; is this thing real? Is it production ready? Well, I'm excited to announce
that just a while ago, Dapr reached a V1 production. In fact, Dapr is also a
completely open-source project. We've got open source governance. We're going to contribute
to a foundation. You as an enterprise that wants to take advantage
of open source ecosystem, no lock-in, flexibility
that get all the benefits. Now's the time where you can
start taking advantage of Dapr. One of our close partners on our
Dapr journey is the Zeiss group. The Zeiss group here is going to
talk a little bit about how they are leveraging Dapr to make their Cloud Platform
more Cloud-native. ZEISS an international technology leader in optics and
optoelectronic is replacing an existing monolithic
application architecture with a Cloud-native approach
to order processing. An early adopter of Dapr, ZEISS uses the global
reach of Azure and the integration of Dapr with
Azure Kubernetes Service. To fulfill orders faster
for ZEISS customers, we had a chance to sit
down with Kai Walter, Distinguished Technology
Advisor for ZEISS Group, and hear how Dapr is
helping them with their new Cloud-native
application development. >> Dapr makes developing a
distributed application a commodity for us and helps us
focusing on the business value. We're building a highly-available
globally distributed and easy, adaptable order processing
platform for our eyeglass business. Dapr basically handles the
service to service calls for us. It's the basis for all virtual
actor model where we keep business objects globally
and in the regions. We also use it to abstract top platform services
for all applications. Especially, actors on top of a globally distributed
multi-master Cosmos DB, helped us implement
a scenario which is otherwise complex or
challenging to implement. >> Using Azure Kubernetes Service
in combination with Dapr, the launch of the new
order processing platform has given ZEISS the scalable and resilient architecture it needed to develop services and get
them to market faster. ZEISS customers benefit from faster order fulfillment and
timely notifications of progress, something the existing
system couldn't do. >> To prove to you just how easy it is to take advantage
of Dapr capabilities. Let's go take a look
at a couple of demos. First, I'm going to Dapr-ize an ASP.NET web app where I want to save some state in a state
store without having to pull in an SDK or learn
the complexities. The first thing I'm going to do is to install the Dapr runtime to show you that we have no dependencies
on Kubernetes or Containers. We can just install
the Dapr runtime and CLI locally here for my app. The next thing I'm going to do is create a new routing endpoint for depositing and withdrawing amounts from an account database that
this website is going to manage. You can see HTTP post withdraw, I added the HTTP route so that
I can test this using curl. But Dapr is also going to accept this: if I have another microservice call withdraw on this Dapr app, it would get invoked as well. But the core logic here is saving the updated state
for that withdrawal here, updating the account balance, and then saving it using Dapr
and that's all we had to do, is reference the Dapr state
store using the Dapr client. No messing with SDKs, no messing with credentials, no messing with retries. That's all we had to do, and
the same thing happens with depositing amounts
into that account. What we have to do though
is configure a state store. What I've done here is
paste in a component for my state store where
I'm leveraging Cosmos DB. You can see I've got
a Cosmos DB key here, referenced from Key Vault using Dapr's secret management. I don't have to worry about storing
secrets in my Code or Config. Now, I can do a Dapr run by pasting that command line in and setting up the Dapr port
that it's listening at, and giving it a Dapr application ID. You can see it launching here. Now that the site is up and running, I'm going to switch terminals and I'm going to do some
curls to show you the state store being exercised here, hitting the local URL on port 5000. I hit the deposit route, passing in the JSON payload, which is the amount. If I do it a couple of times, you can see the balance: I deposit some money, and you can see I've increased the balance to 24. If I do the same thing with the withdraw route, you can see I'm decreasing the account balance. But all I had to do in that ASP.NET app was just add the state store, and I immediately Dapr-ized that application by taking advantage of one of
its building blocks. Let me show you a more
complicated example now. This is a Kubernetes-based app. This is one that you
might be familiar with if you're a .NET programmer, because this is the eShopOnContainers reference application that shows you how to take .NET apps and
run them on top of containers, on top of Kubernetes, AKS. You can see here's my shop front, so I've got the app running. But one of the things that
this app has in it is publishing the orders to a pub/sub service. It's configurable to support two pub/sub services: Service Bus and RabbitMQ. But what I'm going to do is create a Dapr class here that
is going to encapsulate the pub/sub using Dapr's
Pub/sub building block where I don't have
to worry about SDKs, I don't have to worry
about authentication. All those things that
I mentioned before, I just leveraged the
pub/sub abstraction. You can see I've got this code
here with the pub/sub component name, pubsub. I've got the Dapr event bus, the publish method,
the subscribe method. Those are automatically routed with pub/sub topics that
now I'm going to configure. But here what I can do is just add a few lines
to use Cloud events. I'm mapping my endpoint subscriber. This is the final piece of
wiring that I've got to do. But here you can see that the existing code has
this RegisterEventBus, which checks this
AzureServiceBusEnabled config to see if it should use Service Bus. You can see all the Service
Bus SDK Code there. Beneath it is the RabbitMQ code. I'm just going to take all that
existing code and delete it because we don't need it and
just add one call here to Dapr. I'm going to go delete the
RabbitMQ and Service Bus classes because I don't need them anymore. I've actually written negative lines of code, but I've made it easier for me to maintain this code. I've made it portable, not just to RabbitMQ and Service Bus, but to any pub/sub provider. The final thing I need to do is
actually implement the component, and I'm going to paste in that
component here for RabbitMQ and you can see there's a pub/sub
component for RabbitMQ. Once I've got this now if I
deploy the app with this, it's going to use RabbitMQ. But if I deployed it with
this config instead, which is just different
by a few lines, you can see that it's pub/sub
Service Bus instead of RabbitMQ. You can see the Service
Bus connection string instead of the RabbitMQ one. Then my app is going to be
using Service Bus instead. Now it's not a development-time concern; now it's a deployment and operations concern which pub/sub system I want to use. The app is portable between Azure and on-prem, or even between Azure and other Cloud providers.
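To make that concrete, here's a minimal sketch (not the eShopOnContainers code itself) of what publishing an order event through the Dapr pub/sub building block looks like over HTTP; the component name "pubsub", the topic, and the payload are illustrative, and whether that component is backed by RabbitMQ or Service Bus is decided purely by the component configuration at deployment time. Subscribers, in turn, just expose an HTTP route and tell Dapr which topic should be delivered to it.

```python
# Minimal sketch: publish an event through the Dapr pub/sub building block. The
# component named "pubsub" may be RabbitMQ on-premises or Azure Service Bus in the
# cloud; this publishing code does not change. Topic and payload are made up.
import requests

requests.post(
    "http://localhost:3500/v1.0/publish/pubsub/orders",
    json={"orderId": "1234", "status": "submitted"},
)
```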
Finally, to get Dapr into my Kubernetes app, I need to add some metadata to say that I want the Dapr API port to be port 80. I specify my config, Dapr is enabled, and I'm calling
this app the Basket API app. Now lastly, I install Dapr onto my Kubernetes cluster. Now I'm ready to deploy that app. In fact, I've got it up
here running, and now I've got my Dapr swag available to me: my cool Dapr hat, my Dapr stickers, and hoodies. Once I do a checkout, the really cool thing is that I deployed the Zipkin container as part of my Kubernetes install, which means that if I go to the Zipkin UI, I get monitoring of my orders without any extra work. If I go do this Run
Query just to look at everything that's there and
look at the basket API, I can see exactly that
order that I just placed show up in the telemetry. Dapr supports monitoring with OpenTelemetry-compatible providers out of the box, roughly a dozen of them. I automatically get tracing and insights into my application without the developer having to write a single line of code to get it. Now, if we take a look at the data protection technologies that
have been widely used today, you've got data-at-rest protection with bring-your-own keys or system-managed keys. You've got protection of data on the wire with TLS, SSL, and other encryption technologies. What's been missing is protecting that data from the outside world while it's loaded into the CPU and RAM, and that's what we call confidential computing: protecting that data while it's in use. We believe that protecting data
while it's in use, right now, might be the domain of customers
with the most sensitive data that they want protected from
malicious admins on the box, from insiders, from compromises
of multi-tenancy infrastructure. They just want to make sure nobody
but them can touch that data, nobody but the code they
write can touch that data. But with the technologies and our
investments that we're making with hardware providers, as well as the software we're building in Azure, we want to bring this to the mainstream for everybody to take advantage of, including just as defense in depth for your everyday low-risk
enterprise applications. At the core of confidential computing is
the concept of an enclave. That enclave there you can
see on the right is where that trusted code is going
to run with that code that you're going to entrust with
the data is going to execute. When you have the
application split into the non-confidential
and confidential parts or untrusted and trusted parts, the flow works by, first, creating the enclave
from the untrusted part, getting an attestation that the enclave is actually
running the code. It should be running
and that the enclave is actually an enclave
provider that you trust. Is it the hardware provider that is creating this box for
you that you trust? Once you've established trust, now, you can share secrets with the
code in enclave that allow it to go and process the data
that you want it to access. Handing it the secrets and saying, go run computation analytics, machine learning on this
data that's stored in Azure storage or some other
storage location is the next step. We've taken this and actually integrated into SQL server
which you could run in an on-premises or a isVM in
Azure that we've now made publicly in a public
preview available Azure SQL Always Encrypted
for Azure Database. We use it in leveraging our DC series of SGX hardware with SGX enclaves. The idea here is now that the SQL query processor runs
inside of a trusted enclave, an SGX enclave, and now, you can establish trust with
that SQL query processor, know that it's running
inside of an enclave, and then release secrets to
it that allow it to decrypt encrypted parts of your database and perform rich computation
inside of it. Something not previously
possible with SQL Always Encrypted where the encryption and decryption
happens on the client side. Now with it on the server side, you can get this rich
query processing happen as part of the
database transactions. Let's go take a look at that. Here, I'm going to pull up
my Contoso HR database, and you can see that the
database server is leveraging the DC series of virtual machine which enables
confidential computing. It's eight core cascade
Lake Intel processors, and when I run SQL Management
Studio and do a query that doesn't have access
to the keys to decrypt, not give the enclave
the encryption keys, you can see that I get
back encrypted versions of the social security number and other sensitive information
rather than the plain text. Now, one thing I've
got to do is set up an attestation policy here that says, only trust this enclave. If it's an SGX enclave, it's not in debug mode. If that's the case, I'm going to release my
customer managed key to that SQL enclave to allow it
to decrypt those columns. When I go and configure
that database, you can see here that I've got
column encryption specified, that I've got an
attestation protocol setup, so it's going to attest
with attestation service. Here's the attestation service API, so this is where it's
going to hand proof to that attestation
service that it is. We're actually running
in an SGX enclave and it is signed by Microsoft. At that point, my client app, it transparently because this
is built into the SQL clients, goes and contacts SQL server, gets the attestation,
checks it, and now, is able to perform those
rich computations. If I go now and take a
look at what I would see on the wire if I was
intercepting traffic, even bypassing the TLS, you're going to see that what's
happening inside is that the client is encrypting the query
parameters like the min salary, max salary I just
had on those ranges, such that only the enclave, and only that particular enclave, because we established
the ephemeral keys with it, can decrypt those queries and get insights into what
I'm actually looking for. That takes data protection to the next level by protecting it while it's in use from all of those various threats.
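As a rough sketch of what the client side can look like (not the demo's actual application): with the Microsoft ODBC driver and column encryption enabled, parameterized queries against Always Encrypted columns have their parameter values encrypted by the driver before they ever leave the client. The server, database, credentials, table, and column names below are hypothetical, and the enclave attestation settings, which are driver-specific, are omitted.

```python
# Rough sketch, not the demo app: query an Always Encrypted column with pyodbc.
# Server, database, credentials, table, and column names are placeholders; enclave
# attestation configuration is driver-specific and omitted here.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=contoso-hr.database.windows.net;DATABASE=ContosoHR;"
    "UID=app_user;PWD=<password>;"
    "ColumnEncryption=Enabled;"   # let the driver handle column encryption keys
)

# The driver encrypts these parameter values, so only the attested server-side
# enclave can see the actual salary range being searched.
rows = conn.cursor().execute(
    "SELECT FirstName, LastName FROM Employees WHERE Salary BETWEEN ? AND ?",
    40000, 60000,
).fetchall()
```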
Our final section is
looking inside of our storage architecture
and data services. When we start with storage, there's a growing number of storage services focused on
specific types of storage, from disk storage to file storage, backup services, data
transports for taking data, importing and then exporting
it from Azure public Cloud, hybrid storage solutions,
and we're also working on future technologies that I'll get into here in a second. The Azure storage architecture,
for the most part, the file and disk interfaces, NFS v3, which we were excited to announce recently, as well as the SMB protocol heads and HDFS, all sit on top of that same
basic stack of components. What that means is when we add features to this stack of components, all of those APIs
take advantage of it, including high throughput, high-capacity storage accounts, and high-capacity object sizes as well. They will also take advantage of the geo-replication capabilities
across all these different APIs. You can see here that we
create clustered groups. Those clustered groups are storage
stamps in different regions, replicating data between them, and what that allows is for you
to have a DR capability on top of Azure storage that is
given to you without you having to go and manage
that replication state. What I'm going to
talk about is one of the cool innovations we've gotten
in a particular data service, and that is Azure Data Explorer. Azure Data Explorer, you might be using on a daily basis
if you're using Azure, to go and look at insights
into the way that your services are operating. It's fantastic: custom-designed for unstructured log
storage analytics. Being custom-designed, it
can take unstructured, non-schematized data, index
it at very high-performance, store it very efficiently, and then perform queries on it at scale with very low latency
and high performance. Now, Azure Data Explorer, you can see here the
pipeline, how it fits in. The engine that you've been
using is Azure Data Explorer v2, and v3 which I'm about to show you has major enhancements on
the way that it indexes, as well as the way that it does scale-out sharding when you've got
a cluster spread out. You can almost linearly scale
out adding servers and getting just basically linear
performance improvements in the throughput and
scale of those queries. Let's go take a look at that
v3 versus v2 comparison. Here I'm going to open two views into Azure Data Explorer, one with the V2 engine on the left and the V3 engine on the right, and you can see that I'm not going to
play around with it. I'm going to throw some
serious queries at it. The data that I've got is
almost 404 billion records, about 100 terabytes of
data, partially structured. You can see that both databases
for both clusters are the same. If I do a top on them
or pull the 10 records, you can see that they're
the same records. Here you can see how
there's some schema, like client IDs and requests. But then the last column here is a message which is
completely unstructured, free form text and different
for every one of these records, which is extremely challenging for structured, operational databases because they're not great at indexing unstructured data like something designed purposefully
for log data would be. Now I'm going to run this query on it, which is going to look at a particular time window across timestamps where the level column has warning in it and the message has the word enabled anywhere in it, and count the number of records like that.
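Programmatically, that kind of query can be run with the Azure Data Explorer Python client (azure-kusto-data); the cluster URL, database, table, and column names below are placeholders standing in for the demo's dataset, and the KQL mirrors the filter just described.

```python
# Sketch: run the same style of query with the azure-kusto-data client. Cluster,
# database, table, and column names are placeholders, not the demo's 100 TB dataset.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westus2.kusto.windows.net")
client = KustoClient(kcsb)

query = """
Logs
| where Timestamp between (datetime(2021-02-01) .. datetime(2021-02-08))
| where Level has "Warning" and Message has "enabled"
| count
"""
result = client.execute("LogsDb", query)
print(result.primary_results[0])
```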
You can see that out of that query I pulled a few hundred million matching records in just a few seconds here, actually less than a second on the right side, for the
same 250 million records. If I go take a look at the stats, you can see a total wall clock
time of close to 24 seconds, a CPU time of 18 seconds, and 228 megabytes peak per node on the left; and on the right, with the V3 engine, 144 megabytes. So drastic improvements in the
efficiency of that query processor. Now the last thing I'm going to show you is
something that I think is one of the coolest aspects of this whole talk in terms
of innovation coolness. This chart from IDC talks
about the digital universe, which is all the data generated. If we compare the digital universe, which is the top line, with the amount of data
that we can store. The bottom line, you can see
that gap is continuing to widen. That doesn't mean we
want to store and process all of that
digital universe data. But certainly, by not having capacity that matches it closely, from a technical perspective there's just lots of data that we might want to analyze, understand, or get insights out
of that we simply can't store. How do we go solve this problem? We've been focused on finding new storage technologies
that can store ever-increasing amounts of data with more durability and for longer periods of time
without having to be rewritten. We've had projects with
Microsoft Research that look at storing data in
glass with Project Silica, which serves as an archival storage replacement, and storing data in holographic storage with Project HSD, which is aimed at being another type of hard disk that will overcome the limitations of hard disk storage and throughput capacity
as we go forward. But one of the things that
we've got as a project between Microsoft Research and the University of Washington is Project Palix, and Project Palix looks to actually store data inside of DNA molecules. DNA molecules have the benefit
that they're extremely durable. They last thousands
to millions of years. In fact, I was just reading
an article yesterday about woolly mammoth DNA that was discovered that's some million years old. So stored in the right conditions, it can last almost forever. But the other benefit of it
is that it's extremely dense. It's seven orders of magnitude more dense than the next closest
archival storage technology. We could store a zettabyte of data inside of a rack inside of a datacenter, something that would take many dozens, even hundreds, of data centers to store today. Just amazingly promising. Now how do we store data in DNA? Well, we leverage
just the natural way that DNA does encoding
through the nucleotides, these bases here: A, C, G, T. By synthesizing molecules that encode bit streams into the DNA, we can then store it. Then we can use standard DNA sequencing technologies to go read those DNA molecules back out and get back the original data.
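As a toy illustration of the core idea (the real pipeline adds addressing per strand, error correction, and constraints such as avoiding long runs of the same base), you can map every two bits of data to one nucleotide and back:

```python
# Toy sketch of bits-to-nucleotides encoding; the real system adds addressing,
# error correction, and sequence constraints.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

assert decode(encode(b"zero-day")) == b"zero-day"
```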
The system we built with the University of Washington to do this is pictured here. Here on the left side you
can see this is where we take the data and synthesize it. Then you can see here is
where we do the storage prep, which is placing it in the storage devices that then go into those racks that I referenced earlier. Then if we want to read it back out, here's the sequencing
part of this bench. Now this isn't exactly, of course, what it would look like
in a real datacenter once we get to production. But this allows us to do those end-to-end tests: from synthesis, to prepping and getting it ready for storage, then sequencing it back out to
make sure that this works fully. Let's go take a look
at this in action. What I'm going to do here
is take a few files. You might recognize these: Zero Day, Trojan Horse, and Rogue Code are some really great novels that I coincidentally wrote, cybersecurity thrillers, and we're going to encode those, these ePub files, into a zip file and synthesize DNA out of that. Actually I already did that
yesterday because it took a while. Let's go take a look
at the results of that run by clicking on that, and you can see here the Blob path, because this went into Azure Storage. If I take a look at that strands.txt in Azure Storage, I can see all of the DNA sequences that were produced, all the strands that that ePub zip file went into creating. What this actually does is synthesize multiple
thousands of each of these, because DNA is so
compact, why not get that extra robustness and reliability by having multiple copies. But that DNA that we synthesized, I actually have a copy of it. It's right here in the
tip of this vial, it's too small for us to really
see, but it's there. They tell me it is, and I trust them. This is, I think, one of the really cool things
about working at Microsoft, because working with Microsoft
Research, now I've had Zero Day encoded in silica glass, I've had it encoded in holographic storage with Project HSD, and now I've got my
books encoded in DNA. I plan on going home tonight
and reading them again. Now if we go back to take a look at what happens on the sequencing part, that was the synthesis part, to make sure that that
really can be read back out, we can go look at the decoding details here, at the decoded file, and see if we got back what we wanted and not some weird monster created from that DNA, though of course there's no possibility of that. And here we get, sure enough, the zip file with those
three files in it. So end-to-end synthesis, storage
and sequencing of content, arbitrary content including books. That brings me to the
conclusion of this talk. I hope you found this
useful and interesting, and that it excites you about Azure and the promise of the Cloud and some of the innovations we've been working on here in Azure. With that, I hope you
have a great Ignite.