[MUSIC PLAYING] DANIEL PETERSSON: So
to get things right, this is not rocket science. It's kind of simple. So it's mainly an
application of KISS. So let's keep all of this
as simple as possible. So before we get started,
we've seen the diagrams. But what is signaling, really? So signaling is the offer
answer exchange process that is needed to set up
a peer-to-peer connection between two participants. I normally think of
this as the handshake. How do we agree on the
connectivity pieces? So another sequence diagram. So we have four pieces. We have Alice and Bob,
their application sides. And Alice's WebRTC
library instance and Bob's WebRTC
library instance. Alice starts all of this
off by calling create offer. That gets her an offer,
which is sent across to Bob by any means. In this case, we just
call it carrier pigeon. Bob uses that offer
to call create answer. Sends that back
to Alice, who can complete the handshake by
calling set remote description. Once that has happened, we have a peer-to-peer connection created between Alice and Bob, through the magic of [INAUDIBLE] that has been mentioned before.
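To make the handshake concrete, here is a minimal sketch using the browser WebRTC API in TypeScript. The sendToPeer callback is a hypothetical stand-in for whatever carrier pigeon you use as the signaling transport.

```typescript
// Alice's side: create the offer and hand it to the signaling channel.
async function startCall(pc: RTCPeerConnection,
                         sendToPeer: (sdp: RTCSessionDescriptionInit) => void) {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);   // apply locally
  sendToPeer(offer);                     // ship it to Bob by any means
}

// Bob's side: apply the offer, create the answer, send it back.
async function onOfferReceived(pc: RTCPeerConnection,
                               offer: RTCSessionDescriptionInit,
                               sendToPeer: (sdp: RTCSessionDescriptionInit) => void) {
  await pc.setRemoteDescription(offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  sendToPeer(answer);
}

// Back on Alice's side: completing the handshake.
async function onAnswerReceived(pc: RTCPeerConnection,
                                answer: RTCSessionDescriptionInit) {
  await pc.setRemoteDescription(answer);
  // From here, ICE takes over and the peer-to-peer connection comes up.
}
```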
So the title of the talk is fast, reliable, and scalable. What does that really mean? So fast means we want
to minimize the call setup latency. Reliable means we want to
have really reliable call setup, so we have as few
failures as possible, both for the users, but
also for the developers. You get that wrong, you're going
to have pissed off developers that try to use your system. And the final thing
is you want to have a system that actually
survives the stampede of your own success. So what's the impact
of doing this? As I said, I've been
on a bunch of products. And in our
experience, we managed to reduce the call setup time
from about 10 to 2 seconds. We managed to get significantly
more reliable signaling, or call setups. The system is a lot simpler
to build and run than what we used to have before. And it turns out, it's
pretty cheap to run as well. So let's replace the carrier
pigeon with some details. We have some more components
to look at in this system. We have HTTP servers
and a database. And we have the same
use case as before. We want to set up a peer-to-peer
connection between Alice and Bob. And just as before, Alice starts
off by calling create offer. She then sends it to Bob using
an HTTP POST operation. Bob is, sort of by magic,
already polling the system, trying to receive that offer. So that once it's inserted
into the database, Bob is going to receive it. As soon as Bob receives it,
he can use it, just as before, to call create answer on
WebRTC, send it back to Alice using post. Alice, as soon as
she did the post, she entered her waiting loop. So she keeps polling, waiting
for the answer to come back. As soon as insert
completes, then Alice is going to receive the answer. She can call set
remote description, and the peer-to-peer connection is going to be created.
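For illustration, here is a rough sketch of that naive HTTP-plus-polling signaling, assuming a hypothetical /messages endpoint on a made-up host. It is here to make the problems concrete, not as a recommendation.

```typescript
// POST a signaling message (offer or answer) for the other participant.
async function sendSignal(to: string, payload: unknown): Promise<void> {
  await fetch(`https://signaling.example.com/messages/${to}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
}

// Poll for the reply: every cycle pays for a fresh HTTPS request
// (and potentially a TLS handshake), plus up to intervalMs of latency.
async function waitForSignal(me: string, intervalMs = 1000): Promise<unknown> {
  for (;;) {
    const res = await fetch(`https://signaling.example.com/messages/${me}`);
    if (res.status === 200) {
      return res.json();               // the offer or answer finally arrived
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```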
It looks good, right? Not really. So if you build this, and you try to run it at scale, you're going to
detect that it's slow. It's unreliable. And it doesn't scale at all. So as always, the devil
is in the details. So what are the problems
with the previous solution? So polling is pretty crappy. It adds overhead, both on the
server side and the database side. It also adds a lot of latency. Using polling and HTTP
adds sort of double latency because you need to
pay for the poll cycle. And you need to pay
for the TLS handshake for every single poll,
which is going to be crazy. Used as a single database
also have a set of problems. It's going to add latency
because that database needs to live somewhere globally. It's going to have replication
problems because it's just one small database. And it for sure, won't scale. There are some other
details to this. There aren't any delivery ACKs. If you don't have delivery ACKs in the system, you actually don't know if the message made it across, and developers are a lot happier when they know. And beyond that, we don't have any real rendezvous system. So as I said, we
rely on magic for Bob to know when to start polling. So it just doesn't work. So the details that we sort of
need to go through and design. We need to build a
rendezvous system. We need to look at the
semantics of message delivery, so we get those right. We need to figure out how we do
system sharding or partitioning, so we get the properties
out of the system we want. We need to have a
better client-server protocol. HTTP GET and POST,
they are great. But if you actually
want to have performance and you want to have
performance on mobile devices, you need something else. And finally, but
probably the most important from a
developer perspective, let's make sure the system can
handle retries and idempotency. Otherwise, it's going
to be [INAUDIBLE]. So rendezvous. It's kind of a fancy name
for a really simple question. How does Bob know that
Alice is calling him? So we have two main options. We have GCM and APNs. Or we could build
our own system. GCM, or Firebase Cloud Messaging
or Google Cloud Messaging-- same thing, different names-- is really push
notifications for Android. It's reliable in quotes. And it has a
reasonably low latency. So the normal latency is
about 350 milliseconds, assuming you can actually
reach the other device. If it's in a drawer
somewhere without battery, it's going to take longer. And I say reliable within
quotes because there are corner cases where they
drop messages silently. Especially if you run out of
capacity on the server to store more messages
for that endpoint. APNs. So this is Apple Push
Notification service. It's built-in to
all iOS devices. It's best effort
by default. So it means that they're going to drop messages, both client side and server side. And I'm fine with that-- at least
they deliver what they promise, unlike the GCM approach. They have better latency
most of the time than GCM-- about 250 milliseconds. It sort of depends. But you can count on it.
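As a concrete illustration of the rendezvous tickle, here is a sketch of sending a data-only push through FCM's HTTP v1 API. The getAccessToken helper is a hypothetical stand-in for server-side OAuth; the payload carries no real data, its only job is to wake the callee's app so it can go fetch the pending offer.

```typescript
async function tickle(projectId: string, deviceToken: string,
                      getAccessToken: () => Promise<string>): Promise<void> {
  const res = await fetch(
    `https://fcm.googleapis.com/v1/projects/${projectId}/messages:send`,
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${await getAccessToken()}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        message: {
          token: deviceToken,
          data: { tickle: '1' },   // wake up and bind; the payload itself is empty
        },
      }),
    },
  );
  if (!res.ok) {
    // Remember that delivery is best effort, so treat failures as expected.
    console.warn('tickle failed:', res.status);
  }
}
```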
We have the third option. You could always go off and say, hey, I want to build my own thing. You could use a
long-lived TCP connection from the device to the
cloud, handle that. You can make it as
reliable as you want to. You can make it really,
really low latency. Turns out it's scary hard to
implement, especially at scale. It will most likely
drain the battery rather than make
the user happier. And on iOS, it's
almost impossible to actually implement because
of the constraints put on background processes on iOS. So message delivery. Semantics, but really important. What does delivered really mean? Right. So we define a message as delivered when it has been received by the destination device and handled by the
application successfully. So what would that
mean in our example? That means that the offer from
Alice has been received by Bob. Bob has been able to apply it using create answer. And that operation succeeded. Only when we reach that point do we mark that message as delivered. Everything else is
just lost or random. We have some other
details in here. We need to decide how
strict we're going to be with message delivery. There are really
two big options. We can say in
order or any order. In order means that
we promise to deliver all messages in exactly the
same order as they were sent. The other option is to say,
let's be a little more relaxed and say, we deliver them most
of the time in the same order as they were sent. But sometimes, we get it wrong. The reason for the "sometimes"
is when you scale out, you're going to realize
that servers go up and down at the worst possible time. When that happens,
you need to retry. And that tends to rearrange
stuff on the wire. The second thing, which
is also sort of semantic but critical: you
can say that we're going to deliver every message
exactly once or at least once. So exactly once means
one message was sent. We delivered exactly that one. And the other one
is yeah, sometimes we deliver more than once. Intuitively, it feels kind of nice to say, ah, let's provide
in order and exactly once, right? We only have four options. That's the one that sounds
really simple to use. But when you apply latency
and complexity dimensions, you're going to
realize it's slow and it's scary hard to build. So the Google way is,
let's make it simple and let's make it fast. We need to handle the
problems client side. But still, it turns out that works really well.
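A minimal sketch of what handling "at least once, in any order" can look like on the client, assuming each message carries a unique messageId that is reused on retries. The field names here are assumptions for illustration, not the actual wire format.

```typescript
interface SignalingMessage {
  messageId: string;   // unique per message, reused when the sender retries
  payload: unknown;
}

class InboundDeduper {
  private seen = new Set<string>();

  // Returns true if the message should be handed to the application,
  // false if it is a duplicate caused by a retry or a reconnect.
  // Out-of-order arrival is simply tolerated by the application logic.
  accept(msg: SignalingMessage): boolean {
    if (this.seen.has(msg.messageId)) {
      return false;                    // duplicate: drop it, but still ACK it
    }
    this.seen.add(msg.messageId);
    return true;
  }
}
```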
System sharding. We need to partition the system we're building. We also know by experience
that most calls are made within a single region. You only call your friends. Most of your friends are in the region that you are in-- most of the time. If we don't do this, then every single request that leaves one region and has to go to another for no good reason at all is going to add another 250 milliseconds of round-trip time to every call. So if we apply this to the
initial sort of polling one, that would mean that the
lowest possible latency would be 250 milliseconds because
that's the cost of round trip. Even if you got extremely lucky. If you got unlucky,
it would be worse. So the thing we
realized from this is we need to store
user data in the region where the user lives. And we need to avoid
cross-region calls if at all possible. That last part is critical, right? If at all possible. So we know we need
to shard the system. We know we need to
shard it by users. So how do we do it? We can be very lucky and lazy as
we were when doing Duo and say, hey, this is
phone number-based. Phone numbers are, by
default, sort of regionalized. We can use that one. It works. OK. There are other options. A really simple one is
to rely on a registration system and DNS. So you rely on DNS to
find the server that is closest to the
user and that server to actually know which
region it belongs in. So it's the simple example. So we have Alice again. We have a global user database. We have a DNS system. It happens to be the
one Google provides, but you can actually
use any one. Then, you have servers
in each of the regions. So when Alice wants
to do a registration, the first thing she
does is she's going to do a DNS query for some DNS. If you configure DNS
correctly, that's going to end up resolving
to the server that is
closest to the user. That server knows
which region it's in. All the cloud
providers, you can just query the instance metadata and it's going to return the region. When we update the user
record in the global database, we can say Alice is homed in the EU. Pretty simple.
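Here is a minimal sketch of that registration step, assuming a Node.js server running on Google Cloud. The metadata endpoint is standard GCE behavior; the userDb helper and its updateHomeRegion method are hypothetical.

```typescript
// On GCE, the zone (and from it the region) of the running instance is
// available from the metadata server.
async function getMyRegion(): Promise<string> {
  const res = await fetch(
    'http://metadata.google.internal/computeMetadata/v1/instance/zone',
    { headers: { 'Metadata-Flavor': 'Google' } },
  );
  // The response looks like "projects/123456/zones/europe-west1-b".
  const zone = (await res.text()).split('/').pop()!;
  return zone.replace(/-[a-z]$/, '');   // "europe-west1-b" -> "europe-west1"
}

// The server the client reached via DNS records that region as the
// user's home in the global user database.
async function registerUser(
  userId: string,
  userDb: { updateHomeRegion(id: string, region: string): Promise<void> },
): Promise<void> {
  const region = await getMyRegion();
  await userDb.updateHomeRegion(userId, region);
}
```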
The drawback of this is if Alice is traveling to the US the first time she uses the application, she ends up homed in the US and is never moved. She is going to have a bad day. So you need to add some kind of user migration scheme if you do it like this. OK. So a few times here,
I've said local storage or global storage. So what does that really mean? So local storage is really
some kind of storage. It could be a MySQL database running within a single region, with a low replication factor. But we can use it as scratch space. That means we're going
to have super-fast writes and we're going to
have super-fast reads. Global storage is the
thing we want to avoid. Sadly, as you can
probably tell, you can't avoid it entirely. It's globally replicated. It's very cacheable, because
we don't change it very often. Reads can be fast because
we can cache everything. But writes are going
to be glacially slow. That's just a fact. So don't do it often,
but you can live with it. OK. Client server protocol. As I said, we know that HTTP
polling is expensive and slow. There's the TLS overhead, the overhead of [INAUDIBLE] doing an extra call or extra round trips, especially on mobile networks. And we risk hot-spotting databases because we keep
reading the same entry over and over and over again. It's not very good. So we sort of have this
wish list for mobile devices on how the client-server protocol should work. We want it to support server
to client push. So we don't need
to do the polling. We just need to do
one TLS handshake. And then when the
server knows something, it can tell us about it. We need it to be really
cheap on the wire. Sure, we're going to set
up a mobile call soon. That's going to
burn a lot of data. But still, the actual
negotiation should be cheap. We want it to be fast. Really, really fast. And the last one,
especially as a developer, it should be simple to use. It shouldn't be
complex and hard. So at Google, we use
protocol buffers. There is this standing
joke at Google that there are only
two kinds of jobs. You do proto-to-proto or
you go proto [INAUDIBLE]. And that's all you do at
Google as a software engineer. Essentially, it's true. So protocol buffers
is a very flexible way to define data structures
that are language-independent. They are pretty easy to use
once you get used to the quirks. They have some. They're strongly type and
they have a really nice compact binary representation. So when you send them across
the wire, they're cheap to send. About a year ago, we released
something called gRPC-- Google Remote Procedure Calls. So this is a public port
of an internal system that we use, more or less,
day-to-day for everything we do. It's defined using
proto buffers. So you define your service as a
set of operations on a service. It supports server-to-client pushes through long-lived bidirectional RPCs. So essentially, the
client connects, and then you can start
to send and receive across that connection
without retrying. It's based on HTTP/2 or QUIC. For Duo, we used QUIC. And that's been
sort of publicized. And the nice thing
is it has support for more than 10 languages. So when you want to
build something that's like iOS, Android, server-side,
and something else, you don't need to write
the same service API library multiple times. OK.
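As a sketch of what the long-lived bind channel could look like from a TypeScript client using @grpc/grpc-js, assuming a stub generated (proto-loader style, plain objects) from a hypothetical signaling.proto with a bidirectional Bind RPC. SignalingClient and the message shapes are assumptions, not the actual Duo API.

```typescript
import * as grpc from '@grpc/grpc-js';
import { SignalingClient } from './signaling_stubs';  // hypothetical generated stub

const client = new SignalingClient(
  'signaling.example.com:443',
  grpc.credentials.createSsl(),        // one TLS handshake for the whole session
);

// One stream, kept open: the server pushes to us whenever it has
// something, and everything we send reuses the same connection.
const bind = client.bind();

bind.on('data', (msg: { offer?: unknown; answer?: unknown; ack?: unknown }) => {
  // A server-to-client push: an incoming offer, answer, or delivery ACK.
  console.log('pushed from server:', msg);
});

bind.on('error', (err: Error) => {
  // Mobile networks are flaky; expect drops and re-bind with backoff.
  console.error('bind stream broke, will re-bind:', err.message);
});

// Client-to-server traffic goes over the same stream, no new handshake.
bind.write({ send: { to: 'bob', payload: '<offer sdp>' } });
```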
Probably, the most critical thing to get this working
for real, you need to realize that
mobile networks are flaky. I can't stress that enough. They are really flaky. Even though they seem
to work, they don't. You have frequent disconnects. You have a lot of
timeouts and packet loss. And you just need
to deal with it. So in practice, that means
that every single API that you publish has
to be retry friendly. The fancy word for
that is idempotent. So what it means
is that you need to be able to do the same
operation multiple times with the same arguments
and you should get the same outcome. In this example, we have a
really crappy retry loop that tries to send Bob an offer. If it succeeds, we
break out of the loop. And if it fails, we try again. If this isn't implemented
in a retry friendly way, we're going to send Bob
three copies of the offer if he is on a crappy network. Because the call may
succeed, even though the client-- the sender-- thinks it failed because of a timeout. So the request makes
it to the server. The server processes the
request and sends it on, but the client times out before it actually realizes that it worked.
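A minimal sketch of a retry-friendly send, assuming the server treats messageId as the idempotency key so that retrying with the same messageId can never deliver the offer twice. The sendMessage RPC wrapper is hypothetical.

```typescript
import { randomUUID } from 'crypto';

async function sendWithRetry(
  sendMessage: (req: { messageId: string; to: string; payload: string }) => Promise<void>,
  to: string,
  payload: string,
  maxAttempts = 3,
): Promise<void> {
  // Generate the idempotency key ONCE, outside the retry loop.
  const messageId = randomUUID();
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await sendMessage({ messageId, to, payload });
      return;                              // accepted by the server, done
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      // The previous attempt may have succeeded server-side even though
      // we saw a timeout; the stable messageId makes the retry harmless.
    }
  }
}
```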
So design summary here then. We want to have a rendezvous system based on GCM and APNs. We want to define
delivered as successfully handled by the application. We want to have a
delivery model that says, yeah, we're going to do it
at least once and in any order. So that means the
client needs to handle duplicate messages, out-of-order messages, all kinds of crazy stuff. The lucky thing is that's kind
of easy to do on the client. Doing that on the server, that's
tricky and really, really hard. Sharding, we say,
hey, let's do it by region on the
first registration. And protocol. We make sure to build
everything as idempotent GRPCs. [INAUDIBLE] simple. No rocket science, as I said. So we'll remember
this one, right? This is where we
started, more or less. HTTP post and gets and polling. And we already know
why this is bad. So a reliable GRPC solution
looks more or less the same. Somewhat. You could even claim that
it's a bit more complex. But if we work it
through, you're going to realize that it's
kind of nice and easy to reason about. We have a new server type,
CS, Connection Server. The first thing you connect to. You have GCM, which is
Google Cloud Messaging. Just to make it simple
and not have APNs and all of this in the same picture. You still have Alice's app and
Alice's library, Bob's library and Bob's app. OK. So how does this work? Alice call binds as soon
as the application starts. We want to do it as
early as possible to make sure that this is
ready when we want to send. Setting up the connection is
going to be the slowest thing. Bind, in this case, is
a bidirectional GRPC. So it's going to
stay up and it's the one we're going
to use for all of the subsequent operations. As soon as the
offer is created, we send it to Bob using the
bidirectional bind channel. That's going to hit
one connection server. Yes, any random one. We're going to
tickle Bob because we know that Bob isn't around. I'm going to go into
the details soon. GCM is going to forward a
tickle to Bob's application, causing it to wake up. And actually, do
something about it. Or actually, causing Bob
to actually answer, really. On the other side, Bob is going
to do the same bind to sort of set up the connection. As soon as he calls bind,
the offer sent by Alice is going to be delivered. But we say that
this is the offer and it's coming from Alice. As soon as Bob receives it,
he calls create answer. Once that succeeds, he
acknowledges the offer and sends the answer to Alice. We know that Alice is connected,
so the first connection server just forwards
the answer directly to Alice's connection server. We don't need to go
through any other loops. Alice does the same thing. She receives the answer,
calls set remote description, and then acknowledges
the answer. So once again, we have the
peer-to-peer connection up and running. That's pretty simple,
not very hard. It's kind of easy to
reason about, anyway.
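A minimal sketch of Bob's side of that flow: an offer arrives on the bind channel, is applied to WebRTC, and only then is it ACKed, matching the definition of delivered above. The ackMessage and sendAnswer helpers are hypothetical wrappers over the bind RPC.

```typescript
async function onOfferPushed(
  pc: RTCPeerConnection,
  msg: { messageId: string; from: string; offer: RTCSessionDescriptionInit },
  ackMessage: (messageId: string) => Promise<void>,
  sendAnswer: (to: string, answer: RTCSessionDescriptionInit) => Promise<void>,
): Promise<void> {
  // "Delivered" means the application handled it successfully, so the
  // ACK only goes out after create answer has succeeded.
  await pc.setRemoteDescription(msg.offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);

  await ackMessage(msg.messageId);        // mark the offer as delivered
  await sendAnswer(msg.from, answer);     // back to Alice over the bind channel
}
```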
So let's dig into the actual details on how to implement this. This is a bit gritty, but
this is the way it is. Again, connection server, GCM,
another connection server. We have a local
connection database. We have the global
user database. And we have the local
message database. Alice starts by
issuing her bind. As soon as she binds
to a connection server, we write her connection
information directly into the connection table. That's essentially IP, port, and some identifier for Alice's state on that
connection server. When we bind, we also
check the messages table. No one called Alice, so there
is really nothing to fetch. Alice, again, wants to send the first message, so she calls send, addresses it to Bob, and passes in the message. We need to verify that,
a, Bob is a valid user. And b, where he is. Because we know that
it's not unlikely that he's going to be
in the same region. As soon as we know
where Bob is, we can issue a select to actually
get the connectivity details for Bob. But as we know,
Bob isn't around, so that's going to return empty. We also insert that message
directly into the database. This is actually the
point where we tell Alice that this operation succeeded. And the reason we do
it as early as possible is to minimize the risk of an
error internally propagating out. As soon as we return to
Alice, we actually continue. We call GCM and say, hey, tickle
because we need to reach Bob. At some point, further
into the future, GCM is going to wake up the
application on Bob's side. And we sort of have
started the first thing. OK.
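A minimal sketch of what the connection server might do on send, following the steps just described. UserDb, ConnectionDb, MessageDb and Push are hypothetical interfaces standing in for the global user database, the regional connection and message tables, and GCM.

```typescript
interface UserDb { homeRegion(user: string): Promise<string | null>; }
interface ConnectionDb { lookup(user: string): Promise<{ host: string; streamId: string } | null>; }
interface MessageDb { insert(msg: { from: string; to: string; payload: string }): Promise<void>; }
interface Push { tickle(user: string): Promise<void>; }

async function handleSend(
  from: string, to: string, payload: string,
  userDb: UserDb, connectionDb: ConnectionDb, messageDb: MessageDb, gcm: Push,
  forward: (conn: { host: string; streamId: string }, payload: string) => Promise<void>,
): Promise<void> {
  // 1. Is the destination a valid user, and which region is he homed in?
  const region = await userDb.homeRegion(to);
  if (region === null) throw new Error('unknown user');

  // 2. Look up connectivity details and persist the message in the
  //    regional message table; after this insert we can tell the sender
  //    that the operation succeeded.
  const conn = await connectionDb.lookup(to);
  await messageDb.insert({ from, to, payload });

  // 3. Forward directly if the destination is connected, and always
  //    tickle GCM as well, accepting the race between the two paths.
  if (conn !== null) void forward(conn, payload);
  void gcm.tickle(to);
}
```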
So the application is now running on Bob's side. He calls bind. Once again, we insert
his connection info into the database. We check if any messages are pending. Yep, there is a
message from Alice, so we're going to
deliver that one. Bob handles it by
calling create answer. Then, he calls ACK. On ACK, we update the state of
the message in the database, but we don't actually delete it. It saves a tad bit of overhead,
especially on our database. But it also requires you have
some kind of cleanup system that goes around after
the fact and deletes all of these old messages. So Bob wants to send his
answer back to Alice. So he calls send and once again, we
check, does Alice even exist? That's pretty cacheable,
but we do it anyway. We figure out, where
is Alice connected? And we know that
Alice is connected, so we're actually going
to get the result back. And then, we insert the
message into the database. We know that Alice
was connected, so we got her IP, port, and ID. That means Bob's
connection server can create the direct
connection with Alice and do a direct call saying,
hey, here's the message. But the trick is, there is
always a tiny, tiny risk that Alice goes away and
loses her connection before we manage to deliver it. So we always tickle
GCM, even if we know that there is a message there. Even if we know that there
is a connection there. There's going to be a
race between the delivery of the message and
the tickle from GCM. We just need to live with it. Client-side, that's
kind of easy to do. Server-side, it would be a pain. Once again, Alice receives
the message, calls ACK. She updates the state in
the database and done. So how does this really work
when you have a bigger system with all the pieces in place? So once again,
connection servers. GCM, local state, local state,
and global user database. As I said, this is drawn
as sort of global only, but think of it as
highly cacheable state you can move about. So again, Bob wants to
send his response to Alice. He goes out. Oh, where is Alice's home? Then, goes to the
connection database in exactly that
region and checks, is Alice connected
at the moment? And also, inserts the
message into the database. Alice is connected, so
Bob's connection server connects directly to Alice's
and forwards the message. We do the tickle
in parallel to make sure we don't lose
the connection. And we have the normal race. Alice receives the answer,
calls set remote description. That succeeds, so she says ACK. As soon as she
says ACK, we update the entry in the database that
says this is now delivered. That message will never
be delivered again. There are, of
course, races where you sort of have
messages coming in, reconnects, and stuff like that. So that may actually
be delivered again. And that's why we say
it may or may not. The client needs to deal with it. OK. So again, back to the overview
of the reliable solution. It's kind of easy to
reason about because you don't have polling loops. You don't have strange state. It has clean and
simple operations. One thing that this
doesn't show is that a lot of the
operations in this slide and the previous slides are
very easy to parallelize, so you can do them
at the same time. And you can also do them a
bit sort of optimistically. You just try. And if it fails, it's
just the cost of a fail. There has been a lot
of details in here. But stuff we haven't covered,
which is equally hard or simple, depending on how you
look at it, is authentication, identity, load
balancing, idempotency-- how do you actually implement
that to make it work-- message identity,
cache eviction, especially global state. How do you make sure you evict
the stuff that's in the cache so we can update it? And how do you handle
poison messages? In the current system, if you
get a poison message in, that actually causes Alice's-- the receiver's-- client to crash. We can never get
out of that state. It's going to keep
crashing, keep crashing. Because as soon as
we deliver, it dies. So that needs to be handled.
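One hedged sketch of how a client could defuse poison messages: keep a per-message failure count in durable local storage so it survives the crash loop, and ACK the message away after a few failed attempts. The storage helpers and the message shape are assumptions for illustration.

```typescript
const MAX_ATTEMPTS = 3;

async function handleIncoming(
  msg: { messageId: string; payload: unknown },
  handle: (payload: unknown) => Promise<void>,        // may throw or crash
  ack: (messageId: string) => Promise<void>,
  storage: {
    getAttempts(id: string): Promise<number>;
    setAttempts(id: string, n: number): Promise<void>;
  },
): Promise<void> {
  const attempts = await storage.getAttempts(msg.messageId);
  if (attempts >= MAX_ATTEMPTS) {
    // Give up instead of crashing forever: ACK so the server marks the
    // message delivered and never pushes it to us again.
    await ack(msg.messageId);
    return;
  }
  await storage.setAttempts(msg.messageId, attempts + 1);  // record before handling
  await handle(msg.payload);
  await ack(msg.messageId);                                // only ACK on success
}
```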
So the summary of this tiny talk: you've got to
remember the details because that's where the
actual complexity of this is. Don't go home and build a
custom rendezvous system. It just won't work. It has been tried once too
many times, so let's not repeat that. Make sure you have very clear
message delivery semantics. It doesn't need to
be exactly the ones I picked here, but
make sure they are clear so the client teams
know what to expect. Shard the system based
on actual user behavior. For Duo and Hangouts, we
know how the users behave. But how your users are
going to behave, I don't know. You need to figure it
out, or assume something when you start. Use something really fast
and really lightweight as the client-server protocol. Preferably, something
that's easy to use as well, because it's going to make
your developers a lot happier. And finally, and I can't
stress this enough, make sure all APIs
are retryable. Because if they're
not, you're going to have very strange
states flying around in your application. That's it. Questions? [APPLAUSE] [MUSIC PLAYING]