[MUSIC PLAYING] ANDREW FITZ GIBBON:
Hello, everyone. Now, I'm sure we've
all seen this happen. Your app crashes. Not only have we
seen it happen, we've watched it unfold in
real time and then really hoped the user sends
us a crash report. And these crashes take a
variety of forms, right? Maybe it's an application
not responding, a blue screen, a kernel panic,
or any of the many other forms of a crash. And we've all seen these happen. And the more devices
we end up running on, the more we'll run into
different kinds of crashes. App crashes happen. And, as much as
we, as developers, try to mitigate them, they
will continue to happen. So how do we actually see
them happen if our apps are deployed and out in the wild? How do we measure and observe
these highly distributed processes? Hi. I'm Fitz, Developer
Advocate with Flutter. And, early on in my
computing career, I was a sysadmin
responsible for keeping the servers and the services
on them up and running, 24/7. And, in my first
software engineering job, I was heavily involved in
the team's DevOps work. It wasn't called that back
then, but it was the same deal. Make sure our services
stayed up and deployments happened successfully. The common thing
between both of those? App crashes happened. It's definitely a thing. And, in both, I got
really comfortable looking at graphs
like this fake one, here, showing us how
well things are behaving. This graph might
be fake, but it's got the right idea, measuring
the health of our app. Here, the y-axis is
the percent of apps that don't report a crash. The x-axis is time,
roughly segmented by day. Everything looks
great here, right? A smooth, straight line at 100%. Except, what is that on day 4? Alarm bells start going
off in my head here. That tiny little bump is
obviously a huge problem, but is it a crash? So before we get into the
code, let's clarify that word, "crash." If you're just using an app,
going about your business, then the app is not crashing. Right? Everything's working
and normal, whatever "normal" means for that app, right? So, in its simplest
form, a crash is just anything happening
that's not normal. So, on one end of
things, we have normal. Everything's fine. And to understand if
something is a crash, that is, not normal, we'll have
to understand what normal is, too. On the opposite end,
we have those things that are obviously and
definitely a crash. Uncaught exceptions,
segfaults, whatever it is, the app is definitely
not working anymore. And, in the middle,
we have things that are slightly
different than normal. Maybe bad things
happen inconsistently or they happen
slower than normal. These kinds of crashes
are sometimes small enough for people to not notice them. My question, then, how do
we see things happening in the first place so
that we can figure out what to do later? To explore this, I
built a simple demo app. It starts with a Google
Sign-In and then flips to a randomly generated
grid of clickable squares. The blue ones are pretty boring. Click on them, and we get a
new page with a nice image. Go back. Click another blue square. Same thing. It's pretty reliable, and we can
go about our business as normal without crashes and keep doing
our perfectly normal thing of clicking blue squares. The yellow ones are similar
but slightly different. Sometimes we click
them, and the same thing happens as with
the blue squares. We go to a new
page with a photo, and sometimes it takes a while. And sometimes it
simply crashes. Whoops. You might be able to guess,
by now, what the red ones do. Click a red square. App crashes. Restart. Click a red square,
and it crashes again. That is our definite
crash variety. What I'll do for the
rest of this talk is show how to monitor
these things, in the hopes that we can do something
about them later. And the reason that
this is a problem is that, yeah, we've got
our app running here. That's all well and good
when you've got one instance and it's on your dev machine and
you can attach the dev tools. In reality, your app
is wildly popular, and there are many, many,
many, many, many thousands of instances of it, and
you can't attach a debugger to any of them. As a former sysadmin working in
the scientific computing space, to me, this looks a whole heck
of a lot like a distributed system. And how do we monitor
highly distributed systems? With the cloud. And so I've got this app
wired up with both Google Cloud and Firebase, maybe
in a slightly unusual way, but today I'm going to focus
mainly on the Google Cloud integration. This app uses it in three ways. First, to get a sense
of what normal is. How long do various user
actions actually take to happen? In the demo app,
this would be things like the time between a
user clicking on a square to the image showing up. And a big part of that
is also just general logging and getting some
extra information about what the various instances
of the app are doing so that we can see
what's happening even without user interaction. And finally, of course,
is the interesting bit-- the outright crashes with,
if possible, stack traces. For this app, we first
sign in with Google to get that
authenticated client. And, on platforms where Google
Sign-In isn't available yet, you should use a lower-level
OAuth 2.0 client. But, in both cases, we need to end up with an authenticated HTTP client.
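As a rough sketch, that sign-in step might look like this, assuming the google_sign_in package together with the extension_google_sign_in_as_googleapis_auth helper; the function name and scope choice here are illustrative, not necessarily the demo app's code:

```dart
// A minimal sketch, assuming google_sign_in plus the
// extension_google_sign_in_as_googleapis_auth package.
import 'package:extension_google_sign_in_as_googleapis_auth/extension_google_sign_in_as_googleapis_auth.dart';
import 'package:google_sign_in/google_sign_in.dart';
import 'package:googleapis/logging/v2.dart';
import 'package:googleapis_auth/googleapis_auth.dart';

Future<AuthClient?> signInForLogging() async {
  // Request the scope that allows writing log entries.
  final googleSignIn = GoogleSignIn(scopes: [LoggingApi.loggingWriteScope]);
  await googleSignIn.signIn();
  // The extension converts the sign-in into an authenticated HTTP
  // client that the googleapis packages can use directly.
  return googleSignIn.authenticatedClient();
}
```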
And then, once at the point
of writing a log message, you should import the
relevant Google APIs package, in this case,
logging and trace, to access those
respective services. These are pretty
raw API clients. And so, for this case, I have
a small wrapper to help out. This wrapper does three things. First, it initializes the resource
that we're working with. This defines, most
importantly, the cloud project to send the logs and
traces to. Second, it creates the log entry itself. The message log levels-- such
as info, critical, error, et cetera-- go here. And, finally, it makes the async call to write the log message to the cloud service.
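Here's a minimal sketch of such a wrapper against the googleapis logging/v2 API; the class name CloudLogger and the log name demo-app are invented for illustration:

```dart
import 'package:googleapis/logging/v2.dart';

// Hypothetical wrapper; the demo's real helper may differ.
class CloudLogger {
  CloudLogger(this._api, this._projectId);

  final LoggingApi _api;
  final String _projectId;

  Future<void> log(String message, {String severity = 'INFO'}) async {
    // 1. The monitored resource: most importantly, which cloud project
    //    the logs and traces are sent to.
    final resource =
        MonitoredResource(type: 'global', labels: {'project_id': _projectId});
    // 2. The log entry itself, carrying the message and its level.
    final entry = LogEntry(
      logName: 'projects/$_projectId/logs/demo-app',
      resource: resource,
      severity: severity,
      textPayload: message,
    );
    // 3. The async call that actually writes the entry to Cloud Logging.
    await _api.entries.write(WriteLogEntriesRequest(entries: [entry]));
  }
}
```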
Next, in order to log
uncaught exceptions, I've wrapped the
top-level runApp to catch both Flutter framework errors and all other errors. These both call out to the same logging helper we just wrote. Let's go turn this into some actual code.
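As a sketch, that wiring might look like the following, using Flutter's FlutterError.onError hook plus a guarded zone; cloudLogger and MyApp are placeholders standing in for the demo's own helper and root widget:

```dart
import 'dart:async';
import 'package:flutter/material.dart';

// Placeholders: the CloudLogger sketch from above, initialized after
// sign-in, and a stand-in root widget.
late final CloudLogger cloudLogger;

class MyApp extends StatelessWidget {
  const MyApp({super.key});
  @override
  Widget build(BuildContext context) => const MaterialApp(home: Placeholder());
}

void main() {
  runZonedGuarded(() {
    WidgetsFlutterBinding.ensureInitialized();
    // Flutter framework errors come through this hook.
    FlutterError.onError = (details) {
      // Fire and forget; we don't block the error handler on the network.
      cloudLogger.log('${details.exception}\n${details.stack}',
          severity: 'ERROR');
    };
    runApp(const MyApp());
  }, (error, stack) {
    // Everything else, including uncaught async errors, lands here.
    cloudLogger.log('$error\n$stack', severity: 'ERROR');
  });
}
```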
Now that we have the client and
the basic logging mechanisms, let's go and actually use them. So here is where the various
square clicks are handled, a single onTap for
all three colors. The destination is the same,
so let's use the same function. Obviously, things are a
little bit contrived here, randomly forcing different
unique exceptions to be thrown in order to guarantee that they're not caught.
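Purely for illustration, that contrived handler might look something like this; SquareColor, showPhotoPage, and the exact exceptions are all placeholders rather than the demo's real code:

```dart
import 'dart:math';
import 'package:flutter/widgets.dart';

enum SquareColor { blue, yellow, red }

// Placeholder for the navigation to the photo page.
void showPhotoPage(BuildContext context, {bool delayed = false}) {}

// One handler for all three colors: blue just works, yellow is
// flaky, red always throws a fresh uncaught exception.
void onSquareTap(BuildContext context, SquareColor color) {
  switch (color) {
    case SquareColor.red:
      throw StateError('red square tapped at ${DateTime.now()}');
    case SquareColor.yellow:
      if (Random().nextBool()) {
        throw StateError('yellow square tapped at ${DateTime.now()}');
      }
      showPhotoPage(context, delayed: true);
    case SquareColor.blue:
      showPhotoPage(context);
  }
}
```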
But my first step, here, is
to get a sense of normal. How long does it take between
tapping a button and the Dash image showing up? So, in my helper, I'm
going to add some tracing. If we don't already have
a trace and a span ID-- these represent the
operation that we're doing-- we should first create one
for this instance of the app. If we already have one,
we can just reuse it. And then we can add it to the log entry to get some of those timings.
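Sketched out, that could look like the following; the trace resource-name format follows Cloud Logging's documented convention, but the helper names and the 'global' resource are my illustrative assumptions:

```dart
import 'dart:math';
import 'package:googleapis/logging/v2.dart';

// Illustrative trace state, created once per app instance.
String? _traceId;
String? _spanId;

String _randomHex(int length) {
  final rand = Random();
  return List.generate(length, (_) => rand.nextInt(16).toRadixString(16))
      .join();
}

LogEntry entryWithTrace(String projectId, String message) {
  // Create the trace and span IDs if we don't have them yet;
  // otherwise, reuse the existing ones.
  _traceId ??= _randomHex(32); // Cloud Trace IDs are 32 hex characters.
  _spanId ??= _randomHex(16); // Span IDs are 16 hex characters.
  return LogEntry(
    logName: 'projects/$projectId/logs/demo-app',
    resource: MonitoredResource(type: 'global'),
    textPayload: message,
    // Cloud Logging expects the trace as a full resource name.
    trace: 'projects/$projectId/traces/$_traceId',
    spanId: _spanId,
  );
}
```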
I also want to
differentiate between what's part of a trace
versus what's just a general log, that second thing
that we wanted to do with cloud logging, and so I'll add
another Boolean parameter here to say that. I'll then have to go back
to the calls of this helper and add this parameter. When the button is clicked,
for example, no matter what the color is,
log it as a trace. Second, when the new widget
loads with that image, we'll send another trace log up. That'll give us the difference
between the two timings for how long that
operation takes. And, for all the
other logs, these are just the general logs. Right now, they're not going to be traces.
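Putting that together, a version of the helper with that Boolean flag might look roughly like this; again, every name here is an assumption for illustration:

```dart
import 'package:googleapis/logging/v2.dart';

// Illustrative: the helper gains an isTrace flag so trace logs and
// general logs can be told apart.
Future<void> logMessage(
  LoggingApi api,
  String projectId,
  String traceId,
  String spanId,
  String message, {
  bool isTrace = false,
}) async {
  final entry = LogEntry(
    logName: 'projects/$projectId/logs/demo-app',
    resource: MonitoredResource(type: 'global'),
    textPayload: message,
    // Only attach trace context when this entry is part of a trace.
    trace: isTrace ? 'projects/$projectId/traces/$traceId' : null,
    spanId: isTrace ? spanId : null,
  );
  await api.entries.write(WriteLogEntriesRequest(entries: [entry]));
}
```

At the call sites, the button tap and the image load would both pass isTrace: true, so the difference between those two entries gives the timing for that operation; everything else just omits the flag.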
Once we have that up and running, we can open up the
logging console and see how it shows up. I've pre-generated
a bunch of data here so that we can
see a sense of normal. And, here, it looks a lot like
that fake graph of mine, maybe just reversed. It's a histogram of the various
logs colored by their level. Most are just info, so we see
that steady stream of normal, those light blue bars. And that trickle of
red, here at the bottom, represents the yellow
and red button presses, those critical and
error logs respectively. Going back and
looking at the traces, we can also get a sense for the
timing of these user actions. Most of these take a
reasonable amount of time, but some are quite a
bit slower than others. Curious. These metrics give us a
good idea of what's normal. That's our baseline,
the most common duration between these two widgets. Now, this may be an instance
running in Debug mode, but we can still click a
few buttons here and there. Click some red ones,
click some yellow ones, click some blue ones, and
go back to the console and see, for the
outright errors, if we click through
to a specific entry, we should be able to
see the stack traces that I added earlier. And there are our logs, and there
are some minute traces, too. Now, as mentioned
earlier, in reality, you're likely to have lots of
instances of this app running in production. And, under normal circumstances,
everything works great. That's why we call it normal. And so we don't
want all of those to be constantly logging
near-useless metrics. And so I've pulled Firebase into this app to handle some of
that fire hose of information. Using Remote Config,
I've set a value to control how much gets logged. And once we have that
baseline understanding of what normal
looks like, we can use that Remote Config
value to pull back on the logging frequency. In the code, there are
three things to note. The Firebase client is
using those same credentials that we created for
the logging service. For the Config
client, we also have to set some default values. In this case, log everything by
default, unless told otherwise. And then, within
the logging helper, I check the Config client to
see if there are actual values there for us to use. Since this is just
an info-level log, I might not want
to always log it. The Config value will
tell me if I need to and, if I shouldn't, just
skip everything else. That way, I have now limited the rate at which this app sends logs, if needed.
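As a sketch of that gate, assuming the firebase_remote_config package; the parameter name log_info_messages is made up:

```dart
import 'package:firebase_remote_config/firebase_remote_config.dart';

// Sketch of the Remote Config gate over info-level logging.
Future<bool> shouldLogInfo() async {
  final remoteConfig = FirebaseRemoteConfig.instance;
  // Default: log everything, unless the server says otherwise.
  await remoteConfig.setDefaults(const {'log_info_messages': true});
  // In a real app you'd fetch once at startup, not on every log call.
  await remoteConfig.fetchAndActivate();
  return remoteConfig.getBool('log_info_messages');
}
```

Inside the logging helper, an info-level log can then bail out early with something like: if (severity == 'INFO' && !await shouldLogInfo()) return;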
OK, throughout this talk, I've shown a few things but, most importantly,
defining a crash as anything that's not normal. And to understand
what not normal is, we first needed to
understand what normal is. To do that, I used Google
Cloud's logging service, treating our app as just
another distributed system to get a feel, with numbers,
for what that actually is. After that, I looked at how
to use Firebase to help limit some of that fire hose of data. Under the vast majority of
cases, everything is fine. Once we know the
baseline of normal, we might want to dial down
that logging frequency so that we don't exhaust
our logs quota too quickly. That's been me watching
a Flutter app fail, but with logging and
metrics behind it. Thanks for watching.