Watching a Flutter app crash

Captions
[MUSIC PLAYING] ANDREW FITZ GIBBON: Hello, everyone. Now, I'm sure we've all seen this happen. Your app crashes. Not only have we seen it happen, we've watched it unfold in real time and then really hoped the user sends us a crash report. And these crashes take a variety of forms, right? Maybe it's an application not responding, a blue screen, a kernel panic, or any of the many other forms of a crash. We've all seen these happen, and the more devices we end up running on, the more we'll run into different kinds of crashes. App crashes happen. And, as much as we, as developers, try to mitigate them, they will continue to happen. So how do we actually see them happen if our apps are deployed and out in the wild? How do we measure and observe these highly distributed processes?

Hi. I'm Fitz, Developer Advocate with Flutter. Early on in my computing career, I was a sysadmin responsible for keeping the servers, and the services on them, up and running 24/7. And, in my first software engineering job, I was heavily involved in the team's DevOps work. It wasn't called that back then, but it was the same deal: make sure our services stayed up and deployments happened successfully. The common thing between both of those? App crashes happened. It's definitely a thing.

In both, I got really comfortable looking at graphs like this fake one here, showing us how well things are behaving. This graph might be fake, but it's got the right idea: measuring the health of our app. The y-axis is the percentage of apps that don't report a crash; the x-axis is time, roughly segmented by day. Everything looks great here, right? A smooth, straight line at 100%. Except, what is that on day 4? Alarm bells start going off in my head. That tiny little bump is obviously a huge problem, but is it a crash?

So before we get into the code, let's clarify that word, "crash." If you're just using an app, going about your business, then the app is not crashing. Everything's working and normal, whatever "normal" means for that app. So, in its simplest form, a crash is just anything happening that's not normal. On one end of things, we have normal: everything's fine. And to understand whether something is a crash, that is, not normal, we'll have to understand what normal is, too. On the opposite end, we have those things that are obviously and definitely a crash: uncaught exceptions, segfaults, whatever it is, the app is definitely not working anymore. And, in the middle, we have things that are slightly different from normal. Maybe bad things happen inconsistently, or things happen slower than normal. These kinds of crashes are sometimes small enough that people don't notice them. My question, then: how do we see these things happening in the first place, so that we can figure out what to do about them later?

To explore this, I built a simple demo app. It starts with a Google Sign-In and then flips to a randomly generated grid of clickable squares. The blue ones are pretty boring. Click on one, and we get a new page with a nice image. Go back. Click another blue square. Same thing. It's pretty reliable, and we can go about our business as normal, without crashes, and keep doing our perfectly normal thing of clicking blue squares. The yellow ones are similar but slightly different. Sometimes we click them and the same thing happens as with the blue squares: we go to a new page with a photo. Sometimes it takes a while. And sometimes it simply crashes. Whoops.

You might be able to guess, by now, what the red ones do. Click a red square. App crashes. Restart. Click a red square, and it crashes again. That is our definite-crash variety. What I'll do for the rest of this talk is show how to monitor these things, in the hopes that we can do something about them later.

And the reason this is a problem is that, yeah, we've got our app running here, and that's all well and good when you've got one instance, it's on your dev machine, and you can attach the dev tools. In reality, your app is wildly popular, there are many, many, many thousands of instances of it, and you can't attach a debugger to any of them. As a former sysadmin working in the scientific computing space, to me this looks a whole heck of a lot like a distributed system. And how do we monitor highly distributed systems? With the cloud.

So I've got this app wired up with both Google Cloud and Firebase, maybe in a slightly unusual way, but today I'm going to focus mainly on the Google Cloud integration. This app uses it in three ways. First, to get a sense of what normal is: how long do various user actions actually take? In the demo app, this would be things like the time between a user clicking on a square and the image showing up. Second, and a big part of that, is just general logging: getting some extra information about what the various instances of the app are doing, so that we can see what's happening even without user interaction. And finally, of course, the interesting bit: the outright crashes, with stack traces where possible.

For this app, we first sign in with Google to get that authenticated client. On platforms where Google Sign-In isn't available yet, you'd use a lower-level OAuth 2.0 client instead but, in both cases, we need to end up with an authenticated HTTP client. Then, at the point of writing a log message, you import the relevant Google APIs packages, in this case logging and trace, to access those respective services.

These are pretty raw API clients, so for this case I have a small wrapper to help out. This wrapper does three things. First, it initializes the resource that we're working with; this defines, most importantly, the cloud project to send the logs and traces to. Second, it creates the log entry itself; the message and its log level (info, critical, error, et cetera) go here. And finally, it makes the async call to actually write the log message to the cloud service.

Next, in order to log uncaught exceptions, I've wrapped the top-level runApp to catch both Flutter errors and all other errors. These both call out to the same logging helper we just wrote. Let's go turn this into some actual code.
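
The actual code is shown on screen in the video; purely as an illustration of the shape it takes, here is a minimal sketch of a sign-in-plus-logging wrapper, assuming the google_sign_in, extension_google_sign_in_as_googleapis_auth, and googleapis packages. The class name CloudLogger, the project ID, and the log name are made up for the example, and the sign-in call follows the older GoogleSignIn(scopes: ...) style, which differs across package versions.

    import 'package:google_sign_in/google_sign_in.dart';
    // Adds authenticatedClient() to GoogleSignIn for use with googleapis.
    import 'package:extension_google_sign_in_as_googleapis_auth/extension_google_sign_in_as_googleapis_auth.dart';
    import 'package:googleapis/logging/v2.dart' as logging;

    /// Hypothetical wrapper around the raw Cloud Logging API client.
    class CloudLogger {
      CloudLogger(this._api, this._projectId);

      final logging.LoggingApi _api;
      final String _projectId;

      /// Signs in with Google and builds a logger for [projectId].
      static Future<CloudLogger> create(String projectId) async {
        final googleSignIn = GoogleSignIn(
          scopes: [logging.LoggingApi.loggingWriteScope],
        );
        await googleSignIn.signIn();
        // An auth client backed by the signed-in user's credentials,
        // or null if sign-in did not complete.
        final client = await googleSignIn.authenticatedClient();
        if (client == null) {
          throw StateError('Google Sign-In did not yield an authenticated client');
        }
        return CloudLogger(logging.LoggingApi(client), projectId);
      }

      /// Writes a single entry to Cloud Logging.
      Future<void> writeLog(String message, {String severity = 'INFO'}) async {
        // 1. The resource we're working with: most importantly, which cloud
        //    project the logs and traces go to.
        final resource = logging.MonitoredResource()
          ..type = 'global'
          ..labels = {'project_id': _projectId};

        // 2. The log entry itself: the message and its level (INFO, ERROR,
        //    CRITICAL, and so on).
        final entry = logging.LogEntry()
          ..logName = 'projects/$_projectId/logs/flutter_demo_app'
          ..textPayload = message
          ..severity = severity;

        // 3. The async call that actually writes the entry to the cloud service.
        await _api.entries.write(
          logging.WriteLogEntriesRequest()
            ..resource = resource
            ..entries = [entry],
        );
      }
    }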
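
And a sketch, under the same assumptions, of the top-level wrapping for uncaught errors: FlutterError.onError catches framework errors, runZonedGuarded catches everything else, and both forward to the hypothetical writeLog helper above. MyApp stands in for the app's real root widget.

    import 'dart:async';
    import 'package:flutter/material.dart';

    CloudLogger? logger; // the hypothetical helper from the previous sketch

    void main() {
      runZonedGuarded(() async {
        WidgetsFlutterBinding.ensureInitialized();
        logger = await CloudLogger.create('my-cloud-project'); // made-up project ID

        // Flutter framework errors (build, layout, paint failures, and so on).
        FlutterError.onError = (FlutterErrorDetails details) {
          FlutterError.presentError(details); // keep the default console output
          logger?.writeLog(
            'Flutter error: ${details.exceptionAsString()}\n${details.stack}',
            severity: 'CRITICAL',
          );
        };

        runApp(const MyApp());
      }, (Object error, StackTrace stack) {
        // Everything else that escapes to the zone, logged with its stack trace.
        logger?.writeLog('Uncaught error: $error\n$stack', severity: 'CRITICAL');
      });
    }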

Now that we have the client and the basic logging mechanisms, let's actually use them. Here is where the various square clicks are handled: a single onTap for all three colors. The destination is the same, so let's use the same function. Obviously, things are a little contrived here, randomly forcing different unique exceptions to be thrown in order to guarantee that they're not caught. But my first step is to get a sense of normal: how long does it take between tapping a button and the Dash image showing up?

So, in my helper, I'm going to add some tracing. If we don't already have a trace and a span ID (these represent the operation that we're doing), we should first create one for this instance of the app. If we already have one, we can just reuse it. Then we can add it to the log entry to get some of those timings. I also want to differentiate between what's part of a trace versus what's just a general log, that second thing we wanted to do with cloud logging, so I'll add another Boolean parameter to say that. I'll then have to go back to the callers of this helper and add this parameter. When the button is clicked, for example, no matter what the color is, log it as a trace. Second, when the new widget loads with that image, we'll send another trace log up. The difference between those two timings tells us how long that operation takes. All the other logs are just general logs; for now, they're not traces.

Once we have that up and running, we can open up the logging console and see how it shows up. I've pre-generated a bunch of data here so that we can get a sense of normal. And here it looks a lot like that fake graph of mine, maybe just reversed. It's a histogram of the various logs, colored by their level. Most are just info, so we see that steady stream of normal, those light blue bars. And that trickle of red at the bottom represents the yellow and red button presses, those critical and error logs, respectively. Going back and looking at the traces, we can also get a sense of the timing of these user actions. Most of these take a reasonable amount of time, but some are quite a bit slower than others. Curious. These metrics give us a good idea of what's normal. That's our baseline: the most common duration between these two widgets.

Now, this may be an instance running in debug mode, but we can still click a few buttons here and there. Click some red ones, click some yellow ones, click some blue ones, and go back to the console. For the outright errors, if we click through to a specific entry, we should be able to see the stack traces that I added earlier. And there are our logs, and there are some minute traces, too.

Now, as mentioned earlier, in reality you're likely to have lots of instances of this app running in production, and, under normal circumstances, everything works great. That's why we call it normal. We don't want all of those instances to be constantly logging near-useless metrics, so I've pulled Firebase into this app to handle some of that fire hose of information. Using Remote Config, I've set a value to control how much gets logged. Once we have that baseline understanding of what normal looks like, we can use that Remote Config value to pull back on the logging frequency.

In the code, there are three things to note. The Firebase client uses those same credentials that we created for the logging service. For the Config client, we also have to set some default values; in this case, log everything by default, unless told otherwise. And then, within the logging helper, I check the Config client to see if there are actual values there for us to use. Since this is just an info-level log, I might not want to always log it; the Config value will tell me whether I need to and, if I shouldn't, we just skip everything else. That way, I have limited the rate at which this app sends logs, if needed.
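
To make the trace part concrete, here is one hedged sketch of how the helper's writeLog might grow a trace ID, a span ID, and that Boolean is-this-a-trace parameter. The trace and spanId fields on LogEntry are real Cloud Logging fields; the ID generation, the isTrace name, and the call sites are assumptions for the example.

    // Inside the hypothetical CloudLogger class from the earlier sketch
    // (also needs: import 'dart:math';).

    String? _traceId; // one trace per instance of the app, created lazily

    /// Writes a log entry; when [isTrace] is true, the entry is attached to a
    /// trace and span so the timings of related entries can be correlated.
    Future<void> writeLog(
      String message, {
      String severity = 'INFO',
      bool isTrace = false,
    }) async {
      final entry = logging.LogEntry()
        ..logName = 'projects/$_projectId/logs/flutter_demo_app'
        ..textPayload = message
        ..severity = severity;

      if (isTrace) {
        // Reuse this instance's trace ID if we already have one; otherwise create it.
        _traceId ??= _randomHex(16); // 32 hex characters
        entry
          ..trace = 'projects/$_projectId/traces/$_traceId'
          ..spanId = _randomHex(8); // a fresh 16-character span ID per traced entry
      }

      await _api.entries.write(
        logging.WriteLogEntriesRequest()
          ..resource = (logging.MonitoredResource()
            ..type = 'global'
            ..labels = {'project_id': _projectId})
          ..entries = [entry],
      );
    }

    String _randomHex(int byteCount) {
      final rng = Random.secure();
      return List.generate(
        byteCount,
        (_) => rng.nextInt(256).toRadixString(16).padLeft(2, '0'),
      ).join();
    }

    // Hypothetical call sites:
    //   logger?.writeLog('square tapped', isTrace: true);     // in the onTap handler
    //   logger?.writeLog('image page shown', isTrace: true);  // when the new widget loads
    //   logger?.writeLog('some other event');                 // a general, non-trace log

The gap between the timestamps of the two traced entries is the duration we care about: tap to image.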
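
And a sketch of the Remote Config gating just described, assuming the firebase_core and firebase_remote_config packages. The parameter name log_info_messages and the writeLogIfEnabled wrapper are made up; setDefaults, fetchAndActivate, and getBool are the real Remote Config calls.

    import 'package:firebase_core/firebase_core.dart';
    import 'package:firebase_remote_config/firebase_remote_config.dart';

    late FirebaseRemoteConfig remoteConfig;

    Future<void> initRemoteConfig() async {
      await Firebase.initializeApp();
      remoteConfig = FirebaseRemoteConfig.instance;

      // Default values: log everything, unless the fetched config says otherwise.
      await remoteConfig.setDefaults(const {'log_info_messages': true});
      await remoteConfig.fetchAndActivate();
    }

    /// Checks the config before the (hypothetical) logging helper does any work,
    /// so chatty info-level logs can be dialed down once a baseline exists.
    Future<void> writeLogIfEnabled(String message, {String severity = 'INFO'}) async {
      if (severity == 'INFO' && !remoteConfig.getBool('log_info_messages')) {
        return; // skip everything else: we already know what normal looks like
      }
      await logger?.writeLog(message, severity: severity);
    }

The talk wires Firebase up with the same credentials as the logging client; this sketch only shows the default-plus-check flow with the standard initialization.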

OK, throughout this talk I've shown a few things, but most importantly I've defined a crash as anything that's not normal. And to understand what not normal is, we first needed to understand what normal is. To do that, I used Google Cloud's logging service, treating our app as just another distributed system, to get a feel, with numbers, for what normal actually is. After that, I looked at how to use Firebase to help limit some of that fire hose of data. In the vast majority of cases, everything is fine, and once we know the baseline of normal, we might want to dial down that logging frequency so that we don't exhaust our logs quota too quickly. That's been me watching a Flutter app fail, but with logging and metrics behind it. Thanks for watching.
Info
Channel: Flutter
Views: 18,505
Keywords: App development, production app troubleshooting, Flutter SDK, App developers, troubleshooting, app production, Google I/O, Google IO, I/O, IO, Google I/O 2022, IO 2022
Id: aIqy-Ulu4Gw
Length: 12min 23sec (743 seconds)
Published: Thu May 12 2022