Meet Gatus - An Advanced Uptime Health Dashboard

Captions
As you start to self-host more and more services, it's probably likely that you'll start to depend on some of these. And what do you do if one of them goes out? Do you wait for someone to tell you? [Random person enters the room] "Hey, is the internet down?" Or do you set up your own alerting and monitoring system? Now, we know that there are plenty of uptime and monitoring services out there, but the one we're going to look at today, Gatus, does things a little bit differently, taking more of a developer's lens on uptime monitoring. You can monitor things like websites with HTTP or GraphQL, check on DNS entries, ping hosts, or even make connections using TCP or UDP. And on top of that, it's going to measure response times and plot them on a chart over time. So you get more data than just whether the service is up or down. You also get tons of alerting options like email, Slack, Discord, Teams, and many other services that you can hook into to let you know if something's going wrong. And a couple of things that make this really exciting, at least for me: once you set this up, this becomes your status page. There is no backend to configure because it's all config based. Now, I get it. The config based approach with YAML isn't for everyone. But I really like this approach because it makes it repeatable. It makes it GitOps. It makes it copy and paste friendly. And there is no backend or WYSIWYG editor or anything to set up. This config based approach allows you to configure it like code and then deploy it along with the service. And it's automatically configured when it spins up. Oh, and another nice thing too is you have options for storage. You can use in-memory if you don't want to use any storage at all. You can use a SQLite database, which is easy to set up. Or you can use, my favorite, a Postgres database. So that's what we're going to do today. We're going to spin up Gatus inside of a Docker container.
We're going to create a Postgres database and connect Gatus to it. We're going to then deploy some monitors to monitor some services. Then we'll set up some alerts to know if something goes wrong. Then we'll test Gatus by taking some of our services down. And then we'll explore the status page, looking at all of the features and all of the charts that it makes for us. Sound good? So let's get started. Before we dive in, I just want to let you know that all of this will be documented, and you can find the link to my documentation in the description below. So Gatus is open source and you can find it on GitHub. Another nice thing about Gatus is that it's written in Go, so it's going to be super small and super performant. Not a lot of bloat when it comes to Go. And they have lots of documentation here. You can see it all here. But I'm going to walk you through how to get this set up. The first thing you want to do is to remote into the server that's running Docker. Once you get there, you want to create a directory for Gatus and then you'll want to cd into that directory. Then we'll want to create a docker-compose.yaml file and then we'll want to edit that file using nano or vim or whatever you want to use. In here, we're going to paste our Gatus config. And this looks like a typical Docker compose file. You can see our image, our restart preferences, some ports we're going to expose, 8080 on the outside, 8080 on the inside, and then some environment variables. We'll talk about those here in a second. Then we're going to map a volume from a config folder that we still need to make to the container's config folder inside its root directory. So about environment: you saw that I was pretty excited about this supporting Postgres. We're actually going to use Postgres to store our data. So in here, we're going to set a couple environment variables. One is our Postgres user. That's a user of our choosing. I'm just going to call it Gatus user.
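Based on the walkthrough above, a compose file along these lines would fit the description. This is a sketch, not the exact file from the video; the image tag, folder name, and the placeholder credential values are assumptions:

```yaml
services:
  gatus:
    image: twinproduction/gatus:latest
    restart: unless-stopped
    ports:
      - "8080:8080"                         # host:container
    environment:
      - POSTGRES_USER=gatus_user            # a user of our choosing
      - POSTGRES_PASSWORD=gatus_password    # demo value; use a real secret
      - POSTGRES_DB=gatus_uptime            # the database name
    volumes:
      - ./config:/config                    # local config folder -> container's /config
```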
And then our Postgres password, which I'm just going to put as Gatus password. And then our Postgres database, which would be the IP address of our Postgres database. So that's a lot of Postgres without explaining what Postgres is. It's an open source relational database, like MySQL and a lot of other SQL-like databases. And it's a favorite among a lot of startups and a lot of people due to the licensing and how open it is. Now, if you're not running Postgres, you could absolutely create Postgres in this same stack that we're going to set up. But I would recommend against that and instead create your own Postgres database. It can still be in Docker and it could still use this configuration right here. But I would create it in a standalone stack so that other services can access it easily and it's not tied to this stack. Anyways, you have a ton of choices on how you want to set it up. But after you set it up, we need to create a database, we need to create a user, and then we need to connect to it. So to connect to it, I use pgAdmin; you can see some of my databases right here. In here, we'll want to create a Gatus database. You can see that I already have one because this is my production. But I'm going to create one called Gatus uptime. So let's create one, Gatus uptime. And we don't need to do anything else. Let's just create this database first. You can see we have a database; we don't have any tables or anything like that. It's just a blank database. Next, let's create a user that can access this database. So I'm going to create a new user; I'm going to call this Gatus uptime user. Then you want to create a secure password. I'm setting mine as Gatus password just for this demo, but I'll delete this user and the database later on. Then we'll need to set some privileges for this user. Can they log in? Yes, they can log in. That's about all they need from this page. Then you don't need to set any memberships and you don't need to set any security.
Let's just create this user. Let's save. Now you can see our Gatus uptime user. Now we need to give this user access to our database. So let's go back to our database. And then let's go to properties. And let's go to security. And let's go to privileges. And let's add some privileges. And who is the grantee? Well, it's the user we just created, Gatus uptime user. And what privileges does it have? We're going to say all, but this is nice about pgAdmin: it's all but grant. So they can create, they can do temporary, they can do connect, but they don't have the grant option, which is just super nice that they do that for you. Anyways, these are the permissions it needs. Let's do a save. Okay, so now we have our user and we have our database, and that user has access to this database. Let's go back to our Docker compose and fill in those details. So I did say that this is Gatus underscore user and the password was Gatus password. So that's right. And then for the Postgres database, this is actually going to be the name of the Postgres database that we just created. I know that in this note I said that this should be the IP address or the host name, but that's not it. It's a wrong note, a comment on my part. But it should be the name of the database, which is Gatus uptime, which we can see in pgAdmin that we just created. Okay, now let's save that. But before we close out, we have one more task to do. And that's this volumes piece. So we need to create a config folder that points to the config folder within the container. So let's create the config folder. Let's mkdir config. Make sure we have it; we do. Let's create a config.yaml. And then let's edit this config.yaml. What's going to go inside of the config.yaml? Well, a couple of things. So inside of this config, we actually have two sections: we have storage, which is our Postgres database, and then we have our endpoints, which are all of the endpoints that we want to monitor.
And for storage, we need to set up the path to our Postgres database. So this path right here is a connection string to connect to our database. And you can see here, it's going to use the environment variables that we pass in. So it's going to use our Postgres user, colon Postgres password, @Postgres on port 5432, on our database, and disable SSL mode because this Postgres setup doesn't have SSL turned on. Anyways, that's not important, at least not for this demo. But what is important is that we actually need to make one change. So right here, you can see @Postgres. And this is supposed to be your IP address or your DNS entry. If you're running it in a container on the stack, you probably named it Postgres, and this will connect just fine. But we actually need to put in the IP address of our Postgres server if it's outside of this Docker stack. So the IP address of my Postgres server is 192.168.30.240. And obviously, you can use a DNS entry too, if you'd like, I'm just going to put my IP address. And 5432 is the default Postgres TCP port. If you use something different, you'll want to change that. If you're using the default, it's 5432. So now that we have that, let's take a look at the rest of our endpoints. Now these are some example ones that they have set up, we're going to customize it here in a little bit. But let's talk through this really quick. So first of all, we have a name for this first item, this first endpoint, it's in a group called core, the URL we're going to monitor, how often we're going to monitor it, and then some conditions. So we have a list of conditions, we want to make sure that our status is 200 and HTTP status code 200 is okay. And then we want to make sure that our certificate expiration is greater than 48 hours, which this is pretty nice. So we can do certificate monitoring too. If it dips below 48 hours, then it will show an error. 
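Putting the storage description above into YAML, the section would look roughly like this. The host IP and port shown are the ones mentioned in the walkthrough; the exact layout is a sketch based on Gatus's documented storage options:

```yaml
storage:
  type: postgres
  # Connection string built from the environment variables passed into the container.
  # Replace the host with your Postgres server's IP or DNS name; 5432 is the default port.
  # sslmode=disable because this Postgres setup doesn't have SSL turned on.
  path: "postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@192.168.30.240:5432/${POSTGRES_DB}?sslmode=disable"
```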
And similarly, we have another endpoint here that's called monitoring; it's in a different group called internal. And it's going to monitor this endpoint here every five minutes, and it's going to look for a 200 status code. So pretty self explanatory there. Next is another endpoint, this one's called NAS; it's in the same group as the one above it, internal, same example.org, it's looking for 200. And now we get into something a little bit different, like I talked about earlier, and this is monitoring DNS. So you can see the name of this endpoint is just example DNS query. And then this is nice: we can actually point it to a specific DNS server to use. So we're going to point this to 8.8.8.8, that's Google's DNS, the interval is five minutes, and then the query name we're going to look up is example.com. And we're looking for an A record. And then for conditions, in order for this to pass, we're going to look at the body, and we're going to make sure that that A record is pointing to 93.184.216.34. And then we're also looking at this DNS RCODE, making sure it's equal to no error. And last but not least, we have an ICMP request or ping. And this is going to the URL with the protocol ICMP, and we're going to ping example.org. We're going to do this every one minute. And we're going to make sure that the condition matches connected equals true. So you're probably wondering, what the heck are all these conditions? And how do I know what these conditions should be? Well, if we look at their documentation page on GitHub, you can see some of the conditions that you can check for. For example, if you want to check a condition and make sure it's less than 300 (this one's actually really good), it will pass. And so that means anything below 300 will actually pass. Because in HTTP, anything with a 200-ish status code is okay; once you get into 300, you're getting into redirects and stuff like that.
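The example endpoints walked through above would look something like this in config.yaml. This follows the shapes shown in the Gatus documentation; the names and intervals mirror the examples described in the video:

```yaml
endpoints:
  - name: website
    group: core
    url: "https://example.org"
    interval: 5m
    conditions:
      - "[STATUS] == 200"                  # HTTP 200 OK
      - "[CERTIFICATE_EXPIRATION] > 48h"   # error if the cert expires within 48 hours

  - name: example-dns-query
    url: "8.8.8.8"                         # the DNS server to query (Google's)
    interval: 5m
    dns:
      query-name: "example.com"
      query-type: "A"
    conditions:
      - "[BODY] == 93.184.216.34"          # the expected A record
      - "[DNS_RCODE] == NOERROR"

  - name: icmp-ping
    url: "icmp://example.org"              # ICMP protocol, i.e. a ping
    interval: 1m
    conditions:
      - "[CONNECTED] == true"
```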
So anything lower than 300 should be okay. So this is actually a really good one; we should be using this one. But on a GET, a 200 is what you're going to get anyway. So anyways, I don't mean to go all developer on you, but I really like HTTP status codes. And you can see another one here, must be greater than 400, pretty self explanatory. And this one right here we saw in ours too: connected equals true. So this means we actually made a successful connection. So say, for instance, you want to check TCP port 5432 to see if Postgres is up. You would query on that port, which we'll do here in a little bit, and you would make sure that connected equals true. Now if the database server was down, you would get a false and it would fail. But if you want to explore these on your own, you absolutely can. But let's continue setting up our endpoints so we can get some uptime statuses. So back in our config, we want to save this and then we want to close out. And then let's see where we're at. Okay, we're in our config folder. So let's go one folder up. Let's do an ls and we should be right here. So you want to be in the root of the folder that we created, which was Gatus uptime, basically at the same level as your Docker compose file. Next, we want to do a docker compose up -d. We don't need to do a force recreate, but I'm going to do it to spin up this container. So let's spin up this container. Let's check on this container really quick. Let's do a docker logs. And I'm glad I actually checked, because it looks like I have something going on right here. That's why I'm glad I did a docker logs: password authentication failed for user Gatus user, so I must have something wrong in my configuration. So let's fix that really quick. And I see what's going on here: I named it Gatus user. And if we look at pgAdmin and look in the database, it should be Gatus uptime user. So let's fix that really quick. Okay, so let's save this.
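The TCP check on Postgres described above, along with the more lenient status condition, could be written like so. This is a sketch; the host IP here is the one used elsewhere in this walkthrough, and the endpoint names are illustrative:

```yaml
endpoints:
  - name: postgres
    group: internal
    url: "tcp://192.168.30.240:5432"   # TCP connection to the Postgres port
    interval: 1m
    conditions:
      - "[CONNECTED] == true"          # false (and a failure) if the server is down

  - name: frontend
    group: external
    url: "https://example.org"
    interval: 1m
    conditions:
      - "[STATUS] < 300"               # any 2xx passes; 3xx redirects and above fail
```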
Let's start that container back up again, look at my logs. And now we don't have any errors in our log. So this is good. So now we can get to this using the IP address of this server and the port of 8080 that we set. So if we go to it, here it is, it's already doing some uptime checks. Pretty cool. So we have this first uptime check: it's checking example.org, checking for 200 and making sure the certificate's not expired. And it's not. Our response time was 69 milliseconds, and that was 55 seconds ago. I think this is going to check every five minutes, so we're not going to see it that often. But we'll fix all this here in a second. And as you can see, we can group these. So this is the group of core. And then this was the group of internal. So we had one called monitoring on example.org, and our NAS here; we can see we have green checkmarks there. Awesome. Then we have one that's not in a group at all. So our DNS query to Google, and it's looking up, I think, example.com. And here we are, we're getting a response back that the A record does match 93.184.216.34. And then we're pinging example.org, and we can see we're getting a ping back in eight milliseconds. And then if we drill into these, let's go back into this example one, and you can see it's starting to plot this out for us. So it's going to plot out the response times over time for all of the services that we're measuring. So this is pretty cool. Now we don't have a lot of data there, but we can see we also have these badges, which is also pretty cool too, because you can use these badges wherever you want. If you do an inspect, open image in new tab, you'll get the URL to this badge, and then you can use that in your documentation or put it anywhere you like. So that's pretty cool, we have these badges too. And then all of our events will start to be listed here too.
So pretty cool, pretty easy. So now let's make this a little bit more useful, use less example data, and monitor some things that I actually want to monitor. So I'll bring this into VS Code because it'll be a little bit easier to see, but I'm going to make some changes to this. Now I'm not going to change anything in the storage section because I still want to use the Postgres database, but I am going to modify the endpoints that I'm going to monitor, and I cleaned them up a little bit using anchors. We'll talk about that. So let's monitor a few things. What are some things I want to monitor? Well, I want to monitor my shop, my brand new shop I just launched. You should totally check it out. This is part of the dark mode collection along with some other things, but that's something I definitely want to monitor. I also want to monitor my website, TechnoTim.live. I want to monitor my short links site. Then I want to monitor DNS for my shop site, so I don't accidentally change DNS and have all kinds of problems. And then I want to ping it too, even though I really don't need to ping it. My HTTP request above kind of does that same thing, but we'll ping it just for fun. And then I want to monitor the Postgres database that it's running on. I put in the internal DNS name; you could use the IP address here, but I want to monitor that database too. So you're probably wondering, well, what's going on here? So this is a YAML anchor. So I cleaned this up a little bit, and the way that this works: instead of listing out every one of my endpoints with a group, a URL, an interval, and conditions, I set up some defaults for all of my endpoints and I called it defaults. So what I can do is actually use that as a template. So here you're only seeing the URL, but I'm referencing this defaults anchor, as they call them in YAML, which is like a template. So this is going to say, hey, okay, your group is external. Your interval is 30.
Your timeout is 10 seconds. And here's the condition I'm looking for. So I did this for all of the HTTP requests, the web requests. So defaults here, defaults here, defaults here, defaults here. It keeps it really clean. YAML anchors are pretty cool. They're kind of hard to understand at first, the syntax is super weird, but once you do it once, it's pretty awesome. So now I didn't have to do that everywhere. I should probably call this default HTTP instead of endpoint defaults, but that's beside the point. And now I can use that anywhere in here. So why didn't I use these everywhere? Why didn't I use it for DNS or for my ping or Postgres? Well, if you look at it, I'm looking for a status of 200, and I'm checking the certificate too. And these have different timeouts and intervals. And these conditions don't apply to DNS, and they don't apply to ping, and they don't apply to Postgres. So let's copy and paste this and update our configuration file. So let's edit our config/config.yaml, edit this guy, going to paste all of this in here, going to save, quit. And let's do a docker compose up -d. We can stop it and start it, but I'm going to do this force recreate. Probably not the best thing to do, but it's easy and it always works. So now that that's up, let's do a docker logs. I don't know, I'm not feeling as brave as I was last time after seeing that fail. So let's look at the Gatus logs. Does look like some things are going on. Okay, yeah, it's going, it's going. Okay, so let's check out our website. Now, we go back and we refresh. Awesome. So now we can see it is monitoring all this stuff. Pretty cool. So I have them grouped in external. And in internal I have my Postgres database, and it already did a check and it's connected; response time is three milliseconds. And then in external, I have all of my sites that I set up, so l.technotim.live.
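The anchor pattern described above might be sketched like this. This assumes Gatus's YAML parser supports merge keys (`<<:`) and tolerates the extra top-level key holding the anchor, which is how this pattern is commonly done; the URLs here are hypothetical:

```yaml
endpoint-defaults: &defaults       # the anchor: shared settings defined once
  group: external
  interval: 30s
  timeout: 10s
  conditions:
    - "[STATUS] == 200"
    - "[CERTIFICATE_EXPIRATION] > 48h"

endpoints:
  - name: shop
    url: "https://shop.example.com"   # hypothetical URL
    <<: *defaults                     # merge in group, interval, timeout, conditions
  - name: website
    url: "https://www.example.com"    # hypothetical URL
    <<: *defaults
```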
So my link site DNS for my shop, my shop site, pinging my shop, and then my main website where the documentation is maybe something you're on right now. And then again, if we go into here, we could see we have some checks, I think I set my checks to be around 30 seconds, we have some coming in. And then we have our response time. So pretty cool. Now I already set this up and I let it go for a day or two, just so I can build up some data to fill up these charts. So it's kind of like a cooking show where I had one baking in the oven. Well, I'm going to pull that one out of the oven right now. Give me a second. So here's the one I had running for a little over a day, I think you can see all of these checks are checking in, everything looks good. And if we drill into say my shop, you can see we have some additional data and some plots over time. So you can see my uptime is 99.99%. It must have been down for just a second when it checked. Over the last 24 hours, it's been 100%. Last hour it's been 100%. My response time 109 milliseconds should probably be a little bit faster than that. But over the last 24 hours, it's been 118 milliseconds. In the last hour, it's been at 112 milliseconds. And you can see it was unhealthy for one minute, 37 minutes ago. So that's probably why my uptime is a little bit lower than it should be. I need to check on my shop. But you can see the same thing. Let's look at something else, something I host. And here, this is internally, you can see my response times are super duper fast, 12 milliseconds on my website. So that's pretty good. My uptime is 100%. It all looks pretty good. So this is really good. It's really nice, especially for a developer kind of lens on a status page. And the nice thing is when things go down, things go up, this will automatically respond and update the status page. So what happens if something goes wrong? Well, we need to set up our alerts. We need to set those up so we can get notified when something goes down. 
And so we can also see this reflected on our chart here, maybe with something red. With Gatus, we can alert through many different systems to let us know if our services go down or even if they recover. You can see it supports Discord, email, GitHub, Google Chat, Pushover, PagerDuty, Telegram, Teams, Slack, you name it. So a lot of configuration here. I'm going to go with something that I use all the time and usually have on me, and that's Discord alerts. So if we look at the Discord alerts, this is pretty simple to configure, but you can see we have some properties that we can configure. So it looks like in our config file, we're going to add an alerting section, we're going to add the type of alert, or the system we're going to alert through, Discord, and then we're going to set a property of webhook URL. And it's going to be the webhook URL that we're going to create here in a second. Now you can do some other things here, and you can do some fancy stuff too, but I'm actually going to go with a lot of the defaults. Speaking of Discord, you should absolutely join our Discord server. We have a Discord server with almost 10,000 people in it. We have a lot of nice people, a lot of very helpful people when it comes to Homelab and other topics of that nature. So you should totally join. But anyways, plug for Discord. In my private Discord server, I created a channel called Gatus. You can name this anything you like. And so now we need to create a webhook so Gatus can post to it and send a message to it. So it's very simple: you just go to the gear to edit the channel, then you go to integrations, then you go to webhooks, and then you create a new webhook. And then you copy this webhook URL; that's all you need. You can see I already set one up, so I'm going to delete this. I set one up called Gatus and I gave it an icon. You really don't need to because it has its own icons. But let's copy this webhook URL from here.
Next, we'll need to edit our config/config.yaml file on our server again, and we'll need to create that alerting section. So I'm going to go right underneath storage, and I'm going to create an alerting section with my default alerting for Discord. So you can see here I have a webhook URL. Now, this is a secret; you probably shouldn't show this to anyone. I'll delete this when I'm done. But your webhook URL goes here, and then I'm going to set up a default alert. So my default alert is going to have a description of, hey, health check failed. Then you can choose whether or not you want to send a message when the alert is closed, or when it's resolved. I do, because I want to know when my services have recovered. Next, we're going to set up a failure threshold. So how many times does it need to fail before we send this alert? And then our success threshold: how many times does it actually need to succeed before we send the send-on-resolved message? So I set 2 both ways, and I think this is a good setup here. Now that we have these alerts set up, we actually need to add them to each endpoint. To do that, what you could do is add them to each endpoint, like I just said. So you would add alerts here, and then a type of Discord. And this would reference our type of Discord up there and use the defaults. And then you would do this for every site that you have; you would set this here, and you would set this here, so on and so forth. So what you can do if you want to get fancy, remember the anchor we had with our defaults for our endpoints? You can actually add that key there. So our default alert of Discord will now apply to all of the endpoints that use the defaults anchor. Now I know that we don't have that for every one, so let's actually pull this off here, pull this off here, and pull this off here. And we would probably need to add them, say, here, because this isn't using the anchor of default.
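The alerting section and per-endpoint alerts described above would look roughly like this, following the property names in the Gatus documentation. The webhook URL is a placeholder, and the endpoint shown is illustrative:

```yaml
alerting:
  discord:
    webhook-url: "https://discord.com/api/webhooks/<id>/<token>"  # placeholder; keep this secret
    default-alert:
      description: "health check failed"
      send-on-resolved: true     # also send a message when the service recovers
      failure-threshold: 2       # must fail twice in a row before alerting
      success-threshold: 2       # must pass twice in a row before resolving

endpoints:
  - name: website
    url: "https://example.org"
    interval: 30s
    conditions:
      - "[STATUS] == 200"
    alerts:
      - type: discord            # references the discord alerting config and its defaults
```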
This is our DNS, so we need to add it there, so on and so forth. We need to add it here, and we need to add it to our database too. So let's save then, let's close out of here, let's bring our container back up, and then let's check on our site. So if we go back to our site, nothing's different, because all we did was configure those alerts. So let's actually introduce some chaos now. Let's actually bring one of my sites down. Yes, I really need to bring it down. So I'm going to go into my Kubernetes cluster really quick, and let's look for my little links site. So my Techno Tim links, let's look for it here, Techno Tim live, little links. And you can see I have three replicas, so I have three pods running. And now I want to take this down so that the next health check fails. So I need to take down all three of these replicas, or all three of these pods. So let me scale it down to zero. Now we're scaled down to zero, almost; these are still terminating. And then, that quickly, it's already marked as down. Man, that was quick. I guess I have my timings really tight. But you can see, it's marked as down now. I got a status of 404, and 404 isn't equal to 200, so that one failed. The cert expiration check still passed, because it's still terminating TLS, but I got a 404. So that's not good. Now my site is down. My uptime doesn't look as good. The health is set to down on this little badge. Pretty cool. We have this event. But did we get an alert? Well, let's actually check in our Discord server. So I did: here's an alert from Gatus. I'm getting an actual service alert; this is really taking it down in production. But Gatus is saying an alert for external/links.technotim.live HTTP has been triggered due to having failed two times in a row. My health check failed, and it gave us the status code here. Okay, let's scale that back up just in case someone's trying to get to that site. Let's go back into Rancher. Scale it up to three again.
There's one, there's two, there's three. They should spin up really quickly. Let's go back to Gatus. You can see here we had four failures. Now it needs to respond healthy at least twice before it considers it back up. Might need to wait a couple of seconds. There we go. So it's back up now. And you can see we get a 200 and our certificate is still good, obviously. But you can see both of those passed. And if we go back into Discord, we can now see that it's recovered. An alert for external/links.technotim.live has been resolved after passing two successful times in a row. Health check failed: that was the name of the description. That's not really a good description. I probably should have named it links.technotim is down, and then it would say that here. But anyways, that's not important. You can see I get two green checks, and we're back up. And we're getting more checks here. And if we go into here, we can see my uptime is now less than 100%, but we're looking pretty good. And the health is up on this badge. And you can see in the events it recovered. So as you can see, Gatus is pretty powerful. It definitely takes a different approach to configuring your uptime monitors, as well as displaying your status page. Really, because the site takes a config based approach, there isn't anything to configure in the UI. So the UI just becomes your status page with all this information. Now I really like this approach, and I'm going to be test driving this for a little bit. And who knows, maybe it'll replace my current uptime monitor. Well, I learned a ton about Gatus, a ton about HTTP status codes, and I hope you learned something too. And remember, if you found anything in this video helpful, don't forget to like and subscribe.
Info
Channel: Techno Tim
Views: 30,067
Keywords: techno tim, technotim, homelab, home lab, gatus, gatus uptime, uptime monitor, self-hosted uptime, status page, response times, alerts, docker, compose, Developer, endpoints, IaC, infrastructure as code, config, config based, open source, statuspage, uptime, monitoring server, monitoring tool, golang, devops, gitops, notifications, slack, teams, discord, dashboard, health, ping
Id: LeZQjWlDUHs
Length: 28min 36sec (1716 seconds)
Published: Mon Feb 26 2024