Designing My Multi Site HomeLab

Captions
As you saw in my last video, I decided to co-locate some of my servers in a data center not too far away. In that video you saw most of my hardware choices, which gave my HomeLab servers a new life, in a new rack, on a new network, in a new location. And with this new location comes a whole host of challenges: networking, security, VPN, virtualization, DNS, multiple clusters, backups, and a lot more. So I wanted to share some of my progress candidly with you. This will be a lot less formal, but a lot more in-depth about some of the choices coming up. And that's where I could use your help again. Think of this like an architecture review, where you be the judge. Don't go too hard on me. So let's dive into my architecture and get started.

The first piece was getting my network set up. At first I thought this was going to be the really complicated part, but it turned out a little easier than I expected. Moving my public workloads to the co-location simplified a lot of things across my network. A quick review of what my network used to look like: I drew up this diagram just three months ago, and I had lots of VLANs. As you can see, I had my default network, a camera VLAN, an IoT VLAN, a main VLAN, a guest VLAN, a servers-untrusted VLAN, and a servers-trusted VLAN. Between all of these networks I had a lot of complicated firewall rules, or ACLs for lack of a better term. Rules that said, "Hey, devices from this main network can communicate with some devices on the trusted network," or "Anything on the default network can communicate with anything on the outside network," and then a rule saying "Only established and related traffic can flow back." It was really complicated, and I did it because I only had one network and I was hosting things out of my home.

But a lot of that stuff was hosted in this servers-untrusted network you see right here, and moving those workloads out of my home also meant I no longer have to port forward or allow incoming traffic into my network. I had a lot of complicated port forwarding rules to allow that traffic in. Now I can push most of those rules, and some of that complexity, to the UDM that's there in the co-location, which greatly simplifies the rules in my home network. I no longer need probably 75% of them, because I'm no longer allowing some of that traffic in. I was also thinking about flattening out some of my VLANs, but that's where I'll need your help later on.

So I did move some of those VLANs to my co-location, and you can see I just have a couple of VLANs there. I have one called trusted; I'm not even sure if I'm going to use that, or just consider everything there untrusted. I have one just called servers, which isn't really trusted, but it's where I put things like my DNS server that I don't want exposed to the public. Then I have one called public, which is just that: workloads that are exposed directly to the public, either through an ingress controller or a load balancer, with port forwarding rules back to those servers. I'll show you them here in a second. And then management, which is for a lot of my management interfaces and is pretty locked down. So how does that work between my two networks? Well, currently I have a site-to-site VPN set up.
Now, I thought this was going to be really hard, but it turned out to be a lot easier because I ended up using UniFi Site Magic. Part of the reason I chose UniFi devices is that it literally is as simple as checking these checkboxes and clicking connect, and within about five seconds, devices on my main network can communicate with, say, devices on the public network. The initial configuration of Site Magic is one thing, but the firewall rules are another. I'm not sure if this is complicated and confusing because of the way UniFi does it, or if it's just complicated and confusing because that's how site-to-site VPNs and their ACLs are. But after a lot of attempts, I figured out how these firewall rules work between the two sites, and it's this little interface right here called LAN Out. Most of the time I've used LAN In for my firewall rules, which I think restricts traffic coming from inside the LAN; the rules you configure apply to the inside of that LAN connection. I don't know, maybe LAN Out does make sense. I would think it's traffic that's egressing, but I don't think it is; it's actually traffic that has already passed through the VPN and is attempting to enter this VLAN here, so it's on the outside, I think. I'm not a networking person and definitely not an expert when it comes to all things UniFi.

Products aside and focusing on the technology, I set up an SD-WAN, or a site-to-site VPN, between these two sites: home and away. Home obviously being me, and away being the co-location. This isn't a formal or fancy diagram; it's just something I whipped together to get everything out of my head and onto something, and it ended up looking like this. I know you're going to ask what this tool is, because in my last video, when I showed a diagram, I had a thousand people asking me what the tool was because I failed to mention it. It's FigJam, by Figma. You might think of them mostly as a web design tool, but they also make a collaboration board where multiple people can work together, and I found it really good for making these artboards.

Anyway, back to the VPN. I have a site-to-site VPN configured between these two sites now, which allows me to send traffic there and send traffic back if I want. And that makes things a little complicated, just because of all the rules you have to have: not only the rules you want, but also making sure you're not exposing anything you don't want to expose. With Site Magic, I filtered out all of the VLANs I don't want to expose anyway, but I still need to create ACLs to make sure that some of these devices can't get back into my home network. So there's a lot of VPN configuration here, and I could definitely use your help if you have any tips.

At the colo, I have three servers, and I'm running Proxmox on each: PVE1, PVE2, and PVE3. These are my three Proxmox nodes, and within them I have virtual machines running. On the first one I have DNS1, and on the third (don't ask me why it's the third) I have DNS2. So these are two DNS servers. They're actually Pi-hole, as you can see over here. Now, there is an explanation for that.
I know there are better DNS systems out there, but I already had a lot of DNS entries that I created over time, along with a lot of CNAMEs, probably 30, 40, 50 of them, and I didn't want to duplicate all of that in another DNS system, at least not now. So the easiest way I found was to just build two more Pi-holes, put them in here, and then sync back home to my Pi-hole server. And this worked out super well; I was surprised at how well it worked. At home I technically have three DNS servers, and I'm using Gravity Sync to keep them all in sync. DNS1 is the source of truth, and currently all of the other DNS servers pull from it. What that means is I only have to configure things in one place, it's the source of truth, and everybody's happy. I might switch this to a push configuration, which I like a little better because then the trusted side pushes to the untrusted side, or I might end up choosing a different DNS server altogether. For now I'm going to keep it as is.

If we double-click into that diagram, here is my Proxmox cluster: the first Proxmox server, the second, and the third. As you can see, I already have some virtual machines running here. DNS1 is the one we just talked about. GHARUNNER1 is a GitHub Actions runner that runs my CI and CD jobs for my Ansible K3s GitHub repo, so every time an approved person opens a pull request, it tests that K3s Ansible playbook, builds up all kinds of clusters, tears them down, and makes sure the tests pass (I'll sketch what a job targeting that runner might look like at the end of this section). Then I have k8s-public-01 and k8s-public-worker-01. As you may have guessed, these are Kubernetes nodes: the first is the etcd and control plane node for this cluster, and the second is the worker. You can see I have a naming convention here, 1 1 1, 2 2 2, and 3 3 3. I do that on purpose so I know which VMs are on which node, which helps me a little, but it's also to spread them across the three servers to make things more resilient and HA. We'll talk about load balancers and things like that in a bit.

Then I have a Rancher server, and as you can see I have Rancher 1, 2, and 3. Yes, I still use Rancher; I'm actually managing five Kubernetes clusters with it right now, because I'm in the middle of migrating my home Rancher and my home cluster to this new Rancher instance and then creating downstream clusters for everything. It's kind of complicated, but it's actually really fun to do. So currently these are the machines running in the co-location.

Something that made this super simple was connecting this Proxmox cluster, through the site-to-site VPN, to my NAS, which holds my backups over NFS. If I look at the backups on one of these nodes, once it loads, you can see I have lots and lots of backups. What I'm doing temporarily is connecting both the public cluster and my home private Proxmox cluster to the same NFS share, so that I can back VMs up at home and then restore them here in the co-location. Let's look at the diagram really quickly, because this might make a little more sense: these nodes all go through the site-to-site VPN to my NAS right down here, they're backing up over NFS, and, like I mentioned, they're backing up to the same share.
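Here's that sketch I mentioned: a minimal, hypothetical GitHub Actions workflow that would land on a self-hosted runner like GHARUNNER1. The workflow name, runner label, and playbook command are assumptions for illustration, not the repo's actual pipeline.

```yaml
# Hypothetical CI workflow for an Ansible K3s repo.
# Job name, runner label, and the playbook invocation are assumptions.
name: test-k3s-ansible
on:
  pull_request:            # run when a pull request is opened or updated

jobs:
  test-playbook:
    runs-on: self-hosted   # lands on the GHARUNNER1 VM at the colo
    steps:
      - uses: actions/checkout@v4
      - name: Dry-run the playbook against a sample inventory
        run: ansible-playbook site.yml -i inventory/sample/hosts.ini --check
```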
Right now both clusters back up to the same share; I'm going to call it share A. In the future, I want to break these out into share A and share B over NFS: the public ones go to A and the private ones go to B, because I want to separate them just in case someone is able to get into those Proxmox nodes. If they are, and they can reach the backups, I want them to only reach the backups of those servers and not the backups of my private workloads. So TBD, I still have to do that. Not TBD, to do. How about that? Also, most of this stuff will be done by the time you see this video. I'd like to be just as forthcoming with all of my architecture as I've always been about what I run and how I run it, but some of these things sound like loopholes in my security, so just know that some of what I'm talking about will already be in place by the time you see this video.

Since we're back on this diagram and I've talked a little about the services I'm keeping at home versus the services I'm putting in the co-location: my plan so far is separating public and home, but giving public a way to get home for some of these services. I don't want to build a NAS and put it in the cloud, or have NFS in the cloud, or the same with SMB if I need it there, or the same with my object storage with MinIO or S3. So I decided that for me it's a little easier, maybe not the right choice, to configure a firewall rule that allows these devices to back up to my NAS like this. That's also kind of nice for disaster recovery, because let's say those backups lived at the co-location: if that site went down, I wouldn't be able to get to them or restore anything. Even though there is a slight risk, I decided that backing up off-site to my NAS is a little better, or at least easier to manage. I could be wrong; let me know if I'm wrong. Another way to solve it is to put that stuff in a third cloud somewhere else and have the colo pull it down from there. That might be a better way to do it, but I don't have that third site.

Here are some of the workloads that are running there at the colo. I decided to try out RKE2 in my co-location while I still run K3s at home, which lets me dabble in both. I'm kind of liking RKE2 because it's already hardened according to some government standards and it's a lot closer to upstream Kubernetes. Not that I've had any problems with K3s and it not being close to upstream Kubernetes, but that's one of the selling points of RKE2, and now that I have two private clouds, I figured I'd try it there. Calico came as the default CNI with RKE2, and I just left the default because I really didn't want to run into unknown bugs trying something else (there's a small config sketch at the end of this bit). That's different at home, and we'll get into that, because at home I am going to try other things.

I also run my MySQL database there, plus another database that should be in this diagram, MongoDB. I don't see it; maybe you do. Anyway, pretend MongoDB is up here, and pretend Postgres is up here too, because I have no idea where it is. But I do run my databases there, and I run them clustered, and I decided that instead of putting them at home and having them traverse this site-to-site VPN all the time, putting them there was probably a better choice.
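Speaking of RKE2 at the colo, here's that minimal sketch: a hypothetical server-node config that pins the Calico CNI explicitly. The token, hostnames, and port are placeholders and assumptions, not my actual values.

```yaml
# /etc/rancher/rke2/config.yaml on the first server node (sketch only;
# the token and hostnames are placeholders, not real values).
token: <shared-cluster-token>
cni: calico                       # the Calico CNI mentioned above
tls-san:
  - k8s-public-01.example.com     # hypothetical DNS name for the API

# Additional server and agent nodes would point at the first node instead:
# server: https://k8s-public-01.example.com:9345
```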
My reasoning for keeping the databases at the colo isn't so much latency or anything like that; I don't mind the latency for the database, things would just take half a millisecond longer. What I do mind is, going back to DR, that I feel like this colo should be, for the most part, self-sustaining: if I lost the internet at home, didn't pay my bill, the UDM broke, someone cut the line, this site can stand on its own two legs with the databases there. I don't know if that's a good idea; let me know, throw it in the comments. It's up to you to decide.

I'm also running Jekyll for my documentation, cert-manager for certificates, and Longhorn there. I talked about the GitHub Actions runners; there's also Shlink for my redirects and short links, my documentation site and other websites, Nginx serving the documentation site, and my GitLab runners are running there too, building code. And I'm running Flux, so I'm doing GitOps again, and I'm also keeping a lot of my custom code there: some of the APIs and bots that I run all over the place are now hosted in my colo instead of at home.

Going back to Flux: I do use GitOps, and you've probably heard me talk about it before (if you haven't, I have a video explaining what it is, why it's cool, and why it's awesome). Flux really made this migration super simple. As I mentioned in that last video, I wasn't going to back up my current cluster and restore it to the colo; I was going to make some architecture changes like you saw there. That meant I had to build a new cluster there and then migrate some of my workloads to it, and Flux made that super easy. Let me show you why, and why GitOps is awesome. It's awesome because my cluster is defined in code, and I know that sounds intimidating, but the more you do it, the more awesome it is, and it really helped when moving some of these workloads to my co-location.

In my public cluster, you can see some of these apps, and the migration was as simple as copying all of these folders and pasting them into a new cluster. For instance, if I wanted to deploy all of those apps to a new cluster, first I'd create the folder, and then I'd paste in all of the folders and files that define these workloads: the ingress, a Helm release, a cluster for my database. If I commit this and push it up, within a couple of minutes the new cluster will have all of these applications running. That helped tremendously, because what I ended up doing was exactly that: I went into my cluster-01, which was my home cluster (you can see I only have home things there now), copied all of those apps, and pasted them into my public-01, where they now are.

Now, there is a caveat to that. I said it was as simple as copying and pasting, but there were a couple of things I had to have in place first, which brings me to my storage: Longhorn. This is my Longhorn instance at home, and as you can see, these volumes are all detached. They were attached to Kubernetes workloads, but I've moved those workloads off, so those containers are no longer attached to these volumes. What I did before that was back up all of these volumes, and you can see they all have backups that ran three or four days ago, or 15 minutes ago. There should be a lot more, but I also disabled this job. I back these up to object storage, S3 or MinIO, whatever you want to use.
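While we're on the topic of everything being defined in code, here's a minimal, hypothetical Flux Kustomization pointing at one of those per-cluster app folders. The path and names are assumptions that mirror the "copy the apps folder into the new cluster" flow, not my exact repository layout.

```yaml
# Hypothetical Flux Kustomization -- the path and names are assumptions.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/k8s-public-01/apps   # folder the app manifests were pasted into
  prune: true                           # remove anything deleted from Git
  sourceRef:
    kind: GitRepository
    name: flux-system
```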
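And for the backup target itself, this is a rough sketch of the Longhorn Helm values that would point it at an S3-compatible endpoint like MinIO on a NAS. The bucket, region, and secret name are placeholders I'm assuming, not the actual configuration.

```yaml
# Hypothetical Longhorn Helm values -- bucket, region, and secret name
# are placeholders for a MinIO endpoint running on the NAS.
defaultSettings:
  backupTarget: s3://longhorn-backups@us-east-1/
  backupTargetCredentialSecret: minio-credentials
# The referenced secret in the longhorn-system namespace would carry
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ENDPOINTS (the MinIO URL).
```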
Then, in my new cluster, which is right here, I went into Longhorn's backups and restored them. Similar to how I backed up my virtual machines to NFS and then restored them in the new Proxmox cluster, I did the same thing but with object storage. On my NAS I have object storage running, and from my old home cluster I backed up all of those Longhorn volumes to it. Then, after getting Longhorn installed on the new cluster, I went into the Longhorn UI (I could have done it through GitOps too) and restored all of those volumes. After that, I could copy and paste all of those applications, and they would spin up and attach to those volumes. That saved me a ton; if I didn't have those defined in code, I'd probably still be clicking buttons and trying to figure out how it all fits together. Another reason I like GitOps, and I'll "Git off" this after one last point, is that it's also my documentation. I never have to worry or wonder how things are configured; I can always look here to see, and I can compare it to other people's setups if I like. Anyway, I said I was going to "Git off" of GitOps, so let's do that.

Just backing up a little on how Rancher got there: I built a new Rancher cluster at the colo and created new downstream clusters from it. That's the cool thing about Rancher: you can spin up a Rancher management cluster and then easily create downstream clusters from there. And that's exactly what I did, so let me show you. In Rancher, you can see I have my Rancher management cluster, where nothing is running but Rancher, and then my public cluster and now my home cluster. I decided to put my Rancher instance at the colo and let it manage all of my clusters no matter where they are, because I didn't want to host Rancher at home anymore. Remember, I want to cut off the firewall; I don't want to port forward anymore. So I figured running Rancher in my private cloud at the colo, with a static IP and proper DNS, was probably a better choice than hosting it at home.

So that's what I did. I created a new cluster, chose custom nodes, chose which distribution of Kubernetes I wanted, chose the CNI I wanted, chose whether or not to include the Nginx ingress, and all of the other options. Then I created the cluster, and it gives you a curl command to run on all of your machines, and that's how the cluster gets built. That's what I did on these machines at home. As you can see, my home cluster isn't doing too well right now: the Storinator is off because I moved all those workloads off, some machines are shut down, and some are still running. I'm now running k8s-home-01, k8s-home-02, and k8s-home-03, which are the etcd and control plane nodes for controlling and managing Kubernetes, and then home-worker-01, home-worker-02, and home-worker-03, and these are spread out across the three physical nodes. That's how I spread out the load and make things a little more highly available than if I had one server. There are caveats to that, and I've talked about them a lot, but for the most part it's a bit more highly available than it would be otherwise. So again, to illustrate that, Rancher would be here, and these clusters all communicate with it.
All of these nodes communicate with this Rancher management cluster, and all of these nodes at home do too. That lets me manage three clusters right now with one instance of Rancher. I still have my old instance of Rancher that I'm moving some workloads off of; once I'm done, I'll shut that one down, so I'm mid-flight right now. But let me know: is that a smart thing to do? Should I have kept my Rancher servers here at home and figured out something different? Or is keeping them somewhere public and letting my nodes connect to them a better architecture? You tell me.

Now, diving into my home environment: as you saw, we freed up some compute at home on all of those servers. Some nodes were off, a lot of those are going to be deleted, and at home I'm just going to run a few virtual machines. Yes, I still have a HomeLab at home. Think of my colo as giving those servers new life, because they were off for four months. Did I move everything to the colo? No, I gave servers that were off, and that I was probably going to sell, a new life and put them there. I feel like I'm explaining that to a lot of people, but I still have a HomeLab. Absolutely I do.

So what am I going to run in my HomeLab? That's what I want to talk about. At home I'm still running my DNS servers, and I have my home cluster, which we just talked about. Here's the capacity I freed up, right here, and I just didn't know what to put there, so I put in a playground, because that's really what my HomeLab has always been about. I did run some public services for my documentation and the like, and self-hosted a lot of stuff for the community, but I also had a playground space. This gives me more playground space, which I'm super excited about.

So what am I going to run in that playground? Well, I already chose RKE2 at home, but that's just for now: I think I'm going to switch these to K3s, and I think I might try a different CNI, Cilium instead of Calico, at home. I run Traefik at home, which is pretty cool and I know it really well, but I'm thinking about experimenting with the Nginx Ingress Controller. The same goes for Longhorn: I've been running Longhorn at home, and I'm thinking, hey, I have this playground, why not play around with Rook Ceph? There are things I'm still going to run at home: my DNS, Home Assistant, Scrypted, some custom code, and a lot of other stuff. I'm still going to run Flux, but while I'm trying new things, who knows, maybe I'll try out Argo CD at home. So there are tons of possibilities I didn't have otherwise, because I didn't have a lot of compute left. The moral of the story is that I might try a lot of different stuff at home, and it might not be a mirror of the architecture that's now at the colo, which is pretty tried and true and which I think I'm going to stick with for a long time. I'm going to do more experimentation at home.
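As a rough sketch of what that playground swap could look like, here's a hypothetical K3s server config that turns off the bundled Flannel and Traefik so Cilium and ingress-nginx can be installed instead. This is an assumption about how it might be set up, not something shown in the video.

```yaml
# /etc/rancher/k3s/config.yaml -- hypothetical playground setup at home.
flannel-backend: "none"        # hand the CNI over to Cilium
disable-network-policy: true   # let Cilium enforce NetworkPolicy instead
disable:
  - traefik                    # leave room to experiment with ingress-nginx
```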
And I know you probably see this Tailscale icon. This was a question I posed in the last video: should I use a site-to-site VPN, or should I use Tailscale? I think the answer is yes. I am going to keep the site-to-site VPN I already created, mainly because I have it working, and the firewall rules, which were complicated at first, are figured out now. But I think Tailscale still has a place here, and I might do some fancy stuff with it: run it as an exit node up here at the colo so that I could, say, open up some services on my NAS and have them exit there. Or, who knows, maybe I could install Tailscale on my NAS and on these Proxmox servers so they can reach my NAS for NFS, SMB, and object storage without this site-to-site VPN rule. I'm not sure. Or it could be something even crazier: what if I moved all of the etcd nodes, the ones that hold the Kubernetes information and all the secrets, out of the colo and back home, and then used Tailscale to automatically connect them to these workers? I could configure it a bunch of different ways, on Proxmox, within a virtual machine, or even in Kubernetes itself, and then expose only a certain number of services from here to there through Tailscale.

So as you can see, I still have a lot to figure out, and I would love your opinion. If you see mistakes I made along the way, let me know in the comments below and I'll be sure to address them, maybe in the next video. I think I'm going to continue this series where you help me build out this architecture in my co-location, and who knows, maybe in the future I'll even open source all of my architecture so that you could open a pull request, CI/CD would run, and those changes would land in near real time. I'm so excited for all of the possibilities. I learned so much about site-to-site VPNs and migrating services between private clouds, and I hope you learned something too. And remember, if you found anything in this video helpful, don't forget to like and subscribe. Thanks for watching.
Info
Channel: Techno Tim
Views: 49,324
Keywords: techno tim, technotim, homelab, home lab, rack, data center, data centre, datacenter, colocation, colocate, colocating, planning, networking, racking servers, UniFi, UDM, LAN, WAN, firewall, ubiquiti, ping time, fiber, 511, 511 building, msp, isp, minneapolis, colo, architecture, infrastructure, architecture review, self-hosting, selfhost, rancher, udm, site magic, sd wan, site to site, vpn, proxmox, security, dns, backups, clustering, virtualization, vlan, virtual network, kubernetes
Id: cqFEP4sCkU8
Length: 27min 42sec (1662 seconds)
Published: Fri Apr 05 2024