Klustered #19 | Rawkode Live

Captions
[Music] Hello and welcome to Rawkode Live on the Rawkode Academy. Today is Klustered. We have a couple of broken clusters and a couple of great guests joining us today as we do some live debugging to get these clusters back online and working. We have a Klustered application written in Rust, which has a small database, that we'll do our best to fix.

Now, a little bit of housekeeping before we get started. Please subscribe to the channel, and like, comment, and share the videos as well. This helps other people find them, it keeps the YouTube algorithms happy, and it just means that more people get to enjoy and learn together. We also have some membership packages available on the Academy that allow you to support the channel, so please feel free to check them out. And if you have any questions, then you can jump into the Discord and ask them; I'm more than happy to tell you what we are doing. The Discord is available at rawkode.chat. There's nearly 600 of us in there now, talking all things cloud native, Kubernetes, eBPF, and everything in between, so come and say hello, and I look forward to meeting you.

All right, last thing: our sponsor. Teleport have been sponsoring us now for the last two months, and I gotta thank them a whole lot. It's the easiest decision I ever made when it came to Klustered, because we use Teleport every single week. We've been using Teleport since the first episode. It allows us to secure access to the clusters as well as share and pair in the same session, which you're going to see us doing a whole lot today. It also allows you to expose those applications securely through the proxy, which you'll also see in action. So if you want to support Klustered and Rawkode Live, check out rawkode.live/teleport. It's a UTM link, but I would really appreciate it. And of course, go install Teleport on all your clusters; you'll thank me.

All right, let's jump over and say hello to today's guests. I'm joined by Barco and Matt. Hey guys, how are you? Good. Nervous? No, of course not.
These are pros, these are pros. All right, no nerves on both sides, I think. All right, could you both do us a favor and introduce yourselves? We'll start with Barco there and then move over to Matt.

Hi, my name is Barco. I'm a big fan of the Klustered series here. I think it's a great resource for people learning Kubernetes or preparing for the CKA, so I'm really humbled to, uh, participate. I know I've learned a lot through the series; I hope others find value here as well. Thank you very much.

Yeah, hey, I'm Matt Turner. Uh, some kind of thing; done a fair amount of Kubernetes, I guess. I got my CKA, which was quite a stressful thing. I think this is going to be worse. Um, I haven't watched any of these because I didn't want to, uh, cheat. I won't win. It's a difficult time zone for me, and I didn't want to end up just, you know, copying anybody else's breakages by accident. It's hard to get them out of your head. So yeah, I'm completely new to this, but we'll see.

All right, don't worry about it, it's all in good fun. So I see Russell has a beer; enjoy. And Pop has joined us. Barco, he's putting a lot on your shoulders: he wants you to make me cry. I'm just... he promised some Tim Hortons coffee, so, uh, I'm gonna try not to disappoint. All right, perfect.

Well, I have Matt's cluster in front of me, so we are gonna jump over to our screen share. I am going to open a session onto the control plane node, and Barco, I'm going to ask you to join and give us an echo hello to let us know that you're here and we're in the same terminal. Then we will do our best to get this application running as quickly as possible. Let me start our timer. Uh, feel free to set up the kubeconfig, and we will check if we have a control plane, which is usually a pretty good first step.

So my, uh... I don't know how much I should tell you, whether I should just shut up. My worry here, to be honest: I haven't broken it hard enough for, uh, for two pros like you. I wasn't sure how
close it would go. This might be very quick, but then we get to move on to Barco's super hard, super broken cluster. And it's okay, yeah, the quicker it is, I mean, the sooner I get my dinner. Usually I mean dinner at 8 pm on a Thursday, so, you know, if I can get it earlier, I'm not going to complain.

All right. Because I've got a list for the end of, like, four or five ways I tried to break it that didn't work. Um, because I sort of, you know, came up with them from first principles, but Kubernetes is too resilient these days and they actually didn't have the effect I expected. So that was quite interesting.

All right, well, it looks like we do not have a control plane. Um, that's the correct IP, I guess? Uh, it definitely looks correct. I mean, I can't say for sure, but I would argue that yes, that is probably the correct IP, and the port seems okay. Would you like me to confirm somehow? Okay, well, we don't have a server running. Yeah.

Okay, so, um, just a nice reminder in the chat from Kevin that there are only three rules: don't break Teleport, don't play with etcd, and no eBPF. Well, two are unofficial, rage-induced by me on previous episodes.

Okay, so [Music] I guess kube controller seems to have been played around with, but, uh, um, so we have no API server running, and you've jumped straight into the static manifests, right? Yeah, I'm just going to see if anything seems off here. [Music] Anything jump out at you here? No. Um, that looks like a perfectly normal static manifest for the API server; I'm pretty happy with that. Okay. Um, I guess we can check, uh, systemd. Yeah, let's confirm if we have a kubelet.

Russell's also saying there's a UTF rule. Yeah, that is a real rule. We have not invited [inaudible] back after episode two, nor will they come back.

I see an API server there. Kubelet, yeah. Maybe it's just flapping. Yeah. Okay, do we want to check out the logs, perhaps? Interesting. Okay, so it seems to be restarting. Um, okay. I've known Matt for a wee while now; I
would not be surprised if there's just a cron doing a kill -9 every three seconds. Yeah, a little, you know... No, no UTF-8. [Music] Simple but effective, right?

Wait, we don't have /var/log? /var/log is okay. Um, ah, etcd. Okay, there's always a panic when I don't see etcd running. It definitely shouldn't have lost its data; I haven't done anything like that. When you get it back, you should get [inaudible] in place.

Uh, yeah, I'm gonna have to find a way to disable that, but there's always the remembered Vim cursor every time we open a file, and it starts happening. Yeah, I, uh, I killed my bash history, but I forgot about that. Damn. Every week we see it; you're like, oh yeah, I didn't think of that. Obviously I'm not a hacker, am I? I cleaned up my bash history. Oh, good.

What do we have? Um, I mean, this doesn't look right. Yeah, with that 1M, etcd is never going to be able to start, that's for sure. I can see a smirk on your face, Matt. It wasn't, uh, actually crashing when I left it. I wanted to just squeeze it to just the point where it was timing out. I mean, I'd just remove the limits, to be honest; we don't need them. Never follow that advice in production, if you're watching, but, you know, for the sake of Klustered I am happy for us to delete a little bit.

All right, we may just want to give it 30 seconds, make sure it is still running, make sure the API server is up. It's always fun to see which flags people use with ps; everybody's got their own little unique take on it. Oh, come on, Russell: six minutes 30 seconds and there's already a Rawkode smash of just deleting stuff from manifests. I do like to delete stuff.

etcd dropped again, right? Um, well, the API server is not starting up, but that's... oh, did you... is that [inaudible], right? Um, well, I assumed it was, but, um, it seems to be running. Okay, so let's check the logs again. Yeah, yeah, multiple ones. That's from five minutes ago, but I think that's the other log. Um, okay. I mean, kubelet's running, etcd is
running. Oh, everybody, there we go. That was, uh... there we go. Okay, so we have one worker that's now working, that's good. Um, but we have a control plane that seems to be working and one worker that is. Okay, let's check a few things. Um, already checking for malicious policies? I mean, I would maybe start with a get pods. I'm just throwing that out there. Maybe it's running; maybe Matt forgot to break it. I guess I'm just, uh... You're just assuming the worst, right? Yeah, yeah, really just kind of checking what's next.

It's kind of difficult to know what to do, because I could have just rm'd the disk, right? But that doesn't seem to be in the spirit of the rules. So, uh, it's against the rules. What's against the rules? Deleting the disk, because that would kill Teleport. Oh, yeah. You know, I could have done something very dramatic, I guess. I mean, I'm not doing very well here.

He's trying to delete all the pods. I don't need to say... I know that it's just a complete mistake. I'm, like, thinking I have to smash everything. Um, okay, so we can ignore Ambassador, mostly. We can see that our Klustered pod has completed, which is not a good sign, and our Postgres is stuck in ContainerCreating, and we've got some Cilium flaps too. Oh, and our controller managers aren't running either. Okay. Okay, let's, uh... controller manager. Yeah, I think the controller manager is quite important; we need that running for everything else to reconcile.

Oh, you spotted it. I'd hoped you'd get to the point of running the update, changing the deployment, and then being like, where are my pods? Where are my pods? Yes, 54 clusters in at this point, you think I'm going to miss that? No. I'm sure I'm not the most creative person.

So that looks okay, and when we ran the get pods it was actually a status of Pending, and Pending really only means one thing, especially for a static pod, which doesn't have access to a lot of the Kubernetes primitives. Did we have a scheduler running? We do. Um, yeah. Yes, we may want to describe
that controller manager then and see. Oh wait, it's running there. What's, uh... 11 seconds. That wasn't me. A lot of things are restarting, so I wonder if there is something that's causing them all to restart. Yeah: 'old forgotten LimitRange, only known by OpenShift operators.' Yes. Can you run get pods all again? I'm just curious if we have more things restarted. Okay, that controller manager does appear to be all right. Was it just maybe etcd? Like, I don't know, the fact that was down. So yeah, it could be etcd related. So let's assume it's okay just now, which means we probably want to focus on our Klustered and Postgres applications.

So, I mean, our Klustered, and what's this web 6x? Uh, I think that's a decoy. Um, what do you mean by... like, I think Matt has just deployed them to confuse us.

Russell asks how can the Klustered pod be complete. Um, so it generally means that the process exited zero, so it's either not our image, it's been swapped out... yeah, hard to tell right now.

Was that the web one you described there? Yeah, yeah. Okay, so that's just an nginx pod, I think. I would suggest just doing the big delete button on it. Nothing good ever came from random workloads in my cluster. I mean, it's, uh... okay, I'm gonna delete it, but I'm wondering what spawned it. Well, the pod name was interesting, right? It wasn't a pod tuple; it's more of a DaemonSet tuple. I think he's just deployed a DaemonSet called web. Yeah, there we go. Okay, let's delete that. Okay, Matt, you look sad about that. That was my Tumblr.

Okay, I mean, we don't see any more restarts, I don't think. So, um, let's look at, uh... Kevin's on there with 'maybe the image has been changed.' I think we're thinking that too. Okay, so, uh, I guess edit sts, or was it... Okay, Barco flies through this. I could just go have my dinner and come back and you'll have it fixed. I don't know, I'm just... I'm scared of that, this huge comment before.

All right, well, we're just about halfway through our allotted time, so you've got just
over 20 minutes. You're doing well, you're doing very well. Um, so I don't... am I missing something? I am not seeing here requests or limits. Did they just pass through them? You are correct, there's nothing in that manifest, for certain. So what else could be modifying them? Um, another common thing: some sort of admission webhooks or something. Yeah, check the mutating webhook configuration. [Music] They're a breaker favorite on Klustered. Yeah, they are, absolutely, Russell. Yeah, I know. Okay, so there's not a mutating webhook. So [Music] I'm not a terrible person.

So what are you looking for now, Barco? Can't you, uh, specify static webhooks, like, here? Oh yes, you can. So, but there's only the enable-admission-plugins, so it's only NodeRestriction, which is okay. Yeah, I don't see anything. So I'm just thinking now... let's take a look at that error again. Uh, what was the error? And you checked for quotas and limit ranges, didn't you, and there was nothing? Yeah. Okay, so can we see the request? Let me scroll up. Um, actually, the node was low on resource ephemeral-storage. Um, this may be something on one of the worker nodes instead. Yeah.

Okay, so let's get... let's see something. Do you want me to open a new session, or do you think you can do it from the control plane? One sec. So, um, okay, so it's scheduled on worker one. Worker one is Ready. Um, what if we just, like, delete it and see if it gets rescheduled to worker two? It's gonna work? Yeah, go for it. I mean, you could always edit... oh, we can come to that later if we need to. Let's do it this way first.

Kevin says one of the workers was not really ready; maybe it was scheduled there. It appears Matt's worker has fixed itself; I'm not sure how. No, I'm not sure how. They're surprisingly, uh, resilient things. Oh, is it gonna... uh, what state was it in? Oh, okay, deleted. Okay, Naveen's pointed out the error on the resource, which we're looking into now.

I was really tempted to break the finalizers, just to, uh, just to really annoy you when
you're trying to delete things, but I didn't get around to it. Sorry. I've got a nice alias for that now called nuke from orbit, so it patches out all the finalizers.

Um, let's edit the deployment, Barco, and just add nodeName to the spec. Well, well, it's running now. What is it? Yeah, I'm not sure why, but it seems to be okay. So, um, all right, let's describe our Klustered, yeah, and see what happens. Okay, go for it, why not. The deleting fixes everything. I know my chat are laughing at us, but it does work. I mean, we're gonna have to find out what the breaks were, because I feel like we're not really fixing some of these issues. Like, I don't understand what fixed the... yeah, we'll see what happens.

Running. Okay. Um, right, so we're gonna upgrade it to version two. Yep. Okay, I think you're gonna have to describe that. Oh, it just... I'm guessing the one that's running is the old one, huh? Unscheduled, with no events. I'm a little confused; I'm trying to understand what's happening here. Okay, so we have one running. Uh, is our scheduler flapping again? Yeah, we'll take a look. So this is version one. Um, no, we do have a scheduler. What is happening with the resources, I think... and they're... I wonder if I just do this. It came back as well. I assume this should be v2, I hope. Oh no, no, because the replica set for the old one is still active.

So we have an unhealthy new replica set with v2, which can't be scheduled, so the old replica set is continuing to schedule pods. Now, our new replica set, and the pods created from it, are currently Pending. Now, that usually means that the scheduler hasn't assigned a node, or a volume can't be mounted, or there are resource constraints. Normally we would see something in the events for that; we are not. My intuition is telling me... interesting. And it's maybe look at the scheduler manifest; I don't know if we did that yet. Um, oh, it started at the top, so I trust that he's not modified it. Maybe that first one was a red herring. Maybe I am an elite hacker. No. Our mission is... I mean, we don't
really need to fix anything that Matt's done. Like, we can bypass the scheduler, which I'm tempted for us to do just to see if it works, and then we can always try and work out what the issue is. So, as you recall, Barco... Um, I'm just trying to think. It seems like something's happening with the resources. I'm trying to think, what are the ways you can manipulate resources, either available to Kubernetes from a node or... but, like, we checked. Did you describe worker one? Like, describe the node from the API server? I don't know if we did that. Um, let's check this and let's look at some of the logs, and then if we don't find something quickly there, we'll, uh, we'll try it, um, I guess, your way of just bypassing the scheduler.

Um, yeah, we may have a disk being filled somewhere. Um, and if I remember correctly, at the start worker one was not ready. So [inaudible], what are you looking at here? Uh, so we've got a warning about the disk space, although it is a couple of hours old. Oh, okay. But it does seem to be okay as of 19 minutes ago, because that was a not-ready node and now it is ready, which is particularly strange. Um, but that was the worker, not the control plane. Maybe we should jump onto worker one and look around. Um, yeah, let's do that.

Okay, I've opened the session; feel free just to jump in. We've got 10 minutes. I say we snoop around for five more and then we bypass the scheduler, just because it will kill me if I see Matt win. [inaudible] So if you run a df, does the disk look okay? It does indeed. And you want to add a -i just to check the inodes? Yeah, it's fine. Okay, I'm assuming you had something running on that node at some point, Matt, that was doing something nefarious with the disk, right?

Yeah, yeah, yeah, I'll give it away. That's why, I think, um, access to node two is lost. So I tried to break both nodes. I logged out of two and couldn't get
back in, and realized I broke it a little bit too hard. I still had a shell on node one, so I undid the things. Okay, yeah, so that actually is a red herring, and I actually can't explain why the thing is still having issues. Well, I mean, I did another couple of things on the nodes, but it's not... it's not that. Okay. Um, so I'm confused: are you saying that you're expecting these replica set issues that we're having, that that's a different problem? I can't remember exactly why, uh, what would be added to... um, yeah, I think you are expecting what you're seeing. Yeah.

All right, well, we bypass the scheduler and then work it out, or you really want to work it out? I can tell. Okay. Um, no, I think you can... can you take it over? I'm kind of, like, stuck. So let's edit the Klustered deployment, right, and we'll jump down to the pod spec. Straight out of the CKA: manual scheduling. We're just going to do a nodeName, one two one six five. So you wanna schedule it to a specific, like, different node? Yeah, I'm just going to hard-code it so we don't need the scheduler. Like, I'm assuming that's why. All right, but wasn't it already on worker one? No, it's not scheduled to a node. Okay, I think, uh, I don't know. See, that was ContainerCreating, so there was something... the scheduler wasn't assigning a node to it, and I don't know why, but, like, we can just bypass it and see if we get it working. And then, like, yeah, I know, yes. For some reason I was thinking it wasn't... um, like, there was something wrong with that node, but yeah, I missed that it wasn't being assigned. Yeah, of course. I mean, I have no idea why the scheduler wasn't doing that. Um, but it was for the old replica set. It's really... like, yeah, I'm not entirely sure what happened.

Shall we test our Klustered application? Yeah. All right, so we cannot speak to Postgres, so we're not done yet, and we have seven minutes to go. Um, I wonder which one of my hacks did that, because this was all done in, I'm not gonna lie, quite a compressed time
frame. I threw a couple of things at the wall; I'm glad one of them worked. Um, sorry, the connection string is hard-coded here? Yeah, it just tries to speak to the service postgres, so you'll want to make sure that we have that service, that it has endpoints, et cetera. Just, you know, basics. And so we have postgres, we do have an endpoint, yeah. And then, so that's, um, that looks okay, right? Yep.

Um, you've just been paged, it's 3 a.m., we're losing a million dollars a minute. Okay, so the error message is it can't resolve the host name, so it may be DNS. In fact, the chat is... everybody's saying DNS. Russell: 'it's always DNS.' Naveen: 'almost DNS.' Kevin: 'even if it isn't DNS, there's always...' Why is it always DNS, though? Um, because it sucks and you should all use Istio. I've tried it. Okay, I can't get it installed. Yeah, yeah, you're probably driving up RAM.

Is CoreDNS running? Now, if you'd given me a cluster and we could have broken some things... I'm, uh, I'm fairly proud of this one. That's correct, right? Yes, so we see ready, we see kubernetes with cluster.local. I believe that looks all right. We do have two CoreDNS pods running. Um, suggestions? I don't know, could be something on the node. So my suggestion would be, let's get inside, uh, Klustered, get some debugging tools, and see what the hell we can work out. Uh, bash, yeah. And now, every week I need to try and remember the name of the package. I think it's dnsutils, no space. Oh, for dig? Yeah, I always forget. Yeah, I want to dig. Yeah, it's different on every distro; I always forget. There's, um, commandnotfound.com, which is really useful: you just put in the name of the CLI tool you want, and for every distro it gives you the package name, and, like, for BSD and Darwin as well. It's super useful.

Uh, okay, so what dig is telling us here... oh, we're not using cluster DNS. This is the Equinix Metal DNS resolver. Uh, what's the IP address? Oh, we have it up here, right. I'm gonna, uh, svc, uh, [inaudible], yeah, I can't remember. We can try
and force this. Okay, so it did answer us, and that does work, but I don't think our resolv.conf is using it. So he's... super impressed you found that; it's buried in the fourth line from the bottom. Very, very good. Uh, oh, it's the kubelet configuration, right? You want me to keep driving this, Barco, or do you want to? Ah, no, I'm a little, uh, not following actually what happened here. Yes, so I'm going to go on to the worker node, right, and we're going to do a cat of kubelet, and there's a bunch of configuration files we can use, and one of the things you can do with the kubelet is change the DNS configuration, and I can't remember how. Here... I don't know if that's right or not. Um, cluster DNS is here. What I thought he'd changed, he hadn't changed, but I feel like I'm close, I guess. I mean, we can check /var/lib, um, kubelet config, maybe there. I mean, he can't just do that, really. Excuse my language. Uh, yeah, there we go: this dnsPolicy of Default. I don't know if that's correct, but I feel like it should be ClusterFirst, and this is if I made that up. 30 seconds. Super impressed. [inaudible] won't have taken effect. Super, super impressive.

Yeah, the pod had rescheduled; the old one was still terminating, but it was enough for the service to run. So you changed the policy from ClusterFirst to Default, right? Yeah, because the default... because this bit me once and I lost, like, a day to it: the default is not Default. I thought that would pass the eyeball test. I thought you'd read that and think Default was what it was meant to be set to. You'd think Default would be the default, but it's not, it's ClusterFirst. Yeah. Super impressive.

One of the reasons they weren't scheduling is because the scheduler is called default-scheduler, and I overrode the scheduler to default, so it would have been trying to use a non-existent scheduler. That's not the only reason either, but there we go. Did you do anything with the resources, or I was
like, completely off there then? Yeah, that was maybe a bit of a red herring. So I tried to, uh... what I did: fill the disk on both workers to, um, to get that disk pressure. Uh, this cluster doesn't have [inaudible] turned on, so I thought it would fill the disk, it would get the DiskPressure condition, which it did, and then taint the nodes and evict everything. That didn't work, because kubeadm must not do that by default. Uh, but I did it by dumping a load of stuff into a file, and on the worker nodes, if you run tmux attach, you'll find that hidden in the background there's just, like, a Python interpreter holding the file open. So I made a massive file, opened it, and then deleted the file, so if you use du you will not find the file that's filling the disk. Um, the problem was, because I filled it right up to the brim, uh, you couldn't fork a new shell, so I broke SSH access. Like I said, I thought that was unfair, so I managed to undo it on one of them, but not the other, because I was already logged out.

Um, okay, well, it's, uh, great. There's also an anti-affinity between the Klustered pods, which I think is another reason it wasn't scheduling. Based on, like, a global one. So you can use any label as the topology key, and you're meant to use region or zone; I just used, uh, OS. So because they're all Linux, right, it tries to bring the new one up before it turns the old one off, and it can't do it, because there's no applicable nodes to put it on, because they're all running the same OS.

All right, well, well played, Matt, but, uh, we managed to get it fixed. Solid team effort from Barco and I. So we're gonna swap places now. Uh, so let me jump back over to our screen share. I have, uh, Barco's cluster here. Matt, please join, um, the session that I'm about to open. There we go. Uh, just echo hello, let me know that you're there. Barco, feel free to sit back and relax and enjoy. Awesome, thanks, it's fun.

Uh, my invite code's expired, David, if I could get another one. You're
supposed to register before the episode, Matt, come on. Wait, this is valid for four hours; I only got the link two hours ago. Well, yeah, it should be good then. Yeah, the email... I never... doesn't matter. Oh, did I have to register for Barco's cluster as well? Okay, I thought I'd registered once and I could just connect, so I didn't... yeah, I didn't know. Well, I guess this is a nice opportunity to show everyone how easy it is to add new users to Teleport, right? So, uh: users add, roles admin, logins, mt. Let's hope you get to this before anyone in the audience does; otherwise they're pairing with me and you get to go home. No, I stuck the link in our private chat. Just to pop that open, and I'll just clear that. There we go. The chat on... yeah. Oh, that's the button. Yeah, I should have started the timer when he told me the link was broken, shouldn't I have?

Let's see what we've got in the chat. Russell with 'Mission Impossible level timing, nice work.' Yeah, we did cut it far too close. Naveen saying the same, super impressed. Kevin: 'have a celebratory beer.' I wish; I don't actually have any. I will not make that mistake again. No, I should have brought one; would have taken the edge off. 'Klustered guru Rawkode.' Not quite; I consider myself lucky.

All right, managed to register yet? Uh, yeah, I thought I was in. Did you join my active session, or did you open your own? Correct. I'm not going to be very good at this, am I? Yeah, you'll be fine. Let's just get in the same session now. Check out Activity, then Active Sessions. Oh yeah, cool. But I can show the audience that. So you join sessions on Teleport by clicking Activity, Active, and now I can see that Matt opened one. Yes, feel free to join me.

So you're going to have to export a kubeconfig, check for the control plane, and I am going to start our timer. Okay, use whatever command you wish. I do appreciate the k alias, though. Okay, interesting. We're root, it's allegedly readable by its owner, and there's no dot saying there's an SELinux context or any extended
attributes on it. Well, that's interesting, isn't it? I mean, I would try lsattr just in case, but I thought you would get a little thing on... uh, yeah, but okay, it's a fair point. Well, yeah, I'm gonna look it up; I don't remember what the e is. Yeah, do you mind? I haven't used this subsystem for a while. Oh, but we can cat it. Interesting. Can we install packages onto this thing? Oh, tell me what the letters mean. You can always rely on the Arch Linux wiki; that's exactly what we need. First thing: e is extent format, so I think that's actually safe, so I don't think we need to worry about that.

Are you running strace? Yeah. Why don't we see what it's doing. He's obviously doing some low-level Linux nonsense. 'Resource temporarily unavailable.' Yeah, what is... uh, you want to check the mount points? Oh, maybe, yeah. He could have given us, like, a weird mount, a read-only file system over the top. That's odd. Can we see the call to read the environment? Hmm, I can't see the access to the environment variable, or any stat or any open call on the file. So let's do the basics first before we go into the deep. Like, it was put there at a plausible time, it's a plausible size. Yep. Uh, what, has he actually broken it using, like, a read-only file system over the top? That was one of my thoughts too. Yeah. Uh, yeah, it's got remount-ro, but that's errors, right? So that's if there's a file system error; it shouldn't freeze it, it should remount it read-only. So I think that's okay.

I can see Barco laughing at us already. I'm just laughing at the comments: 'this is, uh, full Google interview stuff.' Um, all right, I've got confidence in this. It's not a symlink, it's only got one hard link. Let's try and change the permissions on it to 744 and see if it complains first, right? But the thing is, we can cat it, so something's stopping kubectl. Has it, like, an SELinux... okay, I mean, if there's BPF I'm going to be angry. Like, it's like, it's
like, um, kubectl has got an SELinux context or something. Uh, I don't know too much. I think you're on the right track: it's nothing with the files, um, it's something with the kubectl command. Correct. Are you using eBPF? Uh, I am not, I don't think. Um, okay. I cannot do anything with eBPF. No. You're looking for LD_PRELOADs? Yeah, yeah, obviously. Um, okay, so if it's something to do with the kubectl command... we already validated this, right? Let's check, like... so let's check if it's an alias. Run type, alias. Just our alias. Okay, so, um, could be a function? Uh, I don't know, because they would show up when we did this, right? And I did that to find the file. The file's okay. I'm gonna check anyway. I mean, there's many places it could be, but still. Yeah, I mean, we could always just... let's cheat.

Okay, so it's something getting in the way of kubectl, which looks okay. LD_PRELOAD, um... I'm thinking it's gonna be, like, a [inaudible], like one of the security mechanisms. It's going to be gVisor or something, you know, something that blocks syscalls. Although when I straced it, I never saw the access to admin.conf, so I wonder what... I mean, does it fork, naturally? Uh, you make this thing follow forks: -f. [Music] Ah, there we go. Okay, forked, um, I assume that's natural. EACCES, permission denied. Well, Barco's done something to maybe block... is it just opening a file at that path? So why don't we copy it to somewhere else and see if it works? That's a good point, yeah. Um, [Music] there's a new one there, but I can't see this. There's several PIDs, but I can't see the fork. Oh, because of course it's called clone these days. Uh, I just wonder what on earth is... so where's it exactly? Because something we've seen on a previous episode, from [inaudible], was a BPF bypass that would restrict access to opening certain files on the disk. So I would suggest we just try and copy it somewhere else, and if that doesn't work, we need to start exploring the processes on this machine for something malicious. Yeah, I'll tell you what we'll do,
we'll do that so it gets a new inode number as well um foo.conf that was the best name you come up with what do i call right it remember it's not with that file now i'm curious if we copied kubectl to a different location that would work okay so we can't read anything that's interesting yeah right so yeah kubectl i wonder whether there's an selinux policy that's matching the the loaded file well kevin said in the chat you can't get selinux on ubuntu but someone told me last week you can now okay then it's on up into it and that put the feeling to me wow that that was a works suggestion my part um apparmor there is an apparmor profile for kubectl that doesn't allow you there's nothing in it really it wasn't generated so it's just preventing you from running the command ah sneaky nice okay that's apparmor made the ban list all right let's continue it apparmor and selinux are those things that i really should learn better and have never invested any time whatsoever no normally i mean when i say selinux and people said it literally doesn't exist anymore that shows how much i know but you know one of those linux security modules oh look okay then back to where we were um it seems to be kind of running but they both my cluster was older than that so he's been in here since you did something this all looks good now i don't have any of my aliases um this all looks suspiciously quiet except the network's up and down all over the place uh and wow so it's all of the control plane okay so we don't have an api he's okay the api server's flapping because obviously it's been up at some point because we can we can do this lots of restart should we just try it and see what happens so what's the fix for the apparmor thing then barco how would we fix it so you can um you can just basically there's a command um aa-logprof um sorry aa-genprof or something like that or aa-logprof um that will generate um the the profile apparmor profile for kubectl based on like the last time
it was executed like all the syscalls it needs and it will allow you to do it or you can just disable the profile itself and then it should work nice boot into safe mode we haven't seen apparmor before that's a klustered first yeah i've tried to kind of because i've seen a few episodes i've tried to put in things that i haven't seen before just to help people i guess audience like get exposure to some things i don't know um because i i found this like really useful to learn from like really liking to see how people think and different things they can think of breaking so i tried to put things that i hadn't seen before all right awesome matt we've got angry men and our authentication for postgres failed yeah but you'll notice is that it's v1 angry man there really is no there only has one angry man so this is the v2 image but we're getting an error connecting to the database oh is it i'm sure it's a v1 the the title of that webpage is v1 oh yeah i i i think that's cached there is a caching problem and sometimes i have to reload it 54 times before i get the new video so i don't i don't tend to open v1 anymore that's fair enough okay so we've upgraded and it can't auth to postgres that's interesting isn't it and then and the whole control plane's flapping that's all very odd okay how do you write this out how does the auth work does it get its password from a config map no it's it's hardcoded okay postgres and postgresql123 if i remember correctly but you should be able to take a look at the stateful set and it's just environment variables passed into the postgres image well now our uh now uh i never lost degraded it's a static manifest so you won't find uh you'll still see it as a child of containerd.service right um except it's not there at the moment but okay um yeah let's do a quick quick visual inspection of the why do people never look at where the cursor is when they open the file oh no i forgot i was just whatever it was ah i see some
sneaky stuff okay and there's all kinds of things you could subtly break in here yeah this way barco has added an encryption provider to our api server you might want to take a look in temp ec yeah embarrassing question what does that even do that's the encryption at rest in etcd right like the row encryption i believe so yes now what we have to be careful with here is if we remove this we can't we can't read it in etcd anymore or he could have added it and not actually enabled encryption on etcd in which case this is the reason it's flapping so yeah what we're going to have to do is check the api server logs i think and see if we can get some information before we go all out smash on this what if we hulk smash it and we just lose all of etcd we can just put it back right because we don't need api server access to do it yeah just don't delete the ec dot yaml oh no mustard you know one two three four five seconds nice i wonder if there's gonna be something spicy in there like i tried to put it the the thing is you have to uh limit exactly like 16 or a certain number of characters so you don't have much much room for creativity oh okay i was going to get the logs through through kubectl through containers you'll need to go to var log yeah i'll tell you the restarting is not caused by the encryption okay that's good to know okay containers uh yes then yeah kube-api so where we go no match failed to list config maps unable to transform this looks like an etcd registry configmaps default kube-root-ca no matching prefix found this does look like it's it's looking for a hard coded config map in etcd and the etcd tree has been mutilated go back up is there any precursors and that's just a reboot isn't it shutting down starter stops shutting down shutting down shutting down shutting down okay we want to check i mean etcd is running because that's why we'd see another that couldn't actually etcd is actually not flapping is it well i don't know the only thing is ps api
server is going by yeah yeah one times if it's restarted i guess we should be able to see the time the process started 72 minutes ago right okay yeah right so etcd is okay it does look like it's been [Music] okay so it looks like an important config map is gone we can't use the api server to inspect it uh we could use etcdctl well we do get an api server for a while so we could always wait for it to restart and then we're on a clock but maybe there's a nicer way to do it and he's back yeah get config maps all right um where did i start for config maps and not pods why is it trying to get at the ca for one and not the other i don't understand that a little confused that k get config maps all did not work i would have expected that to work okay yeah i love it and we have to ask don't we wait what was the error when you ran that something about i'll show you uh something about i love how you're joining in to fix this as well uh okay none of those try to get cm again unable to transform key right now yeah that does make sense i think actually yeah because there's a few special config maps right out of the box there's a few of them that are like completely intrinsic and i think you know because you could there's always a service called kubernetes and there's always a config map i think with the cluster ca in and there's always something else and it looks like he's maybe managed to delete it and like the error handling it's just always expected to be there so the error handling is just it just doesn't really wrap it nicely and it's saying well i'm trying to get this because this is this is the raw key in etcd right it keeps everything under slash registry in etcd for some reason so this is the raw etcd path and it's just saying you can't can't find this or can't find some you know some prefix of it yeah so i always use this to configure etcdctl let's copy this yes cool i can never remember how to do that okay what was it registry yeah i don't know if i should be seeing
something or not i'm not on that you have to go um registry slash the namespace slash config maps or config map slash namespace one or the other i forgot it's it's configurable slash namespace by the looks of that error there is a red something like list resources or something like that to show you the tree if i remember correctly yeah there's some kind of like list command isn't there this was my least least favorite part of the cka yeah this is a bit interactive all this etcd yeah i don't remember no we might you want me to google some stuff well we ran get registry services and that worked right we actually got this value back i have an etcd cheat sheet right so this one doesn't work right so why can't we do config maps where did specs come from it's just part of that thing and i don't know how to show the tree so let's try that etcdctl country so i start copying and pasting from this cheat sheet oh yeah go for it it yeah i don't know where it is yeah there we go oh yeah there we go all right so we do have default kube-root-ca what's the error message again i'm gonna copy this but uh it's basically saying that doesn't exist is that the same uh did you copy a registry nice he wouldn't do that to us would he is it is it a utf-8 hack they're not allowed no it actually looks all right okay so that's interesting unable to transform key i assume we're reading this right i mean to me that just says it doesn't exist in the database but maybe that's maybe there's more nuance to that can you use etcdctl to check what it what the value is in there right okay yeah i was assuming that meant the key didn't exist um well yeah we could try to get its value i guess no you tried that didn't you david yeah but we weren't doing it right so i mean i wasn't doing it right um looks like a cert to me i mean it might be garbage but i like i'm not really into it i'm not getting my message the error message to me says yeah no prefix found means that that path in etcd doesn't exist
but it is going to be something subtle isn't it so the perfect this yeah that other message is really i'm gonna google it yeah do it because etcd is honestly my least favorite part you try something else can you get uh config maps from a different namespace like just kube-system namespace just keys uh dash dash keys only oh okay yeah oh we can it's just the default name space hmm all right well it was i had fun but i'll see you later um wow okay what sits between the api server and etcd like nothing that i'm aware of now this isn't like if rbac wasn't allowing us then it would be a much more you know friendly kubernetes level error right the fact this is saying internal error the fact this is making it into the api servers logs means this isn't you know this this is an exception this this isn't uh like in rbac or anything i'm going to take a look at which namespace is that meant to be here well this is the kube-system one right but then what i'm wearing it means i think it's in every name space that was it okay oh right so you can always get it no matter what your rbac is okay are those the same uh maybe i mean i i did a very simple test of the first five and last five characters but i'm going to assume yeah as hash algorithms go that's like nsa proof so yeah and it's i mean there yeah it's there's a few because they're on their serialized protos right so there's you can see the text i think barco's winning a gift card at this rate yeah this is this is cannot transform did you find anything on the error message because what's the transform part about it is encryption related you told us it wasn't barco yeah the restarts aren't encryption right but the rest are due to this issues with encryption yeah the thing spams this into its logs and then quit like the restarts did you do this surely surely okay i'm assuming you've enabled encryption but not if you're not on this key or you've put the old version of this key back into etcd i'm going to assume [Music] he's
encrypted one namespace right he's encrypted one or the other because it works for one and it doesn't work i don't know oh no wait it's done the scrolling bug hasn't done how on earth are we gonna undo that in 15 minutes so i'm assuming he's maybe been nice to us and left us yeah i guess a script or a i wonder what he's done to cilium with uh cilium's deployment um i wonder if we can just do a get from the kube-system namespace right and store it into the default one yeah that's encrypted the other one is plain text so we're going to have to in fact let's take a look at the default one one more time right so i don't think there's anything namespacey in it right it's just a certificate it's just the metadata so we can just write the kube-system one to the default key yeah right yeah okay so encryption's on so it's trying to decipher everything and the one in default is not enciphered so it's it's deciphering the garbage or it's not checks okay this is our encrypted secret but has he only encrypted the one everything else works i know you're right encryption's on then we can read the pods and we can read the deployments and stuff right so yeah okay i think everything but one thing is encrypted and that's a smart way to do it right because we get the error on the one that's unencrypted and we inspect it with etcdctl and it looks okay that's that's a very smart way to uh send us down the wrong path so how do we this is an interesting workaround i mean the solution is actually much easier than this [Music] but now i'm curious if this will work let me get my cheat sheet up um let's try and read the default one again yeah there should just be a put you should put it i've put it yeah okay all right this is great awesome oh currently got it okay so we still need to fix the flapping but i think we might be able to get all the config maps so that api server comes back so let's let's deal with this flapping because that's getting too annoying yeah okay so so that
encryption error was uh was filling up the logs so yeah let's try to find the flapping uh grep v is your friend i think okay we'll just need to wait for that to come back maybe that is why it's flapping because there'll be watchers on that config map and that's just shutting down the api server there will but he said it wasn't which means he's done something else to the api server all right okay i missed that okay so grep dash v less than watch well we might be getting a new log maybe not yet it takes also um you can take a look at in the chat the one from bala yeah the bala says please check the identity tag in the encryption config and this identities or maybe google like identity and etcd encryption config or something so you're telling me we've not fixed it is that what you're saying well you did fix it for that one one config map all right so i thought you were just being cruel and related to one thing this this will be you're right this will be the key id you're actually making it pick up the key or something where's our api server yeah does it do crash loop back off on static pods because this is taking a long time no um okay that can just be a bit temperamental at times it can take up to like four minutes and you could always restart it there we go so okay yeah no all right we're gonna have to fix this properly then our hack doesn't work okay so yeah let's google uh guys that didn't fix it interesting i mean it's encrypted okay so what if you can you recreate that secret by like using through kubectl uh yes so what your suggestion is we could do yeah let's try and do that first system yep [Music] and we can change this yeah so i never changed the metadata as we should probably break it and apply what all right we need to learn how this encryption oh apply yeah do a do just do a delete just do a delete and a create because apply is trying to read it to do a three-way merge and it can't read it so uh smash it hulk smash it it didn't work unless i guess we
do an etcd delete well you were trying to delete the file you were trying to apply right so that will now work right hey okay great there we go another post okay so we need to encrypt everything or uh do this identity hack to make it not really yeah let's let's google that identity thing that we talked about all right so we can set encryption field type to identity oh that'll write that kind of identity okay do you know what that means you want to go for it oh just as it means i like the identity function right as in it means just pass through and don't like encryption type identity just means don't do anything i guess but that doesn't make sense because a lot of what's in etcd is enciphered right except some of it wasn't except something right but if we did both namespace config maps weren't encrypted right okay you're saying we can get default working again if we do this we might break some other stuff right so like if you google like etcd encryption config or something oh we can do it by namespace okay i love if you google for etcd the first hit is kubernetes even if you don't specify it okay so there is an identity right there do i have that there no right so basically i encrypted all namespaces except default so it wasn't but because there was no identity there it wasn't able to read those because they were not encrypted oh right we needed a configuration and an existence the reason i did that is because if you decide to just turn off encryption that would break the rest of the cluster because then you would be able to read that that namespace but then all of the control plane config maps will not work because they are now encrypted and would break the entire cluster they found that happening clever so there's a much easier way than me trying to retrofit the encryption then like so yeah which is clever because you basically just so one thing is if you recreate the secret or the config map or whatever resource then because then it will encrypt it using the
ec config and replace it so that's why when you use for kube-root-ca from the other namespace it worked um or you can just add that one line identity in the config and then now it's able to read unencrypted stuff uh by default i got it nice and i had no idea that was the thing it refuses to do no encryption that's great isn't it it refuses to i think it's optional right so stuff can be encrypted or unencrypted um i mean maybe maybe correct yes if you have identity in there then it's both okay so we have four minutes and we still need to fix this postgres so yeah so i'm going to cheat a little bit he was definitely messing with cilium because there's some cilium config in temp so oh that's me that's part of the cluster bootstrap process i believe okay yeah i did not mess with cilium because all of that cilium stuff flapped like the fact that the password is wrong okay so the password is right here i think postgresql123 is correct i believe the user is postgres and the database is clustered in fact our health checks used to use our postgres so and it's hard coded into the into the rust app it's not reading from the environment or anything he could fit in i should know but i i i will admit that i i wrote this a while back and uh i've abandoned it ever since so let's check because my immediate thought again was was ebpf right i mean what could make the password be wrong like an ebpf filter that's changing it on the network no the hardest thing with etcd encryption everything else should be relatively easier to fix and figure out i mean it's weird because this is hard-coded assuming it's our image which it appears to be um password authentication failed for user postgres which means probably modify the stateful set what's the generation on this generation nine i would have applied this once um i hate computers yeah i hate people too so it's tricky isn't it um i mean that looks okay sql one two three it's not anything silly like it's not postgres passwd or anything
silly is it i mean so this is a relatively stateless thing right so you're going to modify the end the pod i'm just going to delete it and have the database recreate yeah i i might be on out of date docs but it looks like the environment variable is called pg password or one word you sure that's why i found some docs that said that for the latest version but have you recreated it have you got the original yaml sitting around can you just recreate it it didn't work oh yeah i do have the original yaml because a lot of what i did was with kubectl edit if you just thrown the original yaml over the top of all the mine like you'd have fixed like six of my hacks all at once oh don't do that can we take a look at the yaml again in the cluster there it is password okay that's what those dots run about okay let's work this there okay stateful set 964 it keeps scrolling down oh you son of a bit oh what oh man that's great nice very nice you made me laugh out loud oh i like it our api server flap oh we still go to the bottom of that so what causes pods to restart health checks interesting yeah so like kube-apiserver and kube-scheduler and those they have health checks right right i mean lots of things right i mean you could have been you could have been making the node not ready you could have caused the thing to exit one because it doesn't like its config file but yeah health checks also is it possible that the health check is not healthy for some reason it's doing https is this i could have believed it was because it needed this ca bundle to be able to verify the serving certificate of the api server but that's back now yeah i don't think livez is the real endpoint is it is that what you changed i actually think is i think it is yeah i don't like enough every time i see it i'm like that can't be right i mean like yeah it's probably gonna be like hard there's just a small little change i mean go up um so there it is uh where is it right yeah okay so yes it's making the the
health check fail that's nice actually because that's honestly like an error a user could make because if you're installing your own cluster you see something like that and of course you turn it off like it's it's actually a very yeah you're like right and launchable thing yeah to get wrong awesome i think that should be it should that be all of it well i've never actually got to edit our postgres oh you gotta oh okay and you guys failed yeah oh i don't think you took enough uh the api server died before i could save the alter i don't think you're taking enough lines out of that have a look at the indentation again i'll just try again oh yeah my whole idea was that you would fix this first and then try to recreate the pod but because this pod uses config map in this name space it would fail and then you'll have to go figure out what's going on with etcd that was a hard one i've gotta say you still haven't fixed it oh postgres why is it not when i changed the spec oh you need to yeah you need to kill it right yeah i shouldn't have not when i changed the stateful set there oh you did the stateful set should have rolled it well unless someone changed the semantics there we go we're two minutes late but we did get the dance not not bad all right those were two really tough clusters i gotta say um etcd is always i've had the apparmor etcd number things i'd like to see that was a tough one barco and that was that was very good i thought i'd squeeze that that's why i squeezed the cpu right because i'm like etcd is just when it goes wrong you never know how to fix it so i thought if i can just make it sort of misbehave sort of be really slow and annoying everybody's just gonna be like ah etcd i don't understand it because it's always the thing that gives you cold sweats when you see this yeah because how often do you have to use etcdctl like never right really yeah i didn't make a backup of it there's a hints folder yeah there was a backup of etcd completely unencrypted
and there was a backup of there was the the yaml file for the encryption configuration with the identity in there so just in case we got lost um we could fix those in case some idiot starts writing random bytes to etcd yeah you really could better keep it back yeah well i did it for me because i was doing stupid things with etcd all right well thank you for for joining us today uh those were great those are really good you know what it's the best episodes are the ones where i walk away confused but learning so much and i got that from both of your clusters so i really appreciate the effort you just put in there and for joining me during the debugging as well uh thank you to the audience for watching us and all your comments they're always very helpful these are always one step ahead of us and then thank you to teleport for sponsoring klustered we will be back next week with more broken clusters and more pain for me pop if you're still watching you definitely owe barco that gift card there he definitely destroyed that all right any final words before i say goodbye to you both for today uh this is fine thanks a lot i'm honored to be here yeah nice time and thanks uh thanks for the breakage i uh lost some stuff as well all right have a wonderful day enjoy your weekend thanks a lot i'll [Music] [Applause] thank you for watching [Music] you
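A rough sketch of the encryption puzzle from this episode, for readers who want to see why one unencrypted config map produced "no matching prefix found" until an identity provider was added. This is a simplified model written for illustration, not the API server's actual code: the `Provider` class and `provider_for_read` function are hypothetical names, the key name `key1` and the sample byte strings are made up, and the only behaviour assumed is that encrypted values in etcd carry a `k8s:enc:<provider>:v1:<key name>:` prefix while identity reads values that have no such prefix.

```python
from dataclasses import dataclass
from typing import Sequence

# Marker that encrypted values stored in etcd start with.
ENC_MARKER = b"k8s:enc:"


@dataclass(frozen=True)
class Provider:
    """Hypothetical stand-in for one entry in the providers list."""
    kind: str            # for example "aescbc" or "identity"
    key_name: str = ""   # ignored for identity

    def prefix(self) -> bytes:
        # identity stores data as-is, so it has no prefix of its own
        if self.kind == "identity":
            return b""
        return ENC_MARKER + f"{self.kind}:v1:{self.key_name}:".encode()


def provider_for_read(stored: bytes, providers: Sequence[Provider]) -> Provider:
    """Return the first provider able to read the stored value."""
    for provider in providers:
        if provider.kind == "identity":
            # identity only accepts data that is not marked as encrypted
            if not stored.startswith(ENC_MARKER):
                return provider
            continue
        if stored.startswith(provider.prefix()):
            return provider
    raise LookupError("no matching prefix found")


# Illustrative values only: one encrypted entry, one plain entry.
encrypted_value = b"k8s:enc:aescbc:v1:key1:" + b"\x8f\x12ciphertext"
plain_value = b"k8s\x00plain-serialized-object"

aes_only = [Provider("aescbc", "key1")]
aes_then_identity = aes_only + [Provider("identity")]

print(provider_for_read(encrypted_value, aes_only).kind)
print(provider_for_read(plain_value, aes_then_identity).kind)
try:
    provider_for_read(plain_value, aes_only)
except LookupError as err:
    print("read failed:", err)
```

With only the aescbc entry configured, the plain value in the default namespace cannot be read, which matches the error seen in the logs. Adding identity after aescbc lets both encrypted and unencrypted values be read, which is the one-line fix described near the end of the episode.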
Info
Channel: Rawkode Academy
Views: 634
Id: -k5y2C6HNa0
Length: 95min 34sec (5734 seconds)
Published: Thu Sep 09 2021