Cloud Native Runtime Security with Falco - Kris Nova, Sysdig & Abhinav Srivastava, Frame.io

So we're gonna get started here. Look, I'm gonna take my coat off. We have about 90 minutes today, and depending on how many questions folks ask, or how often people raise their hand in the middle of my demo to ask about what I'm doing, we may or may not take up that full amount of time.

Just to set expectations for what's going to happen here: we have basically two correlated presentations that we're going to merge into one longer session, and it's going to tell a very interesting story about runtime security and Kubernetes, and a tool I'm working on that a lot of folks on the team and in the community build, called Falco, and how that's being used concretely in production in a way we didn't expect it to be used. I think that makes for a good talk: any time there's a unique set of constraints, or something happens that you weren't planning for.

The overall pattern of how we're gonna do this today: I'm going to do a very low-level, hacker-style presentation here in the command line, and I'm gonna be setting the stage, pun intended, for Falco and what we're going to be talking about. While I'm up here you're gonna learn about the process, the project, the CNCF; you're gonna see it running in Kubernetes; we're gonna do some scary things in Kubernetes; and we're gonna play with the tools concretely to understand this open-source tool, where it came from, and where it's going. Then Abhinav here, my co-presenter, is gonna come up and talk about how he uses this in production, in a way that's similar to what I'm gonna be talking about. But I think when you hear from one of the maintainers of the project why we did what we did, and then see how Abhinav and the folks at Frame.io are using it, it's a pretty good thought experiment, just to remind ourselves that when you're building open-source technology, there are things you may not be prepared for that could happen. That's the general sentiment we're going to be going through today: this concept of things you weren't prepared for, and how you prepare yourself for them. How do you get ready for something you don't understand, or don't anticipate is going to happen?

So anyway, we're gonna run my slides in the terminal; my talk here, all of my slides, everything I do is going to be in the command line. I don't know if we can maybe dim the lights. Does anybody want me to switch to a higher-contrast view here? You don't have to raise your hand, you can just shout it out; if you want to move to the front, if you want to see more, this is about what it's gonna look like on the screen for folks at home. I see lots of thumbs up. Okay, last chance to shout out for higher contrast, going once. Okay. Again, if at any point you're missing something or you can't read something, raise your hand; I can zoom in or we can switch. I'm giving this talk because I want you all to walk away feeling successful at the end of it.

And speaking of you, I'm about to talk about me for a few minutes, but first let's find out more about all of you lovely people in the audience, and we'll cater my presentation to the folks here. I do this at the beginning of most of my talks, and I've been
doing it at KubeCon for a few years now, and it's really interesting to see how the ratios have changed over the years. So: hands in the air, who's running Kubernetes in production? The entire room just put their hand up. Hands in the air, who here has run a Docker container? Who here has written code to launch a container, in C? Leo, my main man, has, and we have another one back over here. Okay, we might do that today if we get there; we'll see how this goes. And who here knows what Falco is? Who here has run Falco? Okay, and who here is running Falco in production? Okay, you should see me afterwards; we'd love to talk more, get you some free Falco swag, and see if we can't get you into our official adopters .md file. Speaking of Falco swag: if you answer questions that I ask, I have limited-edition magnetic Falco pins up here, and we can get a Falco t-shirt sent out to you after the conference as well. So if I ask questions, there's some incentive for you to answer.

Okay, so without further ado, we're going to begin with these illustrious slides of mine, in very high quality: 64 columns of pure Unicode. There's our first slide. This is actually cool; I had to invent my own markdown language for this, and you can sort of see how it works. Oh, and really quick, let me just change something: we're gonna start over in my Go source tree, github.com, the falcosecurity slides. I want to make sure folks have a link on the screen, so we'll pull up github.com, kris-nova, my public-speaking slides, bam. Now you should be able to get your cell phones out and take a picture of the bottom left corner of the screen; there's a bitly link there if you want to grab a photo of it. Everything I'm gonna be talking about, and everything Abhinav is going to be talking about, you'll be able to reach from that bitly link, and it's going to be on every slide, so feel free to take a picture of it whenever you want. That's going to help us justify that what we're talking about today is important and that people care.

So yeah, we're gonna talk about cloud native runtime security with Falco. I am your host for the first half of this; my name is Kris Nova, and I work at a company called Sysdig. A little bit about me: I'm an O'Reilly Media author; I wrote the book on cloud native infrastructure, literally. If you have an opportunity to buy several copies of it for your friends and family, please do, and then buy some more copies. And I'm a maintainer; I actually just finished a CNCF interview today about all the projects I've maintained, currently maintain, or have worked on. I've been around cloud native probably since 2015-ish; I've worked on kubicorn, kops, kubeadm, SIG Cluster Lifecycle, SIG AWS; I've contributed to Go and to Terraform; there are a lot of things I've worked on. I just like to help out the community, I'm constantly hacking and tinkering on things, and over the years I guess people have enjoyed the things I have to say, and here I am.

So now let's talk about this thing called runtime security. Who here thinks they have a good idea of what runtime security means? We have a couple of hands — okay, pretty good, some hands go up. I'm gonna talk about what I think runtime security is, and why I think it's important. And you know, my opinions here aren't necessarily the right way; they're just my opinions, and I encourage
everybody here to question this and to start thinking about security in general. Because if you look at my career, honestly, over the past five years in cloud native, there's been this pattern of bits and pieces of the cloud native ecosystem, particularly around Kubernetes, that were incomplete. The first one I was very excited about solving was the infrastructure problem: when Kubernetes first came out, we said no, we're not gonna tell you how to install Kubernetes, that's outside the scope of the project. That was frustrating for folks who, well, frankly, wanted to install Kubernetes, and so that was one of the first tools I worked on. And if you look, there's been this pattern of the space growing, of different subsections or components of cloud native around Kubernetes growing, based on these gaps in what we have easily accessible tooling for. I think we're at the point where Kubernetes is starting to level off, and there are maybe one or two really big holes left in the global pie chart, and in my mind one of the biggest is security. How do you secure a Kubernetes cluster? What does security even mean? Where does it start, where does it stop? I've been on Twitter, I've written a few blogs about this, and I think just starting to question what security means, and what runtime security means, is important.

So, the three kind of buzzwords that come to mind when I think of runtime security. The first is that fancy term, anomaly detection, which in my mind basically means: something bad or confusing happened that we weren't prepared for, and we just want to know about it, or at least be able to go back and get some information about this thing we weren't expecting. Then forensics: this is the detective work. We've all seen Detective Pikachu, right, with his hat and his magnifying glass, going in and actually trying to solve the mystery of the privilege-escalated container of 2019. And last but not least, stability. Runtime security, in my mind, is our last line of defense — we'll talk more about that in a moment — and it's critical for this layer of software to be stable. We don't want the one piece of software we're counting on in an emergency to save us to break our computer, or to crash, or to not work, or to fail. So I think it's absolutely critical that this is stable, and we're gonna look at some interesting patterns in Kubernetes and in the Linux kernel that help ensure we're writing some pretty stable code here.

Okay. I've been talking to a lot of folks here at KubeCon over the past few days, and on the internet over the past few weeks, and one of the big pieces of feedback I've gotten is that we're in a situation very similar to where Docker was a few years ago: we have some language and naming confusion. We all know naming is the hardest part of writing software, so I'm going to spend a few moments clarifying three very important actors in the field.

The first one: Sysdig, the company. The company who pays me; I work there. We're a bunch of low-level hackers; we hang out all around the world, we have an office in San Francisco and a bunch of really smart folks I get to work with, and we write a lot of C and C++ and get to do fun stuff with the kernel. Sysdig the company is effectively an evolution of Wireshark. Wireshark — show of hands, who here has run it?
Yeah, everybody just put their hands up. Our founder, Loris, a great friend of mine, was one of the original authors of Wireshark, and if you look, there are some similarities in what Sysdig does. The big thing to imagine here is that Wireshark asked: hmm, if we want to find the ultimate source of truth — in my mind this is like the quantum particle of computer observability, the most fundamental element that can always tell us exactly what's going on — what is it? The original thesis was the network: if it's a TCP packet, that's the ultimate source of truth; you can't get any smaller than that. And like good scientists, we continue to push our field forward. Now that we're in cloud native, we effectively have virtual networks: we could have two containerized processes running on the same node, talking to each other over the loopback interface, believing they're talking across some network, when really it's just a synthetic network the kernel is allowing us to create. So we've taken it a step further, and we said: the network is good, but the kernel is better. That's the philosophy behind the company.

And we see that reflected in one of our open-source tools, which we call sysdig — the same name, very confusing, right? Anyway, this is a CLI tool; show of hands, who here has run it before? It's similar to strace, except it works globally, and it has more enriched metadata about what's going on in the kernel. Basically, when you look at how a process interfaces with the kernel, it ultimately translates everything it does through the syscall interface, usually via a library like glibc, and those calls go into the kernel. What we do is simply track those. We trace — it's called kernel tracing — every event in the kernel is a function executed, and we trace those events and send them back up above the kernel into what we call user space. Then, with those metrics, we have this crazy idea that we'll be able to tell some sort of story — which is where Falco comes into play. In a moment I'll do a demo: I'll show you the sysdig tool and the firehose of data that comes out of it, and then we'll run Falco and you'll see it's much neater, cleaner, and more human-readable — much friendlier for actually putting together a story, or doing that detective work of figuring out what's going on on a system.

So Falco is a CNCF project. It basically uses two open-source libraries, libscap and libsinsp. We take kernel metrics and enrich them, then go a step further with the Docker context if you're using Docker, or the CRI-O context — we're able to pull metadata out of the system and enrich our kernel tracing events with it. We can also do that with the Kubernetes metadata and audit logs. So we're able to put together this really nice enriched object that spans multiple layers of the stack, starting at the lowest, most fundamental layer and going all the way up to Kubernetes, and we can tell a story using all of these inputs. We're going to explore Falco a little today. Again, it's a CNCF project; if you go to cncf.io you can find more about Falco, and there'll be some links here — that link at the bottom will get you there as well.

Okay, so: the history of Falco. It was originally created as an engine to parse libscap and libsinsp.
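(If you want a feel for that firehose before the demo, here's a rough sketch of poking at the sysdig CLI yourself. The filter syntax is real; the output line is approximate and will vary by system.)

    # narrow the firehose down to openat events, ignoring sysdig itself
    sudo sysdig evt.type=openat and proc.name!=sysdig
    # each line is one kernel event, direction '<' = return, '>' = enter, e.g.:
    #   1234 12:00:00.000000000 0 touch (4242) < openat fd=3(<f>/tmp/banana) ...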
When we say engine, what we basically mean is: it's those two libraries wrapped up in a for loop, running as a daemon with a signal handler. Very, very fancy, I know. It was originally created at Sysdig and donated to the CNCF. I joined the company about three months ago, and I've been working on making this as hygienic and healthy an open-source project as possible. So if you're interested in security, we'd love to invite you to come join the party, hang out with us, and start hacking on the kernel. Right now the majority of Falco is written in C++. Raise your hand if you write C++ and you like it — okay, two hands. Raise it if you like Go, Python, other — yeah, everybody just put their hands up. We'll talk more about this later, but we're working on building some handy API interfaces for Falco — which is written in C++ for a number of reasons, particularly when it comes to auditing the kernel — so that we can effectively have a client-go, or a client-python, or a client-ruby, and you can start building tooling on top of Falco without worrying about the nitty-gritty C++ implementation underneath.

Here's my very fancy diagram in ASCII art. Let's start at the bottom. I feel like a lot of folks, especially in security, like to start at the top and go down; we're gonna flip that around and go from the bottom up. The bottom half — you can see that line in the middle — represents the Linux kernel, and here we have two main avenues of running code in the kernel. One is the more conventional approach, the kernel module; the other is a newer approach, eBPF. Who here has written a kernel module? Who here has written an eBPF probe? Okay, I think Leo's put his hand up for every one of these. Basically, both of these solve the problem of: hi, I'm a software engineer, and I would like to run logic in the kernel. Somebody said on Twitter the other day — I forget who it was, I'll add it to the comments after this — that eBPF is to the kernel as JavaScript is to HTML. What that means to me is: you're able to create custom logic and run it in the kernel, and that's how we're able to pull these metrics up and start understanding what's going on at the kernel level.

The problem with the kernel module approach is that if you write a bad or buggy kernel module, you can break your system: you're hooking into the kernel's processes and taking some action, and if something goes wrong while you're taking that action, the kernel has to halt on it, and you can put your computer into deadlock pretty easily with a very simple bug. That's where eBPF comes into play, and eBPF is brilliant, because the machinery that runs it is already written into the kernel, and it's been tested; it's gone through the same Linux testing suite used for all the kernel releases. We basically just tell the kernel to turn on or off various pieces of eBPF that run in a virtual machine inside the kernel. So we're not writing custom kernel code; we're just saying: hey, kernel, you weren't executing this code before, but would you mind please also executing this little bit of code along the way? Then we can pull that up into userland and start to make sense of it. And one of the features eBPF supports is, you guessed it, kernel tracing.
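(Falco ships its own eBPF probe, but if you want to feel the "JavaScript for the kernel" idea on your own machine, a one-liner with bpftrace — a separate eBPF front end, nothing to do with Falco — does the same kind of tracing. A sketch, assuming a reasonably recent kernel and bpftrace installed:)

    # print every openat() on the box: which process touched which file
    sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat
        { printf("%s -> %s\n", comm, str(args->filename)); }'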
Any time you execute a syscall, you're then able to begin to understand those syscalls up in userland. So those are the two approaches. The caveat is that eBPF requires a newer kernel, 4.14-ish, to execute a lot of the stuff we use in Falco, whereas if you're running an older kernel, the kernel module is going to be the way to do it. Obviously there are pros and cons to each.

That then feeds a ring buffer, where we have 16 megabytes — I know, that's a lot — 16 megabytes of memory per core of information coming up from the kernel; this is just the stuff going on on your system. Then we have those two libraries in user space that pop events off of that ring buffer. The beauty of the ring buffer approach is that as it circles around — imagine a clock, the hand going around — it cleans out the previous events ahead of it, and it won't overwrite events until it has circled all 360 degrees back around. Which means at any given instant, if this has been running for a while, we have 16 megabytes of history we can go back and observe if needed. This is where we start to tell these stories, start to gather information, and do what we call runtime security: we can respond to events at runtime, effectively, by looking back in time through this buffer in memory. All of that goes to Falco, and all of that can ultimately go to a gRPC endpoint that you lovely software engineers can use to plug into anything you want. We'll do a demo later so you can see concretely how this works.

So let's talk about Falco. Now that we understand how the syscall information gets into Falco, let's talk about the other input streams. Obviously the syscalls we pull out of kernel tracing, through either the kernel module or the eBPF probe, are important — that's probably the most exciting bit of what Falco can do. But like I mentioned earlier, we also get the container context: the container ID, the image the container is running, all this wonderful metadata about what's going on. And the last input stream — actually two that I've munged into one — is the Kubernetes metadata: what's the name of the pod, what namespace is it running in, what are the labels; plus the audit log information: who did what, and when did they do it. Again, this is where we start telling a story, being a detective, understanding our systems at a fundamental, full-stack level — not just at the kernel, not just in userland, but spanning the entire system.

With Falco, we simply assert these input streams against a set of rules — we'll look at some of those rules in a moment — and if one of those rules is violated, we plumb it through to an output. I've put three outputs up on the screen; we have more coming, and there's actually a lot more Falco can do, but I picked the three I think will resonate with you most. The first I've mentioned a few times already: we have a mutually encrypted TLS gRPC endpoint, and we have clients that generate code using protobufs that you can consume that endpoint with. Second, we have a webhook: in the exact same way the Kubernetes audit log will reach out and send a request to a configured webhook, Falco can do the same thing; Falco can curl an API, Falco can effectively take any action you want over HTTP or HTTPS. And last but not least — and this is great, because a lot of the C and C++ engineers I work with consider this a higher-level feature of Falco — it can go to standard out, which is what we're going to look at in a moment. But as you're watching standard out, I want you to imagine that this could effectively be plumbed through to anything else as well, and we have some demos on that; if you want to see them, come find me afterwards.
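(For the curious: those outputs are toggled in falco.yaml. A rough sketch of the relevant knobs — these key names match recent Falco releases, but check the defaults that ship with your version:)

    # falco.yaml (sketch)
    json_output: true
    stdout_output:
      enabled: true
    grpc:
      enabled: true                  # the mutually-authenticated TLS endpoint
      bind_address: 0.0.0.0:5060
    grpc_output:
      enabled: true
    http_output:
      enabled: true                  # webhook-style delivery
      url: http://localhost:2801/    # hypothetical collector endpoint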
Okay, so let's talk about the problem of runtime security. Now that you roughly understand Falco and how it works, let's talk about how we got here: how did we go from the idea I'm about to explain to the code and system we just looked at? I'm gonna ask a few questions — the questions we asked ourselves as we started exploring how to solve this problem.

What happens when access control fails? What we mean by that is: a lot of security is implemented by preventive action — locking the door, setting an alarm. If we can prevent folks from doing things we consider malicious, we'll secure our system, and that is a fundamental part of a secure system, probably the most effective way to get started with securing your system. The concern is that every time somebody has a catastrophe — a car is broken into, a bank is broken into, your house is broken into — something happens that you weren't anticipating; somebody attacks you, and they usually don't care whether your door locks are set. They've found some way to bypass the door locks. This is where the second layer comes in: it's maybe only applicable a very small portion of the time, but it's applicable during the most important time — when something happens that you weren't prepared for. In my mind, we need both of these in place to truly secure a system. So: what do we do when catastrophe strikes? And last but not least: after catastrophe strikes, how do we tell the story? How do we look at all of the data, actually make sense of it, and be able to say: this person, from this IP address, came to this container, did this, and installed a Bitcoin miner, and here we are — and I can give you all the information to demonstrate, beyond a shadow of a doubt, that this is in fact what happened. That's what we try to solve with Falco.

So here are the three common approaches to solving both prevention and forensic analysis — the surveillance after a catastrophe has happened. The first uses a concept called LD_PRELOAD, which I'm going to keep high-level; there are a lot of great resources on the internet, and I don't want to bore folks by reading the LD_PRELOAD man pages — which I have done in the past; you can go to YouTube and watch me read them on stage, it's quite fun. Basically, to use LD_PRELOAD, you'd write some code, usually in C, and you'd effectively hook into the C standard library, glibc. When an executable program, written by you or another engineer, tries to call into the C standard library, and you've loaded the LD_PRELOAD library you wrote correctly, you can intercept that function call and take some action — in most cases a small amount of action, hopefully nothing too invasive or blocking — and then hand off to the original function the user was trying to call.
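(A minimal sketch of that trick — nothing to do with Falco's actual code — a shim that logs every open() and then forwards to the real glibc implementation. Drop this in shim.c:)

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdarg.h>
    #include <dlfcn.h>
    #include <fcntl.h>

    typedef int (*open_fn)(const char *, int, ...);

    /* Intercept open(): log the path, then hand off to the real function. */
    int open(const char *path, int flags, ...) {
        static open_fn real_open;
        if (!real_open) real_open = (open_fn)dlsym(RTLD_NEXT, "open");
        mode_t mode = 0;
        if (flags & O_CREAT) {   /* a mode argument is only passed when creating */
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, int);
            va_end(ap);
        }
        fprintf(stderr, "[shim] open(%s)\n", path);
        return real_open(path, flags, mode);
    }

    /* Most 64-bit binaries actually call open64; alias it to the same hook. */
    int open64(const char *path, int flags, ...) __attribute__((alias("open")));

(Then build and try it:)

    gcc -shared -fPIC -o shim.so shim.c -ldl
    LD_PRELOAD=$PWD/shim.so cat /etc/hostname    # watch the intercept fire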
Now, this is great, because it seems to solve the problem: all we have to do is wrap every publicly exposed glibc function, write to a log file, and we can start to see what's happening behind the scenes. There's a problem, though — who here has written Go? Okay. Go doesn't use glibc, so right away that's not going to work here. That was one of the things we explored, and we said: okay, LD_PRELOAD is good, but it can be better.

The next thing we looked at was sidecars. Who here has run a sidecar before? Great. Sidecars run in the same context, the same system space, as the original process. If you read the Kubernetes documentation, it says a pod has shared networking and shared storage, which means if I have two container processes running in a pod, they can talk to each other via localhost, 127.0.0.1, which is great. But if you look at how containers are actually created, and what a container looks like at runtime, that sidecar is effectively in the same system space as the original container — in other words, in the same set of namespaces, and probably cgroups, as the original container. Meaning that if somebody were to escalate out of that container, or do something outside of that context, the sidecar would effectively be blind to it, and the Linux kernel, with all its wonderful namespacing goodness, could be used against the very container sidecar we were hoping would secure our process. So we said sidecars aren't going to do it.

Then we looked at kernel modules. A kernel module lets us write arbitrary code in the kernel, but like I mentioned earlier, that could potentially crash the kernel — so it's a little better than LD_PRELOAD, but there's still some risk. Furthermore, as if we're not already scared enough apt-get-installing random package.dpkg files from the internet, we'd now effectively be telling people: yeah, just trust us and run our kernel module in your kernel — we're a security company, we're not gonna do anything too scary here. So there was some concern, particularly about loading custom logic into your Linux kernel, probably the most intimate part of your system.

Okay, let's summarize. We looked at those three approaches; we've understood that kernel modules are the best of the three, but eBPF is really where we want to move going forward. And to restate the problem, looking at it differently now that we know the possible avenues of solving it: the first thing we want to solve is prevention. In the same way that SELinux prevents you from executing various bits and pieces of the kernel, we'd potentially want to prevent folks in our system, such as a Kubernetes cluster, from taking action. This is where tools like RBAC or OPA come into play: you can effectively say no, you have your labels configured incorrectly, or no, there's some string that doesn't match the pattern we've defined, and we're going to prevent you from taking action — we're going to tell you no. If you look at the way admission controllers work in Kubernetes, they basically let you hook into a request and reject an attempt to change or mutate an object in the API server. So that's prevention, which again is the first step most folks take, and probably the first step I would encourage folks to take.
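(Prevention at its simplest, just to make it concrete — this isn't from the talk, it's the API server saying no before anything ever runs:)

    # ask the API server what a low-privilege service account may do
    kubectl auth can-i create pods --as=system:serviceaccount:default:default
    # no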
But prevention doesn't always complete the story. If we want to start taking security seriously in the enterprise, we need to look at how we respond to events we aren't prepared for, which is where runtime security and forensics come into play. Again, these are useful in the very small edge cases, but if you have this installed, it's going to be the difference between a total catastrophe and at least a catastrophe you can understand, prevent going forward, and draft some policy around. Again, this is our last line of defense: when all else fails, where do we turn? What do we look to to solve our problems and make sure this doesn't happen again?

So I'm gonna give a quick history of eBPF — I hope folks get excited about this, and we'll talk later about where to find out more — but for now, a quick, high-level overview for my demo, so you understand what I'm showing on the screen. Who here has heard of the Berkeley Packet Filter? Okay, and who here knows the difference between that and eBPF? Cool, a handful of folks. The analogy I like to use is: BPF is to eBPF as vi is to vim. It's the extended Berkeley Packet Filter — an improved version of the original BPF work. And as I always say: BPF, we're gonna party like it's 1992; this has been around for quite some time. So let's explore what it is and how it works.

Okay, let's remind ourselves of our use case. We want to trace all kernel events into and out of the kernel. Every syscall has two fundamental parts: the syscall coming into the kernel, and the return value coming out of the kernel, so there are two events we want to trace. We eventually want to enrich that data with metadata about what was happening. We want this to be stable — we don't want to risk breaking our system or running a potentially malicious or invasive kernel module. And we want to be able to audit all of this from userspace. So we have a pretty tall order here, if you think about what we're actually asking the kernel to do, and eBPF in this case was a perfect solution.

Again, look at the problems with the kernel module approach. Number one, it was invasive: folks were hesitant to load new kernel modules into production, for good reason. And it was risky: buggy kernel modules can crash kernels and block kernel processing, and no matter how well tested your code is — as I've been saying this entire time — things happen that you're not prepared for. There are always edge cases, and there's no such thing as perfect technology. eBPF gets us closer to that zero level we're so desperately trying to reach.

How eBPF works: like I mentioned earlier, the code is already written, and we're just turning it on. There's a virtual machine compiled into the kernel — in fact, you can go to github.com/torvalds/linux, I think in a top-level bpf directory, and actually see the code folks have been working on; you can see it's been committed to very recently. It runs as a virtual machine, it builds these dynamic maps in memory, and we can access those from user space. Furthermore — and this is another interesting improvement over kernel modules — we can dynamically load, via the bpf() system call, the BPF program we want to start using.
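(If you want to see that in-kernel virtual machine and its maps from user space yourself, bpftool — packaged in most distros as part of linux-tools, nothing Falco-specific — will show you what's loaded. A hedged poke-around; the ids will differ on your box:)

    sudo bpftool prog list       # eBPF programs currently loaded in the kernel VM
    sudo bpftool map list        # the shared maps user space can read
    sudo bpftool map dump id 42  # dump one map's contents (the id is illustrative)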
And we can use BPF for networking analysis, for kernel tracing — there's a whole arsenal of things it can do for us, and again, the code is pre-written and we're just taking advantage of it. If you're interested in finding out more — I'm keeping this high-level because we have a lot to get through — there are two links here you can snap a photo of, and they're behind that bitly link as well. The first is a blog written by an engineer at Sysdig, Gianluca; he's the one who wrote the eBPF probe we use to this day. The second is a link to the BPF observability in Linux book from one of our engineers, Lorenzo Fontana, which we've been giving away at the Sysdig booth. If you want a free PDF, go to the bottom link; if you want an actual hard copy of the book, come find me, or Leonardo, or Lorenzo, or join the Falco Slack, and we're happy to figure out a way to get you one. It goes into great detail, with runnable examples hosted on GitHub that you can start tinkering with.

Okay, so I need somebody to shout out a word, because we're gonna run an experiment — the sillier the word the better. Anybody, say a funny word. Banana? Okay, so we are going to run the banana experiment. What we're gonna do is demonstrate using the open syscall. This is a very simple syscall: it basically just opens a file for reading or writing using the kernel. Then we're gonna replicate the banana experiment — it's a good word, I like it already — in three different environments: one in raw Linux, one in a container, and one in Kubernetes. We'll perform the exact same experiment in all three environments and see what the kernel does, using Falco. Falco will give us visibility into the kernel, and we'll see what kind of information we're able to get; you'll actually watch me do something scary, and we'll see what Falco has to say about it.

The goals here: we want to understand the importance of surveillance, of tracking these logs over time, and we want to understand complete system auditing. This should show you that auditing a system from the kernel up gives you the information you need regardless of how the banana experiment is run — as a regular old Linux process, in a container, or in a pod on Amazon — and the banana experiment has some pretty interesting outcomes from running the same thing over and over again. How we'll do this: I have Falco running as a DaemonSet in Kubernetes — that will be our final example — and we're gonna do open in native Linux, open in Docker, and open in Kubernetes where Falco runs as a DaemonSet. Hopefully this shows you how Falco works, and you'll understand why we did what we did. And then we're gonna get Abhinav up on stage to tell you how he's using this in a completely different way than we expected, and how it's still useful for him and his team.
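(If you've never looked at those two halves of a syscall — the call going in, the return value coming out — you can see the pair with plain strace before we ever touch Falco. The output below is approximate:)

    strace -e trace=openat touch /tmp/banana
    # openat(AT_FDCWD, "/tmp/banana", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
    # everything before the '=' is the call entering the kernel;
    # the '= 3' is the kernel's answer coming back out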
So let's look at Falco. The first thing I want to do is run sudo falco. Can folks see this okay? Yep, cool. You can see not very much is happening, because Falco is taking this firehose of syscall information — oh, this is Chrome yelling at me right now — and asserting whether it violates a rule. Chrome just did a setuid, which violated a rule, and Falco yelled at us. If you want to see what the raw syscall information coming out of the kernel looks like, without that set of rules filtering it, we can run the sysdig CLI tool — I'll enter my password — and as you can see, there's a steady stream of data coming out of the kernel right now. Most of this is my GNOME terminal, but I probably also have Zoom or Slack doing some crazy stuff behind the scenes. You can see we have reading, we have writing, we have syscall information, we have the arguments passed to these syscalls, and you can see the process ID, and whether each event is going into the kernel or coming out of it. So this is great, but it doesn't really tell much of a story unless you know exactly what you're looking for.

What Falco does is let you write rules that say: if we detect this pattern in the data, we want to know about it. So in this example, when I ran Falco in the other terminal, you saw this notice — Chrome doing a setuid — and Falco said: okay, this isn't super invasive, but you probably want to know about it. Now we're gonna do something a little more invasive here on my laptop. In a third terminal — I'll zoom in — I am going to touch /usr/local/bin/banana. B-a-n-a-n-a. Did I spell that right? Okay, yeah, sorry, I'm basically singing Gwen Stefani. Anyway, we're gonna touch this file, and to create an empty file we of course have to call open, and we need to escalate to root to do it, because /usr/local/bin is of course protected. So: sudo bang-bang, enter my password, and now let's go see what Falco has to say.

We scroll down, and we have some really interesting bits of output. The first says: notice, shell history has been deleted or renamed. What's happening there is that I'm running different terminals as the same user, so whenever I create a new terminal session — which I just did — bash history resets, and Falco doesn't like that, because it's potentially somebody covering their tracks. So again, using this syscall information we can start telling a story. And if you look at the end of this output, you can see some N/A fields — these will become relevant in a moment when we run in a container — particularly the container image: notice there's no container image defined, because we're not running in a container right now. And look at this one — we got an error this time — it says: file below a monitored directory opened for writing, and you can see we're actually getting meaningful information out of the kernel: the user, for instance, and the command that I ran, /usr/local/bin/banana. We're starting to paint a picture here and could start taking action. And remind yourself that while this goes to standard out in this example, you could plumb it through to whatever software you wanted.
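(Those rules live in YAML. Here's a minimal custom-rule sketch in the spirit of the stock "write below monitored dir" rule — the field names are real Falco filter fields, but the rule itself is made up for the banana experiment:)

    # append to /etc/falco/falco_rules.local.yaml (sketch)
    - rule: Banana file opened for writing
      desc: Someone opened /usr/local/bin/banana for writing (illustrative)
      condition: >
        evt.type in (open, openat) and evt.is_open_write=true
        and fd.name=/usr/local/bin/banana
      output: "banana touched (user=%user.name command=%proc.cmdline file=%fd.name)"
      priority: WARNING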
So let's replicate the banana experiment, this time in a container. We'll do a docker run -it — somebody shout out the name of a container image — alpine? Does it have bash, or just sh? Thank you. Okay, there we go; we'll clear the screen and run the banana experiment again. And here, the first notice: a shell was spawned in a container with an attached terminal. Yeah, we'd probably want to know about that. We're now getting metadata from the Docker context, the Docker socket, and plumbing it in, so Falco is smart enough to start auditing and understanding what's going on there too. And notice that from the kernel's perspective it's the exact same behavior, because it's an isolated process running in a unique namespace; the kernel processes it exactly as if it were on the host system. By auditing at the kernel level — this is the big lesson here — the kernel is effectively convinced that what you're doing in a container is happening on a real, true, authentic Linux system. So this is great: regardless of whether you're in Kubernetes or on a plain Linux system, with Falco you can audit across the stack. You can take a Java process and get the same kind of metrics whether it's running in a container or not, and the same goes for Python or anything else.

So the next thing we're gonna do: I have kubectl aliased to k, so if I do k get pods, you can see I'm listing pods in my AWS cluster, and if I do k get nodes, you can see I have Kubernetes version 1.13 running in EC2, two nodes and one master. Let's list pods in the falco namespace: Falco is running as a DaemonSet, and we have two Falco pods, one on each of those nodes. What we're gonna do is tail the Falco logs, and you'll notice we're actually getting a lot of information out of Kubernetes right off the bat: k logs -n falco -l app=falco -f. You can see we've got a lot of things in here; let me clean this up a bit. Okay, three minutes, and then I'll wrap up.

You can see we're getting really valuable information about what's going on in my cluster — I demoed a bit of this before I walked on stage, so we already have some alerts — but we're gonna generate some new ones, and we'll get the Kubernetes metadata as well as the audit logs. So: k run nginx --image nginx — we did alpine before, we'll do nginx now — then k get pods, and we can exec into this pod: k exec -it, the name of the pod, /bin/bash... what's going on here? k get pods. Okay, let's try this one: k run alpine --image alpine, k get pods, k exec -it the name of our pod, sh... not found? Okay — I'm not gonna debug that live on the screen here, so we're just gonna look at some of what I did earlier. I have no idea; I'm gonna blame it on conference Wi-Fi, even though that doesn't make any sense. But we can see here that I did the banana experiment earlier: notice, a shell was spawned in a container — in this case I ran a different command; let me see if I can find it. Well, the point I'm trying to make is: if you look at one of these output logs, you can see we're able to get the container ID and the namespace it's running in. You can see I ran a container earlier called scary, and it's running: there's the container ID, the name of the pod is scary, the namespace is default, and we're getting information from the Kubernetes audit logs about who's doing this.
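(If you want the same DaemonSet setup at home, the usual route these days is the Falco Helm chart. Treat this as a sketch — the chart location and pod labels have moved around between versions, so check the falcosecurity docs:)

    helm repo add falcosecurity https://falcosecurity.github.io/charts
    helm repo update
    helm install falco falcosecurity/falco --namespace falco --create-namespace
    kubectl logs -n falco -l app=falco -f   # tail it like in the demo (label may differ by chart version)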
And the takeaway here is: you're starting to tell a story, with Falco auditing at the kernel level, combined with all the information we're getting from Kubernetes. So anyway, that's my little pitch on runtime security, and on how we want to audit at runtime to understand problems happening in Kubernetes, or on a Linux system, behind the scenes. We've open-sourced this because we think it's a good idea and we want folks to get involved, and we were very, very surprised when our friends over at Frame.io reached out and began telling us how they had taken this technology and implemented it on their end. So I'm gonna let Abhinav get up here and show you a little of what they've been able to do with the same basic piece of technology, now that you understand how it works. Yeah, you can clap.

[Applause]

While he gets his session set up, does anybody have any questions for me? Here you go — yeah, what's up? So, it depends, it really does. It's obviously directly correlated to the amount of noise your kernel is generating, but usually, with a sixteen-megabyte capture, I'd say you could probably look a good 60 seconds back in time, on a relatively average-sized computer with a relatively average-sized workload. But that's the down-the-middle answer; depending on what you're doing, expect it to go greater or less. Any other questions? Yep, one right here: have we thought about structured output from Falco? So yeah — if you looked at the standard out on the screen, every one of our rules, defined in YAML, has an output string with variables in it; it works like a templating language. That was just the tip of the iceberg: we only pulled maybe six or seven fields into that output string, when we obviously have a lot more in memory. If you use the client instead, you get the full object with all of the information baked in, and you can do whatever you want with it.

Good afternoon, guys. My name is Abhinav Srivastava, and I'm the VP and head of information security at Frame.io, a New York City startup. Before joining Frame I spent six years at AT&T Research, working on cloud security, SDN, and other projects, and before that I spent five great years at Georgia Tech working on my PhD in computer science, where I worked on virtualization security, operating systems, and systems security. Today I'll be talking about how we enhanced Falco by building custom tooling around it and customizing it to our environment. You've seen Kris's hacking style; now let me take you to the world of marketing slides.

So, what is Frame.io? Let me give you a two-minute overview. Frame.io is a video review and collaboration platform: we let users upload their media content, their work-in-progress content, to the platform, then invite team members and clients to collaborate on the project. We are essentially building the video cloud. We are not a creation tool, we are a collaboration tool — think of us as a GitHub for videos — and we are entirely hosted in Amazon AWS. The first thing we enable: teams can communicate clearly. They can collaborate on the video itself, not just about the video, and not in long email threads: users can annotate the video and insert comments on the video itself, embedded with the frame, and you can easily
click on those comments and navigate to find out who is doing what. The second thing: we act as a secure hub for film productions, so teams can control who can see what, and what they can do on the platform, and it also lets them easily organize the different versions of a single video. Finally, we let teams share content, because sometimes the collaborators you work with are outside the platform — sometimes executives want to see a cut of the video before it gets released — and you can use the secure sharing feature to share content with external partners, and decide and customize what they see and how they see it. As I said, we are not a video creation tool, we are a collaboration tool, but we've integrated with a variety of video creation tools — for example Adobe Premiere, After Effects, Media Composer, DaVinci Resolve — and we have a native integration with Final Cut Pro X as well. This value proposition has enabled us to accumulate a great set of logos over the past few years, a bunch you'll all recognize, and these are all enterprise customers.

So that's a high-level overview of the platform; now let's see how it translates into the actual platform that enables all these features. We are a containerized platform; we use an EKS cluster and an ECS cluster, and as I said, we're entirely hosted in Amazon AWS. When a user sends a request, it goes to AWS Web Application Firewall, then to a load balancer, and depending on the request type it hits either the EKS cluster or the ECS cluster. Behind the scenes we use Postgres as our database for storing transactions, and all uploaded videos are stored in an S3 bucket. We process two kinds of workloads — this is very important for the later slides. The first kind is web application requests: a user sends a request to create a project, add a team member, those kinds of simple API exercises. The second, specialized kind is video transcoding: any time a customer uploads high-resolution content, we take that video, transcode it, and generate a lot of low-resolution proxies, so that even clients on resource-constrained or bandwidth-constrained devices can still collaborate on the video. That is an entirely separate pipeline, running in the ECS cluster.

With that background, let's look at some metrics related to containers. Every day we boot almost three hundred thousand containers. We don't know whether that's too many or too few — we haven't checked with other industries — but on a peak day we run three hundred thousand containers, which roughly translates to thirteen thousand containers per hour. As I said, these containers are either processing web requests or doing transcoding jobs, so their lifespans range from seconds to hours, depending on the length of the video being uploaded: a very short video transcodes quickly, and that container may live for just a few seconds, but a very long video may take hours to transcode, and those containers run for hours.
And any container that's processing user web requests is a long-running container. A fun fact: every hour we do ten days' worth of work, because we run that many containers, all concurrently. So we are a containerized platform running that many containers — how do we protect them?

For that, we have two requirements: we want complete visibility inside our containers, and we want them to be secure, so that if an attack is going on, we know. What do I mean by visibility? You'd like to know what processes are running, what files are being opened inside the container, and what connections are being created — because that's how we find out if somebody has uploaded something that compromised a container, or some user request has bypassed our firewall and compromised our containers. You want visibility inside the container to know what's happening. Security and visibility go hand in hand — the same questions can be asked from the security perspective — but a few security-specific questions we have: is somebody running privileged containers? Because, as the old adage goes, if a privileged container is compromised, your host can basically be compromised as well. And what kinds of containers are executing? We vet all the images that run through our ECR registry, but is somebody running containers we haven't vetted, that haven't gone through our scanning system, and that could be compromised?

We tried a few tools to solve these problems — vendor-based solutions and open-source solutions. My background is in systems security; in my PhD I worked a lot on system call monitoring and detection systems, so we looked at a lot of tools to find out which could support our environment, and we decided to use Falco. Falco, as Kris explained at length, is container-native runtime security; it's designed for containers, and it can provide the things we needed. Without repeating too much of what Kris covered: Falco has three main components. A rule and alerting engine that can tell you something wrong is happening on the platform; the agent, which receives the system calls, creates the metadata related to containers and events, and gives you a very clear picture of what's happening; and the kernel module or eBPF-based interception method that actually captures those system calls.

So that's what Falco provides: Falco can give you alerts when something bad is happening. It doesn't give you the raw events, but it can tell you a shell was opened, as Kris demoed. We had a few different requirements, though. We wanted raw events: we didn't want just alerts, we wanted the actual system calls. The idea was that if we had the actual system calls, we could build a lot of tooling around them — machine learning and AI-based systems sitting on top of the system calls to give us a lot more attack-detection capability, rather than just signature-based alerts that a shell was spawned, or a file was accessed, and so on. And we wanted to enrich those events.
Falco does a pretty good job of collecting metadata related to an event, but we wanted more context associated with our system calls: for example, what kind of cluster it's running in — is it ECS or EKS — which VPC it was running in, which load balancer is accepting the request, and so on. We needed a lot more context around those events. We also wanted an easy rule-update process. The Falco rules sit on each machine, each instance; to update them, either you use Ansible scripts to update the rules at runtime — which, if it goes wrong, can crash a cluster or a VM — or you rotate the virtual machines every time with a new golden image containing the new rule set. We didn't want that kind of setup. And we wanted a centralized alerting engine: Falco has a rule engine, but we didn't want to use it; we wanted one rule engine across all services — a single engine that could support GuardDuty, AWS Config, Falco, Inspector, all those services.

Those were the hard requirements we were looking for in a tool. Beyond those, we had a few soft requirements. We didn't want to manage our own cluster to support all the extra features we wanted — otherwise we'd have to manage and patch those VMs and worry about our own cluster. We wanted it to be very cost-effective: we didn't want to build out big infrastructure and blow up the cost. We didn't want to deal with scalability issues — no auto-scaling-group setups and so on. And last but not least, we wanted a very easy way to look into alerts and pivot to errors.

Before going into the hard requirements — the raw events and all that — let's look at the soft requirements a little more. To realize them, we decided to use serverless technology. We're in AWS, so we started using Lambda functions very heavily. Function-as-a-service emerged over the last two or three years. If you look at the evolution of the offerings: in the beginning, cloud providers offered the infrastructure-as-a-service model, where they managed the hardware and hypervisor, and customers managed the virtual machine, operating system, and application. That model changed a bit in the world of platform-as-a-service, where cloud providers started managing the VM and OS as well, and containers and applications were the customers' responsibility. And the model got pushed to an extreme in the world of function-as-a-service, where cloud providers manage the hardware, hypervisor, virtual machine, operating system, and the runtime, and the customers, the tenants, are just tasked with bringing their own code. If you want to run a Python-based or Java-based serverless function, you don't have to worry about setting up the JVM or the Python environment; you just bring your own code, tell the cloud provider the entry point of your function, and it will start invoking it. So we decided to use serverless technology to realize our requirements, and the benefits of serverless translate directly to our soft requirements.
So we decided to use serverless technology to realize our hard requirements, and the benefits of serverless translate directly to our soft requirements. With serverless you don't have to manage any servers — everything is managed by the cloud provider. You get continuous scaling: again, you don't have to worry about it; Amazon or Google worries about scaling those functions, so at peak time, if you need to run a thousand functions concurrently, the cloud provider will run a thousand functions concurrently for you. You get fine-grained metering: you only pay when the functions are invoked, so at off-peak times, if the functions are not being invoked, you don't pay a dime; you pay only when they run. And you get faster time to market: because you bring your own code, you don't have to worry about setup and infrastructure issues, so it's very easy to get yourself up and running.

There are several models for running serverless functions. The default, very common one is called the event-driven serverless model. The idea is that these Lambda functions — serverless functions — subscribe to some resource and then wait for events to be triggered. As soon as events are triggered, the functions get the context for those events, and with that context they get the information necessary to process them. For example, a Lambda function can subscribe to an S3 bucket and wait for put-object operations; as objects are inserted into that S3 bucket, an event notification is generated (via SNS), the Lambda function is invoked, and it receives as context the information about the object that was uploaded. The function can process that object, do whatever it is supposed to do, and then go back to sleep. We use this event-driven model a lot in the extensions we built for Falco, and there are many other ways to invoke serverless functions as well. We use serverless heavily across the company — a lot of our transcoding pipeline is written as serverless functions, and we use them a lot on the security team too. We actually came up with six design patterns and wrote a research paper about them last year at a conference called HotCloud, so if you want to read more about it, check it out.
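As a rough illustration of the event-driven pattern just described — a function subscribed to S3 put-object events — a handler might look like this. This assumes the plain S3 notification shape (an SNS-wrapped event would add one more layer of unwrapping), and the actual processing is left as a stub:

```python
# Sketch of the event-driven pattern: a function subscribed to s3:ObjectCreated
# events receives the bucket and key of each uploaded object in the event.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        obj = s3.get_object(Bucket=bucket, Key=key)
        payload = obj["Body"].read()
        # ...process the uploaded object here; when the function returns,
        # it simply "goes back to sleep" until the next event arrives.
```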
So let's look at the security components we built around Falco. There are three main components. The first we call Falco-on-host — basically the way Falco runs, though we changed a few things about it. Then we developed a system called Driftwood, the analytics system that consumes the data generated by Falco and processes it. And then we developed an alerting engine called Bobby. (The Bobby logo looks crappy because I had to change it at the last minute for copyright reasons — we were using a different one internally, and I was told I couldn't use it in this talk.)

Let's look at the Falco-on-host architecture. This is the glorified version of what Kris showed you. It's a single EC2 instance running a bunch of containers, and all the containers are generating system calls — a system call is how you cross from user mode into kernel mode. The system call goes to the kernel, the kernel module intercepts all those system calls, and as it collects them it pushes them up through a ring buffer to the Falco agent. The Falco agent's job is to take these system calls and gather the metadata associated with them — which container the call is coming from, what the image name is, and so on. It also reads a rule file; I'll explain what the rule file does, but in simplified terms it tells Falco which system calls to collect and which to drop. Based on that, Falco decides whether a system call needs to be collected, and if so it writes it to a local JSON log file. Each instance also runs a CloudWatch agent, whose job is to read this file and push the data to AWS CloudWatch Logs.

This is the setup even if you use Falco in the default way — you can use the eBPF probe instead of the kernel module, but you will still have a Falco rule file and a Falco agent. Our design differs in only a few ways. First of all, our rule file does not contain any alerts. If you use Falco in vanilla mode, the rule file contains all the alerts Kris showed you — somebody creating a shell within a container, the /bin directory being written to, those kinds of alerts. Our rule file contains none of those; it just contains a few instructions telling Falco which system calls to collect and which not to, along with some filters to reduce the volume of data coming out of Falco. Then we use the CloudWatch agent to read the data and push it to CloudWatch Logs. So which system calls are those? We wanted to capture process creation, file operations, and network connections, so we monitor process creation (execve), file opens (the open system calls), and network connections (the connect system call). We also monitor calls like rename, renameat, openat, unlink, and unlinkat — the kinds of operations that can tell you whether somebody is doing something anomalous. But again, the rule file just specifies which system calls to collect.
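For a sense of what a collection-only rule file might look like — rules that match the system calls of interest and emit everything, leaving all alerting to the backend — here is an illustrative sketch in Falco's rule syntax. This is not Frame.io's actual file, just the shape of the idea:

```yaml
# Illustrative collection-only rules: match the syscalls of interest and emit
# every event with its container metadata; no behavioral alerting here.
- rule: Collect process creation
  desc: Emit every execve with container context
  condition: evt.type = execve and evt.dir = <
  output: >
    exec (proc=%proc.name cmdline=%proc.cmdline parent=%proc.pname
    container=%container.name image=%container.image.repository)
  priority: INFO

- rule: Collect file opens
  desc: Emit open/openat events
  condition: evt.type in (open, openat) and evt.dir = <
  output: "open (file=%fd.name proc=%proc.name container=%container.name)"
  priority: INFO

- rule: Collect network connects
  desc: Emit outbound connections
  condition: evt.type = connect and evt.dir = <
  output: "connect (conn=%fd.name proc=%proc.name container=%container.name)"
  priority: INFO
```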
So now Falco has started generating data, and the data comes in at very high volume — I'll show you some numbers. Next, Driftwood comes into the picture. Driftwood's job is to collect the data generated by Falco, enrich it further — because one of our requirements was more context associated with the data — and then visualize it: create graphs, charts, and so on. The Driftwood pipeline is entirely serverless. The goal was to build a generalized system we could use not just with Falco but with other services as well. Driftwood is our analytics and threat-intelligence cluster for all of our data: ELB logs, VPC flow logs, audit logs — all kinds of logs go through the Driftwood cluster, and the Falco logs are one of the inputs. We also have a multi-account strategy — we have multiple AWS accounts — so on the left side you can see the different accounts each running CloudWatch, because each of those accounts can run VMs and containers, and they all push data through this pipeline. So CloudWatch is storing Falco data, web application firewall data, and more data sources, and all these accounts send their data to Kinesis Firehose — think of it as an AWS message bus. When the data reaches Firehose, Firehose stores it in an S3 bucket, and that S3 bucket has an event-driven Lambda associated with it: as you can see, S3 generates SNS events, and this event-driven Lambda — we call it the enrichment Lambda — is invoked every time a log is written. When a log is written, the enrichment Lambda takes it, enriches it (I'll show you what kind of enrichment we do), and stores it in Elasticsearch; a copy of the enriched data is also stored in another S3 bucket for alerting purposes. The enrichment is customized for each type of data and data source — WAF data, for example, goes through a different enrichment process than Falco data — and based on the data type we can decide at runtime which processors to use to enrich the data. Some of the enrichment we do: first of all, IP-to-geolocation mappings — we take the IPs in the data and associate geolocations with them. Then IP-to-resource mappings — for each IP we see, we can tell whether it belongs to S3, whether the traffic is going to a load balancer, and so on. Then instance and cluster type — whether the instance belongs to an ECS cluster or an EKS cluster, whether it is part of the transcoding pipeline or the Web API pipeline. We also attach the VPC ID and similar fields to the data. So Driftwood fulfills our requirements for raw data and enriched data.
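A minimal sketch of what such an enrichment Lambda could look like follows. The helper functions, the payload shape, and the bucket name are hypothetical stand-ins for illustration, not Frame.io's actual code:

```python
# Minimal sketch of the enrichment Lambda (illustrative only): invoked per log
# write, it decorates each record with extra context and fans the result out
# to Elasticsearch plus a second S3 bucket used by the alerting engine.
import json
import boto3

s3 = boto3.client("s3")

def geoip_lookup(ip):
    # Stub: a real system would consult a GeoIP database.
    return {"ip": ip, "country": "unknown"}

def resource_lookup(ip):
    # Stub: map an IP to the AWS resource it belongs to (S3, ELB, ...).
    return "unknown"

def cluster_lookup(instance_id):
    # Stub: decide whether the instance is part of ECS, EKS, transcoding, etc.
    return "unknown"

def index_into_elasticsearch(record):
    # Stub: a real system would index the record into the SIEM cluster.
    pass

def handler(event, context):
    for notification in event.get("Records", []):
        record = json.loads(notification["Sns"]["Message"])  # assumes SNS wrapping
        record["geo"] = geoip_lookup(record.get("src_ip"))
        record["aws_resource"] = resource_lookup(record.get("dst_ip"))
        record["cluster_type"] = cluster_lookup(record.get("instance_id"))
        index_into_elasticsearch(record)
        s3.put_object(
            Bucket="enriched-falco-logs",  # assumed bucket name
            Key=f"{record.get('instance_id', 'unknown')}/{context.aws_request_id}.json",
            Body=json.dumps(record),
        )
```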
Some numbers. On a peak day we process close to 240 million raw events through this pipeline — these are just the raw system calls we collect every day. On a weekend day, a Saturday or Sunday, it comes down to 75 or 76 million. It runs at roughly 10 million events per hour, and the standard deviation is almost flat — it's not that we see 100 million events in a single hour of the day; it's very consistent. So we are processing close to 250 million events every day that flow through Driftwood and get stored in Elasticsearch and the S3 bucket. And this is what a single record looks like. For one open system call — the kind Kris showed you — this is everything we collect: which instance it came from, its private IP, the instance type, container name, container image and repository, VPC flow ID, subnet ID, what file was opened, what process opened it, what the parent process was, the PIDs, and so on. As you can see, it's quite a bit of information.

So now we have data generated by Falco and processed by Driftwood, stored in Elasticsearch — we use Elasticsearch as our SIEM — and also stored in an S3 bucket. The next step is to use the data we collected and processed for actual alerting, so we get security value and visibility when something bad is happening. For that we developed — again, entirely serverless — an alerting engine for the cloud. The idea is that Bobby is a centralized alerting engine that we use for all our services, and again, we developed it in-house. It can take multiple input sources — GuardDuty, AWS Config, Inspector, Falco, WAF all send their alerts through Bobby. (The name comes from "bobby", the British nickname for a police officer; my developer was from the UK, so he used that name.) It has multiple output channels as well — it can send alerts to various places like Slack, OpsGenie, Elasticsearch, and PagerDuty; it has all those integrations. And it allows very easy whitelisting and throttling: if the same alerts keep coming and you know they're benign, you can whitelist them, and if, say, you have a hundred VMs that all see the same event and all send an alert at the same time, you can use the throttling feature to hold them back.

This is the Bobby architecture — again, lots of serverless, event-driven Lambda functions. Recall that we store the enriched Falco data in an S3 bucket; Bobby points at that bucket, and as soon as data is stored there it again triggers an event-driven Lambda, which invokes the rule-engine Lambda. The rule-engine Lambda's job is to take the data sent by Driftwood and read the rules from our alert file — a centralized alerting database where all the rules are crafted. The rules are very similar to the rules you see in a Falco file: a command ran out of /tmp, a shell was created, crypto mining is happening, and so on. The rule engine gets the data from Driftwood, gets the rules, applies them, and decides whether it needs to alert the security team or not. It also checks against the set of whitelists we have developed over time, to decide whether the alert should be generated at all. If it's not in the whitelist — if it's a new alert — it is sent to the various outputs (Slack, OpsGenie, Elasticsearch) with the appropriate priority: if it should wake somebody up in the night, it's a P1 or P2; otherwise you can look at it when you're in the office. And from Slack, when you get a message, you can actually take action on those alerts: you can whitelist them or ignore them. When you press "whitelist", a webhook fires from Slack, goes through API Gateway, and invokes another Lambda, and that Lambda stores the whitelist entry in DynamoDB, so the rule engine can consult the DynamoDB whitelist and, when the same alert happens next time, it doesn't have to alert us.

This is what a rule in the config file looks like. It's based on an object-path language. Say the alert is "command run out of tmp": there are conditions — the system call needs to be execve, which is the event type, and we match on the execution path — and the notify section lists the notification channels, like OpsGenie or the Bobby alerts channel on Slack. All the fields at the bottom tell you which combinations of whitelisting we support for this alert.
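A rough sketch of the rule-engine idea — match enriched events against centrally stored rules, consult the whitelist in DynamoDB, and notify — might look like this in Python. The rule shape, table name, payload fields, and helpers are assumptions for illustration:

```python
# Rough sketch of the Bobby rule-engine Lambda: match enriched events against
# centrally managed rules, consult a DynamoDB whitelist, and fan out alerts.
import hashlib
import boto3

dynamodb = boto3.resource("dynamodb")
whitelist = dynamodb.Table("bobby-whitelist")  # assumed table name

RULES = [
    {
        "name": "command-run-out-of-tmp",
        "priority": "P1",
        "channels": ["slack", "opsgenie"],
        "match": lambda e: e.get("evt_type") == "execve"
                           and e.get("exe_path", "").startswith("/tmp"),
    },
]

def whitelist_key(rule_name, event):
    # Assumed key scheme: hash the fields that identify a benign occurrence.
    fields = (rule_name, event.get("proc_name", ""), event.get("exe_path", ""))
    return hashlib.md5("|".join(fields).encode()).hexdigest()

def send_alert(rule, event):
    # Stub: a real implementation would post to Slack/OpsGenie/Elasticsearch.
    print(f"[{rule['priority']}] {rule['name']}: {event}")

def handler(event, context):
    for record in event.get("enriched_records", []):  # assumed payload shape
        for rule in RULES:
            if not rule["match"](record):
                continue
            key = whitelist_key(rule["name"], record)
            if whitelist.get_item(Key={"id": key}).get("Item"):
                continue  # previously whitelisted by an operator; stay quiet
            send_alert(rule, record)
```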
And this is how an actual alert is seen on Slack by our security team. Since it's a P1 — "command run out of tmp" — somebody may be getting woken up in the night. The account is production, the instance name is shown, the process is the AWS logs agent, and you can see the execution path where it ran. The alert fired because this Python program was executed inside the /tmp directory, which was not supposed to happen; given that it's the AWS logs agent doing its job, it's a false positive. We do have the option to whitelist this alert, and you can see the entire command that was executed. For P1s, though, we deliberately don't surface the whitelisting options, because we don't want a culture of whitelisting P1s: we want to go look at every P1 and do a complete postmortem of why it happened and who did what, so that we don't end up with P1s that have to be downgraded to P4 — and if something really is a P1, we should make sure it cannot happen again.

Let's look at a different alert. We have one called "unexpected process spawned", and since an unexpected process could be anything, it can be noisy — that's why we keep it at P4. Here it says this Kubernetes VM ran a program we hadn't seen before. The program itself isn't particularly security-sensitive, but we had never seen it on this VM, and this is a development VM; when we see alerts from development VMs, we go ahead and either whitelist them or take some action, so that we learn from the behavior and the same behavior doesn't alert when it happens in production. Here you can see the whitelisting options — there are three, and they change depending on whether it's a networking alert, a file alert, or a process alert. This is a process alert, so we let you whitelist by process name alone, in which case the process and execution path are used. If you choose process and parent, then the process, execution path, and parent fields are used to generate an MD5 hash; the next time the same alert comes in, those fields decide whether it's in the whitelist or not. And you can also go with parent process plus command line — that's the most extreme option, for when you only want to match one very specific behavior and otherwise still get alerted on that event.

So that's how, by changing the way Falco runs on the host and adding Driftwood and Bobby, we changed the way Falco works in our environment. We have some other use cases beyond security alerting and visibility. We use Falco a lot during debugging: because we are collecting raw system calls, when something happens we can go look in our Elasticsearch logs and see who ran what command at what time. We use it a lot during game days and red team/blue team exercises — when somebody runs something, we watch the Slack channel to verify that the alerts are coming, and since we can see the raw events, we know what is happening. And it helps during chaos engineering as well, when we want visibility into what is happening.
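That debugging use case — asking "who ran what command at what time" — amounts to a query against the enriched index. A hedged sketch with the Elasticsearch Python client follows; the index name, field names, and endpoint are assumptions, not Frame.io's actual schema:

```python
# Sketch of the debugging workflow: because raw syscalls are indexed, you can
# ask Elasticsearch which processes executed in the last hour, newest first.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://siem.example.internal:9200")  # assumed endpoint

resp = es.search(
    index="falco-enriched-*",  # assumed index pattern
    query={
        "bool": {
            "must": [
                {"term": {"evt_type": "execve"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    sort=[{"@timestamp": "desc"}],
    size=50,
)
for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("@timestamp"), src.get("proc_name"), src.get("cmdline"))
```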
That's pretty much what I had — questions? Yes — actually, I'm supposed to come to you, so I'll do this. Oh yeah, sure — yes, I'll upload the slides; we were editing them until the last minute. Any other questions? Yes, here. That's a good question — how long did it take to build? One developer, three months. How fast do alerts show up? A few seconds — it's very fast, because everything happens in near real time. The only latency we have is the latency inside each of those services: as soon as Falco writes its logs, things happen pretty fast — you can watch the files fill up very quickly and get pushed to CloudWatch Logs. It depends on the type of alert you're asking about: for some alerts you'll see a few seconds, maybe a few seconds more in between, depending on how many events are being generated at the time — if you're right at the edge of the log you see it very fast, otherwise it takes a bit — but we never see anything go beyond a minute.

Are false positives a real problem? Right — I think that's where you need a good setup where you can run the same system in development that you run in production. As for Falco's false positives: it's not a machine-learning system, it's a signature-based system, so it comes down to your appetite for what you want to see. For example, our "unexpected process spawned" alert is a false positive by its nature, because an unexpected process isn't in itself an attack; but we wanted that visibility, which is why we have the alert and why we keep it at P4 — we don't want to be woken up at night just because an unexpected process spawned, because we have other, more targeted rules in place that tell us when something genuinely bad is happening. So it depends on the kind of rules you write. Rule crafting is very important in any security tool: if you craft a very generalized rule saying any write into /usr/local/bin should be notified, you'll find that most of the time even AWS agents and other benign programs — even a yum update — perform a lot of operations in that directory. That means you either craft your rule to look for very specific operations, or you develop your whitelist over time so you know not to worry when yum does it or when rpm does it — and that's how you narrow things down and reduce the false positives. For example, we narrowed "unexpected process spawned" down to about one alert a day, because we trained the system so much during the development phase — every time something happened, we went in and whitelisted it — and that's when the system stabilized completely.
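To make the rule-crafting point concrete, a narrowed version of that "writes below /usr/local/bin" rule might look roughly like this in Falco's rule syntax — a sketch that carves out package managers rather than alerting on every write, not an official rule:

```yaml
# Illustrative narrowed rule: alert on writes below /usr/local/bin, but
# exempt known package managers so routine updates stay quiet.
- rule: Unexpected write below /usr/local/bin
  desc: Detect writes below /usr/local/bin by anything other than a package manager
  condition: >
    evt.type in (open, openat) and evt.is_open_write = true
    and fd.directory = /usr/local/bin
    and not proc.name in (yum, rpm, dnf)
  output: "File written below /usr/local/bin (file=%fd.name proc=%proc.name)"
  priority: WARNING
```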
Where does the rule file live? By design, it lives with the Falco agent, yes. Let me add a little bit, looking to the future of how we're managing rules. Right now we're using Lua and YAML to encapsulate the rules. The next big push over the next six months of the Falco open-source community is that we're going to pull out a lot of the native rule parsing — the part that looks for configuration on disk — and expose it via an API. The pattern is going to be: as Falco starts, it starts with no configuration and is effectively just a very empty loop doing virtually nothing; then you'll be able to start sending rules to it dynamically using a client of your choice, and you'll be able to create rules that way. And of course we're going to keep backwards compatibility by having a simple, lightweight CLI program that parses rules as they're written today and sends them along accordingly.

Now, this is a big design decision — I'll get back on stage for this one, because it's going to take me a second. This is a big design choice that we made on day one, and I'm going to draw a quick comparison: SELinux is to the kernel as prevention is to Kubernetes. This is what I was talking about earlier when I mentioned prevention. By design, Falco is not in the business of preventing anything from happening. We don't want to confuse surveillance with access control — those are two independent systems that need to work independently of each other, and both are necessary for a secure system. The pattern we see most folks following is: detect some anomalous behavior, understand what happened, work out what the attack vector was, what the attacker did, and what they were after, and then — through an automated or a manual process — respond to that event and generate prevention policy going forward. If you would like to use Falco to, for instance, kill a pod, draft some OPA policy, email someone, destroy a cluster, or nuke a VM, you certainly could wire it up to do that. Most of the time, though, what we see is Falco feeding directly into a stateful, aggregated time-series database, with some sort of response-on-detection layer on top of that. At Sysdig — the company I work for — we have patterns we suggest in some of our offerings for how one could respond to these events, but in the Falco open-source project we try to keep the scope as limited as possible: it's basically a surveillance tool. And you can achieve some of this automation yourself: in our design, if you decide on Slack that an alert really is an attack, you can run a Lambda function that goes and kills the pod or takes some other remediation action, the way automated responders do.
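That Slack-to-remediation flow could be sketched as a small Lambda like the one below. It uses the standard Kubernetes Python client rather than anything Frame.io-specific, and the event fields and the way cluster credentials reach the function are assumptions:

```python
# Hypothetical remediation Lambda wired to a Slack action: delete the pod
# named in the alert payload.
from kubernetes import client, config

def handler(event, context):
    # In a real deployment you would load credentials for the target cluster;
    # inside a cluster this would be config.load_incluster_config() instead.
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pod = event["pod_name"]                  # assumed field set by the Slack webhook
    namespace = event.get("namespace", "default")
    v1.delete_namespaced_pod(name=pod, namespace=namespace)
    return {"remediated": f"{namespace}/{pod}"}
```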
And if you look behind me — there's a train going by behind me — if you look in the Falco repository, we have a tool called falco-sidekick and we also have a Prometheus exporter. As soon as we shipped the Go client for the gRPC endpoint, folks began contributing very lightweight, small programs to do arbitrary tasks: the Prometheus exporter basically relays the information to Prometheus, and falco-sidekick can connect to about a dozen different well-known open-source tools and take action from there. So if folks are interested in contributing or in using the client, we would love to get you involved with the Falco community. Will we contribute Driftwood and Bobby? Yes, we are actively thinking about it — we have to go through the process, we'll be talking with Kris, and we have to do some internal deliberation on it as well. (We want to try to get them to donate it to the Falco project, but that's our own secret agenda here.) Sorry — yep, last one; one or two more. What do you mean? Right — so right now serverless is the glue component that is taking the actions, but yes, maybe that's another reason for us to collect raw events: to be more creative. We can do a lot more because we have the raw events — we can correlate with the VPC flow logs, with the web application firewall logs, to build a better correlation model, a more holistic picture of what is happening in the cloud. We could stitch events together; we haven't done that yet, but it's on our roadmap.

The next question was: can you do anything with eBPF that would have the same effect as the kernel module? eBPF is read-only, so we're not able to implement new logic in the kernel with eBPF. We're able to read virtually any piece of address space in the kernel, and all the code does is echo that up, in a meaningful map, to userland. With the kernel module, by contrast, we could actually go in and change things in the kernel, mutate memory — there are a lot of things we could do that could potentially harm the system. That's why eBPF is getting the hype it is: it's fundamentally safe, because it's read-only. Which one are we running? Right now we're running in kernel-module mode — Falco ships with both the kernel module and the eBPF designs, and yes, we are using the kernel-module design. One of the things we're doing right now — we're in the CNCF sandbox, as I mentioned, and we're moving to the incubation vote for Falco over the next month or so, and hopefully we'll get it — is moving over the infrastructure that builds these kernel modules for different kernels and operating systems (there are many permutations, as I'm sure you can imagine), building them, storing them, and hosting them publicly for folks to download. There's actually a hook in the Falco code: when Falco starts, it checks whether the /dev device is mounted, and if it's not, it will attempt dynamic URL generation to pull the module out of S3 and install it on the system using dynamic kernel module loading. Cool — thank you so much. Thank you.
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 2,605
Id: Z4POV5IXnHQ
Length: 92min 37sec (5557 seconds)
Published: Fri Nov 22 2019