Real-Time Threat Hunting - SANS Threat Hunting & Incident Response Summit 2017

Captions
(chords ringing) (audience clapping) - I've been a threat hunter for a long time. Specifically, before I was with my current employer, I worked with one of those consulting firms. We'd do hunt missions, right. We'd go out, and we'd typically spend somewhere between two weeks and a month with an organization. We'd spend a half day on the principles of hunting, and then we'd hunt with them in their data sources, right. Because hunting is just one of those things that you really need to learn by doing. Right, experience is where you get it.

And it was there that the idea for this really germinated for me. Because we've got all of these use cases. Several have been alluded to, right, where time and again, from your experience as a hunter, you find that, oh, the bad guys love to use this technique. And so for this technique, I go look in this data source, and in this data source, I look at these places. Use cases based upon, as Rob talked about in his talk, the intel on the actual adversaries, right. Well, that's where machine learning can benefit us. I am not a data scientist. I do not claim to be a data scientist. But what I'm hoping to convince you of today, and I'm actually releasing a new tool to help with this, and there's another one out there that I'll also refer to, is that machine learning has come to the point where we, as professionals, can run it as a black box. You do not need to be a data scientist anymore. And hopefully I will convince you of that.

But I'm gonna start off with a quiz. I know it's not lunchtime yet. I apologize. But somebody must recognize what this looks like. What's this a Wireshark capture of? How 'bout this one? Anybody recognize this one? This is our old friend, Hancitor. And then how 'bout this one? This would be good old Kovter dropping Locky on your environment.

So, the reason I brought these up is because in this room, we probably have a cross-section of some of the top talent in the world at incident response. And even we, looking at these Wireshark captures, are hard-pressed to say what this is, how do we use this, right. That's what I think is the root challenge if you're at most organizations and you want to get rolling in cyber hunting. You've got some challenges. One, you gotta find experienced practitioners. Because I can tell you that between an experienced threat hunter and a non-experienced threat hunter, your results are going to be night and day. Two, it really takes a long time to develop these skills. I would consider myself a moderately capable threat hunter, and I've been focusing on this for at least a decade. And often we have too much data to look through. I work at an organization with 350,000 employees. We have hundreds of thousands of computers. I have a Bro sensor grid of over 2,000 sensors. 200,000 would be really impressive. 2,000 Bro network sensors. That's a lot of data to sift through.

And so what we do in our case is we have a weekly hunt mission. This is one of the things we've found works really effectively, and one of our team members is actually gonna talk about this tomorrow in a lot more detail, so I don't want to steal his thunder. But we rotate folks through. So we have a week where a portion of the CSIRT team, and some of our other groups that support us, spend the week just hunting on a particular hunt mission. Then they cycle back around, so each particular use case gets hunted in turn. So maybe last week, we hunted OWA logs for indicators of particular threats.
By the time they circle back around to those OWA logs, it might be three months from now, just because of the sheer quantity.

Then on the other side, there are some really, really strong benefits from hunting. One of the questions earlier, I think for Rob, was, well, how do you quantify the value of threat hunting? I personally think the output of finding the unknown malicious activity is a secondary benefit to threat hunting. I really do. I think the really strong benefits of threat hunting are, one, it drives continuous improvement in our detection capabilities. Right, if we find something in a threat hunt, then we should immediately be pivoting to, how do we find this on an ongoing basis? How do we find this on a constant, iterative basis? And, I think ultimately most important, because if you didn't get from Rob's keynote that the most important thing is the people, it's an incredible mentoring vehicle. So in our case, we have one of our really experienced incident handlers, our top-level analysts, leading the hunt mission, backed up by a level two and a level one. Not because the experienced person really needs the level two and the level one, but because it's just a fantastic way to help the level one analyst become a level two analyst, and the level two analyst become a level three analyst. 'Cause we have to grow our capabilities.

So, that's where I see the real opportunity for machine learning. Now, I'm gonna differ from all of the vendors that you might have talked to at RSA and some of these shows that say artificial intelligence and machine learning are these magic silver bullets that are gonna solve all your problems. That's bull pucky. Just gonna put that out there right now. There's nothing magical about machine learning. It's just giving a computer an ability to do some of the type of analytical work that we do. And then specifically, I think, taking particular use cases. So for instance, the tool that I'm gonna demo here in just a second, which I've called Assimilate, 'cause somebody keeps referring to my team as the Borg Collective. Assimilate is going to ingest raw HTTP headers, looking for unknown malicious activity. Right, and then it kicks out, hey, this is interesting. Now I want to be clear here. This is not intended as a replacement for your IDS stuff, your other detection systems. This is a hunting tool. What machine learning puts out is going to be useful for you as a hunter to go through. This is not gonna replace people, anything like that. But what this does is it allows you to take specific use cases. The last talk, where he talked about taking those use cases and defining them so they become inputs for further work, that's what this allows you to do.

So enough of me pontificating. Let me switch over to a demo here. So this is an actual screen capture. I did a screen capture because I was trying to avoid the... And it looks like our resolution is off a little bit. But I think we'll still get the main thing. So in an attempt to stymie the demo gods, what I did was I just screen captured. So I'm running Assimilate here. You notice I've got a directory of a whole bunch of Bro logs. I mentioned I've got over 2,000 Bro sensors, so I have pretty sizeable Bro logs to work with. We get a few terabytes a day, which is kinda handy. And so what I'm doing here is I'm running Assimilate Assess. So there's two components to machine learning. First, you've got to train a model, okay. And then you've got to apply that model against your data.
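As a rough illustration of those two components, here is a minimal sketch of the train-then-apply pattern in scikit-learn. It is not Assimilate's actual code; the sample data, feature choices, and filename are all hypothetical, and he names the specific algorithm a little further on:

```python
# Minimal sketch of the two components described above: train a model once,
# persist it, then apply it repeatedly against new data. Purely illustrative,
# not Assimilate's actual code; sample data and filenames are hypothetical.
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Phase 1: train on labeled header strings and save the model to disk.
train_headers = [
    "Host,User-Agent,Accept,Accept-Encoding,Connection",  # typical library order
    "user-agent,HOST,accept",                             # hand-rolled oddities
]
train_labels = [0, 1]  # 0 = normal, 1 = malicious

# Character n-grams preserve the casing and ordering quirks discussed later.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
model = MultinomialNB().fit(vectorizer.fit_transform(train_headers), train_labels)
with open("headers.model", "wb") as fh:
    pickle.dump((vectorizer, model), fh)

# Phase 2: load the saved model and score new traffic as it arrives.
with open("headers.model", "rb") as fh:
    vectorizer, model = pickle.load(fh)
print(model.predict(vectorizer.transform(["Host,Connection,user-agent"])))
```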
So in this instance, what I'm doing is Assimilate is first loading up the models. And I'll explain how we build the models and all that in a minute. And it's testing that specific Bro log. I'm sending the output to a file called findings.txt, because of course I'm gonna want to go chase those down later. And then I did a dash v, just so it'll list things out a little more verbosely as it goes, so that you can see that it's actually doing stuff. And let me just speed things up a little bit there. So I jump through. Horribly exciting, isn't it? Like every tool. So literally, what this is doing is really straightforward. It's just ingesting that Bro output log and parsing it, looking for anomalous activity. Stuff that you may want to go hunt on. You may want to go chase this down, because this is potentially bad in your environment.

So what are we looking at there? Let me get a little more particular. In this case, I'm using an algorithm called Naive Bayes. Anybody recognize the Naive Bayes algorithm? Right, that's used for what? Shout it out, don't be bashful. Spam, exactly. Naive Bayes has been used since literally the '90s to find spam. Naive Bayes is a really, really excellent algorithm for looking at textual content. And what are HTTP headers? Textual content. Now, this is not by any means the only algorithm. I picked this one actually because I've got several of these tools that I've developed that I'll be releasing here over the next bit. But the original one I was actually gonna release for the talk used another algorithm called Random Forest. But Dave Bianco and Chris McCubbin released a tool called Clearcut, that I'll talk more about in a second, that uses Random Forest. So instead of duplicating, we've already got a great tool that uses that. Let's do a new tool. So I went with Naive Bayes.

So, in the process here, had I shown you the whole demo or video, it would have parsed through 37,440 HTTP headers. In the end, it found 46 things that looked suspicious, which is about a tenth of a percent. How long is it gonna take you to parse through 37,440 headers? And what do you think your accuracy is going to be? This is where the tool can be helpful to help us find stuff. So this is an example of its output. What it's showing here is just the individual lines that had suspicious entries on them. And at this point, as a hunter, I would go chase those down.
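To make the assess step concrete, here is a minimal sketch of scoring a header log and appending hits to findings.txt. It reuses the hypothetical model file from the sketch above, and the log's field layout is an assumption, not Assimilate's actual format:

```python
# Minimal sketch of the assess step: walk a Bro header log, score each
# line's header string with the trained model, and append hits to
# findings.txt. The field layout is hypothetical; a real Bro log declares
# its columns in a #fields line, which actual code would parse instead of
# hard-coding an index.
import pickle

with open("headers.model", "rb") as fh:
    vectorizer, model = pickle.load(fh)

with open("http_headers.log") as log, open("findings.txt", "a") as findings:
    for line in log:
        if line.startswith("#"):
            continue  # skip Bro's metadata lines
        fields = line.rstrip("\n").split("\t")
        header_string = fields[-1]  # assume the headers are the last column
        if model.predict(vectorizer.transform([header_string]))[0] == 1:
            findings.write(line)  # suspicious: chase this one down later
```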
So what do we need to do this? It's pretty straightforward. You're gonna need Python. There are two modules you need: scikit-learn and pandas. You're gonna need some packet captures of non-malicious activity. This is actually, in my experience, the hardest thing to come by. So in your environment, the beauty of machine learning is that its accuracy will be best if you build the model in your environment, on your traffic. Because your environment's traffic is unique. Even though it's using the same RFCs and the same protocols all the rest of us are using, the actual day to day, hour by hour, minute by minute output is unique. And so, capture some non-malicious activity, right. You just need some pcaps. I've got some scripts and stuff to convert them that I'll show here in a second. But the trick is, you want to get it without malicious activity in it. (laughing) So that can be tricky. 'Cause otherwise you're gonna teach the machine learning algorithm that your bad is normal and good. And that's probably not what we're looking to achieve. Packet captures of malicious activity, that's easy. You can get 'em from your environment, and I'll give you some other spots.

You're gonna need Bro, and you're gonna need a customized Bro HTTP header script. So, I'm not using the Bro HTTP module. That's what Clearcut uses. Great module. If you're familiar with Bro, Bro has a standard module, on by default, called Bro HTTP, where it breaks out all of the individual HTTP headers. Bro HTTP header is actually a separate output within Bro. It's been available for years, but it's not on by default. So I've customized a script to turn it on. All of this is packaged up, by the way, at the GitHub link there that I've got at the bottom, which has got this tool. What I customized is really simple. I added the Bro conn reference. Everybody's familiar with Bro and the Bro conn reference? That gives us the ability, when we're hunting this data, to pivot back to the full session so we can do the full investigation.
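Since the point of adding the conn reference is the pivot back to the full session, a small sketch of that pivot follows; the uid value and the column position are assumptions:

```python
# Small sketch of the pivot described above: use the conn uid that the
# customized header script adds to jump from a suspicious header line back
# to the full session record in conn.log. The uid value below is made up,
# and real code would read conn.log's #fields line instead of assuming
# that uid is the second column.
def find_session(uid, conn_log="conn.log"):
    """Return every conn.log entry whose uid column matches."""
    matches = []
    with open(conn_log) as log:
        for line in log:
            if line.startswith("#"):
                continue  # skip Bro's metadata lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1 and fields[1] == uid:
                matches.append(line)
    return matches

# Usage: take the uid from a findings.txt hit, then pull the session record.
for entry in find_session("CHhAvVGS1DHFjwGM9"):
    print(entry, end="")
```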
And then, we need of course Bro itself, and the Assimilate Python scripts. All of that. Well, the scikit-learn modules and the packet captures are not at the GitHub, but the rest is there.

All right, so how do we do this then, step by step? Well, the first thing you've gotta do is go out and collect and process some data. So you need to go out and capture those pcaps. Then you're gonna convert them into Bro HTTP headers. That, we'll use to train the model. I'll show you how to do that in just a second here. So the training will just literally ingest the Bro output, both the normal and the malicious, and it will build a model file and save it out. From then on, all you need to do is run it. That should result in some suspicious entries. You look at those suspicious entries. When you get those suspicious entries, you're gonna find, especially initially, oh no, this is legitimate traffic, which is fine. You just feed that back in and retrain your model to tighten it up a little bit, and you win. That's the overall step by step.

So literally, you can do this without having a bunch of data science expertise, without having a deep, deep understanding of machine learning. Now, that said, I do recommend at least getting a basic premise, or primer maybe is a better word for it, on machine learning. But given that we only had half-hour windows for these talks, that's a little bit too much to bite off for this. And indeed, if you go back to David Bianco's BSides DC 2016 talk, he does a great job of explaining the fundamentals. He did that when he released Clearcut. So, no need to reinvent the wheel. Go Google Clearcut and Bianco, you'll find it, and profit.

So, a simple diagram of what this looks like. We take Wireshark, we collect some normal traffic from our environment, get those pcaps, right. If you're not familiar with malware-traffic-analysis.net, shame on you. He does just a fantastic job of keeping an up-to-date collection of the latest, greatest fun that is being thrown at us by a bunch of the adversaries. Internal malicious traffic, so take your zoo, run some pcaps out of Cuckoo, stuff like that. Grab those pcaps. And then what you're gonna want to do, I'll show you the script in a second, is you're gonna need to label it, right. So separate those pcaps: here's normal traffic, here's malicious traffic, okay. And then, you'll need to install the customized HTTP headers Bro module that's up on GitHub. And then just process all those packet captures with dash r. Right, bro -r your pcap file, and that'll spit out all of the headers that you need. Just collect those headers into a directory, run Assimilate on it, boom. It's that easy, okay.

So this is what the customized Bro HTTP headers script looks like. Like I said, really the only thing I'm doing here is putting out the HTTP headers function from Bro. What that is, if you're not familiar, by the way, is just a serial string of all of the headers together. So it's got the user agent, the URI, all of the HTTP headers in one single string, which is convenient for this. And then, again, I'm trying to make the barrier to entry as low as possible. So this is a shell script that I wrote for processing all the pcaps. If you don't even want to bother with understanding how to run bro -r, this'll just iterate through a directory, take all of the pcaps, run them through bro -r, and extract out the Bro HTTP and Bro HTTP headers logs. I take both of those out because I also run Clearcut. And so that puts it into a folder so I can train Clearcut as well, so I can double the bang for my buck.

And then it looks like this running. Pretty straightforward. So here, I'm running the shell script. Notice, I've got there a directory full of pcaps. And I'm now seeing an HTTP.log and an HTTP headers.log. What I had my script do is just name the log files the same name as the pcaps, so I could easily keep track of what log files went with what pcaps. It just calls on Bro to process through. It runs pretty quick. Bro's really fast. So a folder of pcaps, a quick shell script, and Bob's your uncle.
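The shell script itself is in the repo; for anyone who would rather stay in Python, a rough equivalent of that processing loop might look like this, assuming bro is on the PATH, with hypothetical directory and log names:

```python
# Rough Python equivalent of the pcap-processing shell script described
# above, assuming `bro` is on the PATH and the customized http-headers
# script is installed. Directory names and the header log's filename are
# hypothetical. Each pcap's logs are renamed after the pcap, as in the demo.
import pathlib
import shutil
import subprocess

PCAP_DIR = pathlib.Path("pcaps")      # hypothetical input directory
OUT_DIR = pathlib.Path("bro_logs")    # hypothetical output directory
OUT_DIR.mkdir(exist_ok=True)

for pcap in sorted(PCAP_DIR.glob("*.pcap")):
    # Process one capture offline; Bro writes its logs to the working dir.
    subprocess.run(["bro", "-r", str(pcap)], check=True)
    # Keep both logs so Clearcut can be trained from the same captures.
    for log_name in ("http.log", "http_headers.log"):
        log = pathlib.Path(log_name)
        if log.exists():
            shutil.move(str(log), str(OUT_DIR / f"{pcap.stem}_{log_name}"))
```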
So once you've processed the pcaps, what we end up with is two folders: one that's my normal training data, and one that's my malicious training data. So now, at this point, I simply run assimilate-train. So... assimilate-train.py, and there's the parameters. In this case, notice the dash n is pointing to a directory of HTTP headers with normal traffic, and the dash m is malicious. And so literally, it's gonna ingest all of those headers and use that to train my model files. Obviously, there's all kinds of parameters there you can add, like custom naming your model files and stuff like that. But the defaults should work for you pretty well. And that'll take a bit to run through. And then it's gonna spit out the model files. I'll jump ahead in the interest of time. Now in this case, I trained some massive files. Again, I have a lot of data. So my model files are over 500 meg in size. Yours probably won't be that large. That's not a bad thing. I'm just a big believer in being overly thorough and trying to be as accurate as possible.

So then once we've got it trained, we're good to go. Now there's a few gotchas that you want to bear in mind here. The more data, i.e. the more normal traffic and the more malware you have, the better your accuracy. But unfortunately, on the flip side, the bigger your models are, i.e. that higher accuracy, the slower it will run. So on my to-do list, I'm actually gonna redo some of the Assimilate functions in Lua. There's an inline Lua capability for Python that's really nice for speeding up Python; side trick. But I didn't get that done in time for the event here. So that will come in the not too distant future.

Another big gotcha is that Bro normalizes headers by default. I ran into Seth and a bunch of the Bro folks out at RSA this year, and I gave them crap about not giving me an option natively in Bro to turn off header normalization. So Bro, today, when it takes all the headers, upper-cases all of the headers themselves. That actually dramatically reduces your accuracy. For my personal copy of Bro that I actually used for training, I modified the Bro source code and recompiled it to take off the header up-casing. The accuracy dramatically goes up.

The reason it goes up is because... The reason I love HTTP headers for hunting is that in the real world, the RFCs for HTTP don't tell our developers and software folks, well, this is the standard HTTP header order, and you're supposed to use camel case here, whatever. None of that is specified. But most legitimate software is using either the Windows libraries to generate its HTTP, the Mac libraries, or the Linux libraries. And those libraries absolutely do have conventions. They have just kind of standardizations. And the bad guys don't know what those are, for the most part. I'm sure there are some bad guys out there that are smart enough to have cross-referenced that, but most of them don't. And so they'll make little, tiny typos and little, tiny errors in their implementation, 'cause they're hand rolling their comms. They're trying to make it look like our legitimate HTTP, but because they're hand rolling it, they'll make little nuanced mistakes. And that's where your differentiation comes from. Unfortunately, because Bro upper-cases the headers, we lose a lot of those nuances. And so, if you're gonna do this for real, I highly recommend... Now, Seth promised that he's gonna have a command-line option or some option to turn off header normalization at some point. We'll see. In the meantime, it's literally just a one-line code change in Bro to fix it. Now, it does break a bunch of other stuff in Bro, which is why they haven't done it generally, just to be clear. But when you're running Bro locally just to process pcaps, it doesn't matter if the other stuff doesn't work right.

And then, another thing is tighter scoping on those, and what I mean by this is the malicious captures. If I go out to malware-traffic-analysis.net and I download some of those fantastic pcaps, well, if you haven't looked really closely at how malware operates, it does a lot of the same things that legitimate software does, right. A lot of times, when the malware first goes to connect, it'll try to connect to Google, just to see if it has internet connectivity. Or Microsoft's certificate validation function in Windows will automatically kick off, right. And so if you take the time to go into the Bro header conversions of the malware pcaps and just nuke those standard lines out, again, your accuracy will go way, way up. Okay, so a few tips.
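A sketch of that cleanup pass, scrubbing obviously benign lines out of the malicious training logs before training; the marker list is illustrative, not a vetted set:

```python
# Sketch of the "tighter scoping" cleanup just described: strip obviously
# benign lines (connectivity checks, certificate validation, and the like)
# out of the malicious training logs so they don't pollute the model.
# The marker list is a couple of illustrative examples, not a vetted set.
BENIGN_MARKERS = (
    "www.google.com",           # malware checking for internet connectivity
    "crl.microsoft.com",        # Windows certificate revocation checks
    "ctldl.windowsupdate.com",  # Windows certificate trust list updates
)

def scrub(in_path, out_path):
    """Copy a Bro header log, dropping lines that match a benign marker."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            keep = line.startswith("#")  # always keep Bro's metadata lines
            keep = keep or not any(marker in line for marker in BENIGN_MARKERS)
            if keep:
                dst.write(line)

# Hypothetical paths: scrub each converted malware pcap log before training.
scrub("malicious/sample_http_headers.log", "malicious/sample_scrubbed.log")
```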
And then after that, it's literally just a matter of running it. So again, the intent of this tool is just to be able to, right, run it and assess your Bro files. Once you've collected your pcaps, processed them with Bro to get the headers out, and used those to train your models, from then on, you just run it. And it's pretty straightforward. You can run it against an individual log file, or you can run it against an entire directory of log files. Both are supported. And that's what I demoed at the beginning.

All right, so we run our Assimilate. I mentioned earlier that I think this has gotten to the point where this is a black box. And the reason I say that is because when you download Assimilate from GitHub, what you're gonna see is that these are not big programs, right. They're very small programs. Indeed, the actual code to do the data science part is like four lines. Most of the actual work is parsing the Bro log files. And so, ultimately, what I want you to take away from this, I hope, is that, like I said, because of all of the stuff that's been done by people way, way smarter than me, I can go out and take a specific use case. Literally, Assimilate takes one specific use case, looking for particular types of malicious activity in HTTP headers, right, a hunting use case, and turns it into a tool that I can then use on an ongoing, real-time basis. (upbeat drumming)
Info
Channel: SANS Digital Forensics and Incident Response
Views: 31,065
Rating: 4.9056048 out of 5
Keywords: digital forensics, incident response, threat hunting, cyber threat intelligence, dfir training, dfir, yt:cc=on
Id: TTbZd0he94U
Length: 28min 9sec (1689 seconds)
Published: Sat Oct 14 2017