(chords ringing) (audience clapping) - I've been a threat
hunter for a long time. And specifically, before I was with
my current employer, I worked with one of
those consulting firms, we'd do hunt missions, right. And we'd go out. And we'd typically spend
somewhere between two weeks and a month with
an organization. We'd spend a half day, here's the principles
of hunting, and then we'd hunt with them
in their data sources, right. Because hunting is just
one of those things that you really need
to learn by doing. Right, experience
is where you get it. And it was there that
the idea for this really germinated for me. Because we've got all
of these use cases. Several have been alluded to, right, where we've got situations where, time and again, in your experience as a hunter, you find that, oh, the bad guys love to use this technique. And when they use this technique, I go look in this data source. And in this data source, I look at these places. Use cases based upon, as Rob talked about in his talk, the intel of the actual
adversaries, right. Well that's where machine
learning can benefit us. I am not a data scientist. I do not claim to
be a data scientist. But here's what I'm hoping to convince you of today. I'm actually releasing a new tool to help with this. There's another one out there
that I'll also refer to. But machine learning has come to the point where we, as professionals, can run it as a black box. You do not need to be a
data scientist anymore. And hopefully I will
convince you of that. But I'm gonna start
off with a quiz. I know it's not lunchtime yet. I apologize. But somebody must
recognize what this looks like. What's this a
Wireshark capture of? How 'bout this one? Anybody recognize this one? This is our old
friend, Hancitor. And then how 'bout this one? This would be good old Kovter dropping Locky on your environment. So, the reason I
brought these up is because in this room, we probably have a cross-section of some of the top talent in
the world at incident response. And even we, looking at these Wireshark captures, are hard-pressed to say: what is this? How do we use this, right? That, I think, is the root challenge. If you're at most organizations and you want to get rolling in cyber hunting, you've got some challenges. One, you gotta find
experienced practitioners. Because I can tell you that
an experienced threat hunter versus a non-experienced
threat hunter, your results are going
to be night and day. Two, it really takes a long
time to develop these skills. I would consider
myself a moderately
capable threat hunter, and I've been focusing on
this for at least a decade. Three, often we have too much data to look through. I work at an organization
with 350,000 employees. We have hundreds of
thousands of computers. I have a Bro sensor grid
of over 200,000, or 2,000. 200,000 would be
really impressive. 2,000 Bro network sensors. That's a lot of data
to sift through. And so what we do in our case is we have a weekly
hunt mission. So one of the things we've found works really effectively, and one of our team members is actually gonna talk about this tomorrow in a lot more detail, so I don't want to steal his thunder, but we rotate folks through. So we have a week where a portion of the CSIRT team, plus some of our other groups that support us, spend the week just hunting on a particular hunt mission. Then they cycle back around, so each particular use case gets hunted in turn. So maybe last week, we hunted OWA logs
for indicators of
particular threats. By the time they circled back
around to those OWA logs, it might be three
months from now, just because of
the sheer quantity. Then on the other side, there's some really, really
strong benefits from hunting. So one of the questions
earlier I think for Rob was well how do you quantify
the value of threat hunting? I personally, actually
think the output of finding the unknown malicious activity is a secondary benefit
to threat hunting. I really do. I think the really strong
benefits for threat hunting are one, it drives
continuous improvement in our detection capabilities. Right, if we find
something in a threat hunt, then we should
immediately be pivoting to how do we find this
on an ongoing basis? How do we find this on a
constant, iterative basis? And then, and I think ultimately most important, in case you didn't get from Rob's keynote that the most important thing is the people: it's an incredible
mentoring vehicle. So in our case, we have one of our really
experienced incident handlers, our top level analysts, leading the hunt mission, backed up by a level
two and a level one. Not because the experienced
person really needs the level two and the level one, but because it's
just a fantastic way to help the level one analyst
become a level two analyst, and the level two analyst
become a level three analyst. 'Cause we have to
grow our capabilities. So, that's where I see
the real opportunity for machine learning. Now, I'm gonna differ
from all of the vendors that you might have
talked to at RSA and some of these shows, who say artificial intelligence and machine learning are these magic silver bullets that are gonna solve all your problems. That's bull pucky. Just gonna put that
out there right now. There's nothing magical
about machine learning. It's just giving a
computer an ability to do some of the type of
analytical work that we do. And then specifically, I think it's about taking particular use cases. So for instance, the tool that I'm gonna demo here in just a second, which I've called Assimilate, 'cause somebody keeps referring to my team as the Borg Collective. Assimilate is going to ingest raw HTTP headers, looking for unknown
malicious activity. Right, and then it kicks out: hey, this is interesting. Now I want to be clear here. This is not intended as
a replacement for your IDS stuff, your other detection systems. This is a hunting tool. What machine learning
is gonna put out is going to be useful for you as a hunter to go through. This is not gonna replace
people, anything like that. But what this does, is it allows you to
take specific use cases. The last talk, where he talked about taking those use cases and defining them, and those becoming inputs for further work, that's what this allows you to do. So, enough of me pontificating. Let me switch over
to a demo here. So this is an actual
screen capture. I did a screen capture
because I was trying to avoid the... And it looks like our
resolution is off a little bit. But I think we'll still
get the main thing. So in an attempt to stymie the demo gods, what I did was I
just screen captured. So I'm running Assimilate here. You notice I've got a directory of a whole bunch of Bro logs. I mentioned I've got
over 2,000 Bro sensors. So I have pretty sizeable
Bro logs to work with. We get a few terabytes a
day which is kinda handy. And so what I'm doing
here is I'm running Assimilate Assess. So there's two components
to machine learning. First, you've got to
train a model, okay. And then you've got to apply
that model against your data. So in this instance, Assimilate is first loading up the models. And I'll explain how we build the models and all that in a minute. And it's testing that specific Bro log. I'm writing my output into a file called findings.txt, because of course I'm gonna want to chase those findings down later. And then I did a dash v, just so it'll list things out a little more verbosely as it goes, so that you can see that it's actually doing stuff. And let me just speed things up a little bit there. So I jump through. Horribly exciting, isn't it? Like every tool.
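Conceptually, a pass like that boils down to just a few steps. Here's a rough sketch of the idea (illustrative only, not the actual Assimilate code; it assumes a model and vectorizer that were previously trained and saved with joblib, and that each non-comment log line carries one serialized header string):

```python
# Rough sketch of an assess pass: load a previously trained model,
# score each header line from a Bro log, and write anything that
# looks suspicious to findings.txt for a hunter to chase down.
# File names and the saved-model format are illustrative assumptions.
import joblib

vectorizer = joblib.load("assimilate_vectorizer.pkl")
model = joblib.load("assimilate_model.pkl")

with open("http_headers.log") as log, open("findings.txt", "w") as findings:
    for line in log:
        if line.startswith("#"):   # skip Bro's metadata/comment lines
            continue
        header = line.rstrip("\n")
        if model.predict(vectorizer.transform([header]))[0] == 1:
            findings.write(header + "\n")
```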
So literally, what this is doing, though, is really straightforward. It's just ingesting that Bro output log and parsing it, looking for anomalous activity. Stuff that you may want to go hunt on. You may want to go chase this down, because this is potentially bad in your environment. So what are we looking at there? Let me get a little
more particular. So in this case, I'm using an algorithm called Naive Bayes. Anybody recognize the Naive Bayes algorithm? Right, that's used for what? Shout it out, don't be bashful. Spam, exactly. Naive Bayes has been used since literally the '90s to find spam. Naive Bayes is a really, really excellent algorithm for looking at textual content. And what are HTTP headers? Textual content. Now, this is not by any means the only algorithm. I picked this one because I've got several of these tools that I've developed that I'll be releasing over the next bit. But the original one I was actually gonna release for this talk used another algorithm, called Random Forest. But Dave Bianco and Chris McCubbin released a tool called Clearcut, which I'll talk more about in a second, that uses Random Forest. So instead of duplicating a great tool that already exists, let's do a new tool. So I went with Naive Bayes. So, in the process here, had I shown you the whole demo video, it would have parsed through 37,440 HTTP headers. In the end, it found 46 things that looked suspicious, which is about a tenth of a percent.
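To make that concrete, here's roughly what a Naive Bayes text classifier over raw header strings looks like in scikit-learn. This is a minimal sketch, not the actual Assimilate source; the file names and the sample header are illustrative:

```python
# Minimal sketch: train a Naive Bayes classifier on raw HTTP header strings.
# Assumes one serialized header string per line in each (hypothetical) file.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

normal = open("normal_headers.txt").read().splitlines()        # benign
malicious = open("malicious_headers.txt").read().splitlines()  # known-bad

# lowercase=False preserves header-casing quirks, which matters here
# (see the Bro header-normalization gotcha later in the talk).
vectorizer = CountVectorizer(lowercase=False)
X = vectorizer.fit_transform(normal + malicious)
y = [0] * len(normal) + [1] * len(malicious)  # 0 = normal, 1 = malicious

clf = MultinomialNB()
clf.fit(X, y)

# Score an unseen header string; a prediction of 1 is worth hunting on.
unseen = ["GET /gate.php HTTP/1.1 User-agent: Mozilla/4.0"]  # made-up example
print(clf.predict(vectorizer.transform(unseen)))
```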
How long is it gonna take you to parse through 37,440 headers by hand? And what do you think your accuracy is going to be? This is where the tool can help us find stuff. So this is an example
of its output. So what it's showing here
is just the individual lines that had suspicious
entries on them. And at this point as a hunter
I would go chase those down. So, what do we need to do this? It's pretty straightforward. You're gonna need Python. There's two modules you need: scikit-learn and pandas. You're gonna need
some packet captures of non-malicious activity. This is actually,
in my experience, the hardest thing to come by. So in your environment, the beauty of machine
learning is that its accuracy will be best if you build the model
in your environment on your traffic. Because your environment's
traffic is unique. Even though it's
using the same RFCs and the same protocols all
the rest of us are using, the actual day to day, hour by hour, minute by minute
output is unique. And so, you're capturing some non-malicious activity, right. You just need some pcaps. I've got some scripts
and stuff to convert it that I'll show here in a second. But the trick is, you want to get it without malicious activity in it. (laughing) So that
can be tricky. 'Cause otherwise
you're gonna teach the machine learning algorithm that your bad is
normal and good. And that's probably not what
we're looking to achieve. Packet captures of malicious
activity, that's easy. You can get 'em from
your environment. There's some other sources, and I'll give you some spots. You're gonna need Bro. You're gonna need a customized
Bro HTTP header script. So, I'm not using
the Bro HTTP module. That's what Clearcut uses. Great module. So if you're familiar with Bro, Bro has a standard module on
by default called Bro HTTP where it breaks out all of
the individual HTTP headers. Bro HTTP header is
actually a separate output within Bro. It's been available for years. It's not on by default. So I've customized a script to turn it on. All of this is packaged
up, by the way, on the GitHub link there that I've got at the bottom, which has got this tool. What I customized is really simple. I added the Bro conn reference, the connection UID. Everybody familiar with Bro and the conn reference? That gives us the ability, when we're hunting this data, to pivot back to the full session so we can do the full investigation.
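That pivot is easy to script, too. Here's a quick illustrative sketch (the helper and the example UID are hypothetical; Bro logs are tab-separated and the UID column position can vary, so this just substring-matches):

```python
# Sketch: given the conn UID from a flagged http_headers entry,
# pull the matching conn.log rows so you can see the full session.
def pivot_to_conn(uid, conn_log="conn.log"):
    with open(conn_log) as f:
        return [line for line in f
                if not line.startswith("#") and uid in line]

# Example UID format only; use one from your own flagged findings.
for row in pivot_to_conn("CHhAvVGS1DHFjwGM9"):
    print(row, end="")
```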
And then we need, of course, Bro itself, and the Assimilate Python scripts. All of that, well, the scikit-learn modules are not available at the GitHub, and neither are the packet captures. But the rest is at the GitHub there.
All right, so how do we do this then, step by step? Well, the first thing you've gotta do is go out and collect and process some data. So you need to go out, capture those pcaps. Then you're gonna convert them into Bro HTTP headers. That's what we'll use to train the model. I'll show you how to do that in just a second here. So the training will just literally ingest the Bro output, both the normal and the malicious, and it will build a model file and save it out. From then on, all you need to do is run it. That should result in some suspicious entries. You look at those suspicious entries, and you're gonna find, especially initially: oh no, this is legitimate traffic. Which is fine. You just feed that back in and retrain your model to tighten it up a little bit, and you win. That's the overall step by step. So literally, you can do this without having a bunch of data science expertise, without having a deep, deep understanding of machine learning.
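That feedback step, by the way, can be as simple as appending the false positives you triaged to your normal training set and re-running the training. A sketch, with illustrative file names:

```python
# Sketch of the retrain loop: fold the entries you triaged as
# legitimate back into the "normal" corpus, then rebuild the model.
with open("false_positives.txt") as fps, open("normal_headers.txt", "a") as normal:
    for line in fps:
        normal.write(line)  # counts as normal on the next training run

# Then retrain, e.g.:
#   python assimilate-train.py -n normal_headers/ -m malicious_headers/
```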
Now, that said, I do recommend at least getting a basic premise, a primer maybe is a better word for it, on machine learning. But given that we only
had half hour windows for these talks, that's a little bit too
much to bite off for this. And indeed, if you go back to David Bianco's BSides DC 2016 talk, he does a great job of
explaining the fundamentals. He did that when he
released Clearcut. So, no need to
reinvent the wheel. Go Google Clearcut and
Bianco, you'll find it, and profit. So, simple diagram of
what this looks like. So we take Wireshark, we collect some normal
traffic from our environment, get those pcaps, right. If you're not familiar with
malware-traffic-analysis.net, shame on you. He does just a fantastic job of keeping an up-to-date record of the latest, greatest fun that is being thrown at us by the adversaries. Internal malicious traffic: take your zoo, run some pcaps out of
Cuckoo, stuff like that. Grab those pcaps. And then what you're
gonna want to do, I'll show you the
script in a second, you're gonna need
to label it, right. So separate those pcaps. Here's normal traffic, here's malicious traffic, okay. And then, you'll need to install the customized HTTP headers Bro
module that's up on GitHub. And then just process
all those packet captures with dash r. Right: bro -r, your pcap file. That'll spit out all of the headers that you need. Just collect those
headers into a directory, run Assimilate on it, boom. It's that easy, okay. So this is what the customized Bro HTTP headers
script looks like. Like I said, really the
only thing I'm doing here is putting out the HTTP headers function from Bro. What that is, if you're
not familiar, by the way, is that it's just a serial string of all of the headers together. So it's got the user agent, the URI, all of the HTTP headers in one single string, which is convenient for this. And then, again, I'm trying
to make the barrier to entry as low as possible. So this is a shell script that I wrote for personally processing all the pcaps. So, if you don't even want to bother with understanding how to run bro -r, this'll just iterate through a directory, take all of the pcaps, run them through bro -r, and extract out both the Bro HTTP and Bro HTTP headers logs. I take both of those out because I also run Clearcut. And so that puts it into a folder so I can train Clearcut as well, and double the bang for my buck.
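If you'd rather do that loop in Python than shell, something like this works. It's a sketch: it assumes bro is on your PATH with the customized header script enabled, and that Bro drops its logs in the working directory:

```python
# Sketch: run Bro over a directory of pcaps, renaming the output logs
# after each pcap so logs and captures stay matched up.
import os
import shutil
import subprocess

os.makedirs("headers", exist_ok=True)
for pcap in sorted(os.listdir("pcaps")):
    if not pcap.endswith(".pcap"):
        continue
    subprocess.run(["bro", "-r", os.path.join("pcaps", pcap)], check=True)
    base = os.path.splitext(pcap)[0]
    # Bro writes http.log / http_headers.log into the current directory.
    for log in ("http.log", "http_headers.log"):
        if os.path.exists(log):
            shutil.move(log, os.path.join("headers", base + "_" + log))
```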
And then it looks like this running. Pretty straightforward. So here, I'm running the shell script. Notice, I've got a directory full of pcaps there. And I'm now seeing an http.log and an http_headers.log. What I had my script do is just name the log files with the same name as the pcaps, so I could easily keep track of which log files went with which pcaps. And it just calls on Bro to process through. It runs pretty quick. Bro's really fast. So: a folder of pcaps, a quick shell script, and Bob's your uncle. So then, you've processed the pcaps. Once you've got that, what we end up with is two folders: one that's my normal training data, and one that's my
malicious training data. So now, at this point, I simply run assimilate-train. So... assimilate-train.py, there's the parameters. And in this case, notice the dash n is pointing to a directory of HTTP headers with normal traffic, and the dash m is the malicious one. And so literally, it's gonna
ingest all of those headers, use that to train
my model files. Obviously, there's all kinds of parameters there you can add, like custom naming your model
files and stuff like that. But the defaults should
work for you pretty well. And that'll take a
bit to run through. And then it's gonna spit out, spit out the model files. I'll jump ahead in
interest of time. Now in this case, I trained some massive files. Again, I have a lot of data. So my model files are
over 500 meg in size. Yours probably
won't be that large. That's not a bad thing. I'm just a big believer in being overly thorough and trying to be as
accurate as possible. So then once we've
got it trained, we're good to go. Now, there's a few gotchas that you want to bear in mind here. The more data, i.e., the more normal traffic and the more malware you have, the better accuracy you've got. But unfortunately, on the flip side, the bigger your models are, i.e., that higher accuracy, the slower it will run. So on my to-do list, I'm actually gonna redo some of the Assimilate
functions in Lua. There's an inline Lua
function for Python that's really nice for
speeding up Python, side trick. But I didn't get
that done in time for the event here. So that will come in the
not too distant future. Another big gotcha is that Bro normalizes headers by default. I ran into Seth and a bunch of the Bro folks out at RSA this year, and I gave them crap about not giving me an option natively in Bro to turn off header normalization. So Bro, today, when it takes all the headers, it upper-cases all of the headers themselves. That actually dramatically reduces your accuracy. My personal copy
of Bro that I used for actually training, I modified the Bro source
code and recompiled it to take off the
header up-casing. The accuracy
dramatically goes up. The reason it goes
up is because... the reason I love HTTP headers for hunting is because, in the real world, for our developers and our software folks, the RFCs for HTTP don't say, well, this is the standard HTTP header order, and you're supposed to use camel case here, whatever. None of that is specified. But most legitimate software
Windows libraries to generate their HTTP, the MAC libraries, or the Linux libraries. And those libraries absolutely
do have conventions. They have these kinds of de facto standardizations. And so, the bad guys
don't know what those are, for the most part. I'm sure there's some
bad guys out there that are smart enough to
have cross-referenced that. But most of them don't. And so they'll make
little, tiny typos, and little tiny errors
in their implementation, 'cause they're hand
rolling their comms. They're trying to
make it look like our legitimate HTTP. But because they're
hand rolling it, they'll make little
nuanced mistakes. And that's where
your differentiation. Unfortunately, because Bro
upper cases the headers, we lose a lot of those nuances. And so, if you're
gonna do this for real, I highly recommend... Now Seth promised that
he's gonna have an option to command line or option to
turn off header normalization at some point. We'll see. In the meantime, it's just
literally one line code change in Bro to fix it. Now it does break a bunch
of other stuff in Bro, which is why they haven't
done it generally, just to be clear. But when you're
running Bro locally just to process pcaps, it doesn't matter if the other
stuff doesn't work right. And then, another thing is tighter scoping on those. So what I mean by
this is the malicious. So if I go out to
malware-analysis.net and I download some of
those fantastic pcaps, if you haven't
looked really closely at how malware operates, it does a lot of the same things that legitimate
software does, right. So a lot of times, when the malware
first goes to connect, it'll try to connect to Google, just to see if it has
internet connectivity. Or Microsoft's certificate
validation function in Windows will automatically
kick off, right. And so if you take the time to go into the Bro
header conversions of the malware pcaps and just nuke those
standard lines out, again, your accuracy
will go way, way up. Okay, so few tips. And then after that,
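That cleanup is easy to automate. A sketch (the benign-indicator list here is just an example; build your own from what actually shows up in your malware logs):

```python
# Sketch: strip obviously-benign lines (connectivity checks, cert
# validation traffic, etc.) out of malware-derived header logs
# before training, so that behavior isn't labeled malicious.
BENIGN_MARKERS = ("www.google.com", "www.msftncsi.com", "crl.microsoft.com")

with open("malware_http_headers.log") as src, \
        open("malware_http_headers.clean.log", "w") as dst:
    for line in src:
        if line.startswith("#") or not any(m in line for m in BENIGN_MARKERS):
            dst.write(line)
```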
And then after that, it's literally just a matter of running it. So again, the intent of this tool is just to be able to run it and assess your Bro files. Once you've collected your pcaps, processed them with Bro to get the headers out, and used those to train your models, from then on, you just run it. And it's pretty straightforward. You can run it against an individual log file, or you can run it against an
entire directory of log files. Both are supported. And that's what I
demoed at the beginning. All right, so. So, we run our Assimilate. Now, I mentioned earlier that I think this has gotten to the point where it's a black box. And the reason I say that is because when you download Assimilate from GitHub, what you're gonna see is that these are not big programs. They're very small programs. Indeed, the actual code to do the data science part is like four lines. Most of the actual work is parsing the Bro log files.
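For instance, the log-parsing side boils down to something like this (a sketch, not the actual Assimilate code; Bro logs are tab-separated with '#'-prefixed metadata lines, and the real column names come from each log's #fields line):

```python
# Sketch: load a tab-separated Bro log with pandas, skipping the
# '#'-prefixed metadata lines Bro writes at the top and bottom.
import pandas as pd

df = pd.read_csv("http_headers.log", sep="\t", comment="#", header=None)
print(df.head())   # eyeball the columns, then name them per the #fields line
```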
And so, ultimately, what I want you to take away from this, I hope, is that, like I said, because of all of the stuff that's been done by some phenomenal people way, way smarter than me, I can go out and take a specific use case. Literally, Assimilate is taking one specific use case, looking for particular types of malicious activity in HTTP headers. Right, so a hunting use case, and turning it into a tool that I can then use on an ongoing, real-time basis. (upbeat drumming)