Applied Data Science and Machine Learning for Cybersecurity - SANS Tactical Detection Summit 2018

Video Statistics and Information

  • Original Title: Applied Data Science and Machine Learning for Cybersecurity - SANS Tactical Detection Summit 2018
  • Author: SANS Institute
  • Description: SANS Summit schedule: http://www.sans.org/u/DuS Presenter: Austin Taylor, IronNet Cybersecurity; Community Instructor, SANS Institute Determining which ...
  • Youtube URL: https://www.youtube.com/watch?v=m2AgYbbXz8k
Captions
All right, so good afternoon everyone. My name is Austin Taylor. I'm a security researcher and the director of our cybersecurity research and development team at a company called IronNet Cybersecurity. The genesis for this talk is that I've seen a growing gap in the community between data science and cybersecurity. I've been to a few data science conferences where they talk about very high-level, theoretical data science, but how does that actually apply to cybersecurity analysts and to cybersecurity? The goal for this talk is to leave you with some actionable items you can take back to your Security Operations Center and apply these data science principles to help you identify threats. We're going to talk about three use cases specifically: DGA, phishing, and general anomaly identification that you can use for threat hunting.

As I mentioned, I'm the director of cybersecurity research at IronNet Cybersecurity. We specialize in network behavioral threat analytics, so anything from beacons to lateral movement, DGA, and DNS tunneling; we have use cases for all of it, and it's all network based. There are a lot of challenges there, and it's been a long, hard road on the path to detection in the network space, but I'm happy to share a lot of those lessons learned in this presentation. I'm also in the Maryland Air National Guard, where I serve as a cyber warfare operations officer. In my spare time I contribute to a few open source projects. Justin and I wrote VulnWhisperer, a vulnerability aggregator: it aggregates all of your vulnerability reports, enriches them, and makes the data actionable. Flare is one I put together, and we're going to talk about it; it has some analytics built in, plus some enrichments you can use for doing data science. And then Bluewall: I'm on a Cyber Protection Team, and any time we get to a new environment we have to reconfigure our firewalls, so Bluewall is a firewall framework in Python. I have a blog that I haven't updated in a year, there's the URL, and if you care about Twitter, I'm on there as well.

Okay, so I'm going to give a high-level overview of machine learning, what machine learning is, and try to break down those communication barriers. Then I'm going to show you how to apply machine learning to some use cases, specifically DGA, phishing, and anomaly detection. I'm going to leave you with some questions you can ask of vendors who are making the machine learning claim, which I think are fair questions, and then finally, if there's enough time, I'll conclude with a demo of everything I'm talking about.

A high-level overview of machine learning: there are basically three types, supervised, unsupervised, and then reinforcement or semi-supervised learning. What we run into most of the time is the supervised learning arena. We're not going to cover unsupervised or reinforcement learning today, but those are other areas to be familiar with. Any time you have a set of labeled data, where you know whether something is malicious or benign, that's called binary classification, and there are two types of supervised machine learning. You can do classification; the little diagram here shows a student, and a classification problem is very binary, they either passed or failed their final exam. If you're doing regression, the output is a continuous value, and that value fluctuates across a range, say zero to a hundred percent. Those are the two types of supervised learning, and we're going to cover both in this presentation.
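To make the classification-versus-regression distinction concrete, here is a minimal, made-up sketch in the spirit of the student example on the slide: the same "hours studied" feature framed once as a pass/fail classification problem and once as a 0-100 exam-score regression problem. The data and the use of scikit-learn are my own assumptions for illustration; the talk doesn't prescribe a library.

```python
# Toy illustration (invented data): classification vs. regression on the same feature.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

hours  = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])            # binary label: pass/fail
score  = np.array([35, 41, 48, 52, 61, 58, 72, 80, 86, 91])  # continuous label: 0-100

# Hold back ~30% of the labeled data so the classifier is judged on examples it never saw.
X_tr, X_te, y_tr, y_te = train_test_split(
    hours, passed, test_size=0.3, random_state=0, stratify=passed)

clf = LogisticRegression().fit(X_tr, y_tr)   # classification: predicts a class
reg = LinearRegression().fit(hours, score)   # regression: predicts a continuous value

print(clf.predict([[6]]))       # e.g. [1]  -> predicted "pass"
print(clf.score(X_te, y_te))    # accuracy on the held-out 30%
print(reg.predict([[6]]))       # a continuous value, somewhere in the 60s
```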
The machine learning process is very iterative; I saw a few presentations that called that out. First you want to think about what problem you're trying to solve with machine learning, and then you go and grab your data, whether it's phishing or DGA or whatever the use case is. You want to make sure you have an abundance of data so you can accurately label it. Then you get into the data pre-processing phase, or the data janitor phase as I like to call it. Then you create some kind of model based on all the data you've gathered, and then it's time to evaluate it. There's something called the test/train split: essentially, if you have a hundred percent of the data, you want to hold back thirty percent of it and never expose it to your model. That way you can run it back through your model and see how well it's performing; you already know the answers to the test, you just want to see how well your model is doing, and that outputs some performance scores. Finally, you continue to improve the performance, whether by adding features to your model or otherwise. I'm not going to get too far into that, although I will cover what features are; the idea is that it's a very iterative process.

This is something I created called the data science hunting funnel. The top part represents 100 percent of your network traffic: without doing anything, if you're hunting for a threat or something malicious inside your network environment, you're starting with everything. These are arbitrary numbers, just approximations, but after you apply some type of machine learning for whatever use case you're focusing on, whether it's DGA or DNS tunneling or whatever question you're trying to answer in your environment, that's going to reduce your data down to about ten percent. From there, I'm a big believer in pairing domain knowledge with machine learning, because machine learning done well can point you in the right direction, but you're still going to have the problem of false positives as it relates to networks. You have to have that domain expertise and ask the right questions of the subset of data. And even after you apply your domain knowledge and your machine learning, it's still very difficult to find actually malicious things in your organization. This diagram is really just to help set expectations: you'll apply machine learning and all your domain expertise, you'll find some anomalous stuff, but it's going to be rare that it's actually bad traffic.

The first thing we're going to cover is DGA. Just a show of hands: who in here is more of a data scientist who is getting into the security industry? A few people. And who in here is security and has some familiarity with data science? Awesome, okay. I consider myself a security researcher first; I just picked up data science on the side to help me do my day job.

DGA: if you've ever done any type of security analysis, you've probably run into some form of DGA. Why? Because domains are very cheap to register (you can go on GoDaddy), and a DGA is driven by a deterministic value, so the attacker knows what the seed is and can use it to generate a large number of domains. It's very cost effective if you're an attacker, especially if you're using those free domains. This is a quick Python snippet on how to generate your own DGA; don't get too wrapped up in the code, but just know that if I'm the attacker, I know what the deterministic value is, so focus in on that. That's a date I'm passing through the function, and it generates this random domain name for me. That's DGA in a nutshell.
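The snippet from the slide isn't reproduced here, but a minimal sketch of the same idea, a date-seeded generator along the lines of the well-known textbook example, looks something like this. The bit-mixing constants, the 16-character length, and the .com suffix are illustrative, not taken from the talk.

```python
from datetime import date

def generate_domain(seed: date, length: int = 16) -> str:
    """Deterministically derive a pseudo-random domain from a date seed."""
    year, month, day = seed.year, seed.month, seed.day
    label = ""
    for _ in range(length):
        # Arbitrary bit mixing: attacker and malware both compute the same thing.
        year  = ((year  ^ 8 * year)    >> 11) ^ ((year  & 0xFFFFFFF0) << 17)
        month = ((month ^ 4 * month)   >> 25) ^ 16 * (month & 0xFFFFFFF8)
        day   = ((day   ^ (day << 13)) >> 19) ^ ((day   & 0xFFFFFFFE) << 12)
        label += chr(((year ^ month ^ day) % 25) + ord("a"))
    return label + ".com"

# Malware and attacker run this for the same date and arrive at the same domain.
print(generate_domain(date(2018, 12, 4)))
```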
The attacker knows that at some point in the future they'll need to register one of those domains, and as the author of the algorithm they only need to register one to be successful. If a computer is infected and trying to reach back out to a C2 server, it calls back to that single registered domain, and that's all the attacker has to register. That's what we're going to try to detect.

I'll run through this fairly quickly; it's a quick DGA refresher, and I put a lot of effort into making sure these arrows were coordinated, so let's give it a shot. First, the user clicks the malicious website or domain. Then the malware is served back to the user. The user's computer is now infected, and the malware requires a seed to execute the DGA; however, the attacker already knows what the seed is beforehand, just like the date example I showed you. The malware fetches the time and a number, passes it to the domain generation algorithm, the algorithmically generated domains are produced, and DNS queries are sent off to the server. As defenders, this is the part we'll be familiar with: we look at those algorithmically generated domains and try to delineate whether something is good or bad. If I'm a bad guy, I'd probably generate 50,000 of them, because I only have to be successful once. The malware receives the IP address of the registered domain and starts the command-and-control channel; the attacker registered just one domain, he or she is successful, and has command and control. We're going to focus on the DGA domains in this section.

I used to think, especially when it comes to DGA, that I could just look for very long labels, and there are some nice rules that will get you pointed in the right direction, but I've found data science to be an extremely effective measure for identifying and breaking up my data. So we'll go through this scenario together: a piece of malware has infected a computer on your network, and it's making requests to domains using DGA in an attempt to communicate back to a C2 server.

There's a network analytic framework called flare. While I was going through this machine learning journey of trying to get better at data science, I found myself writing a lot of little snippets for enrichments: there's a Levenshtein formula in there, there are Markov models in there, and I found myself asking a lot of the same questions when building these models. So I put it all into this framework and released it as open source back to the community to help bridge the gap. Everything that follows is in a Jupyter notebook; if you saw my lightning talk earlier, we covered Jupyter notebooks a little bit, and that's what this output comes from.
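As one illustration of the kind of small enrichment helper involved (entropy comes back later as one of the features in the DGA decision tree), here is a generic Shannon entropy calculation for a domain label. This is a sketch of the concept, not flare's actual implementation.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of a string such as a domain label."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy("google"))            # roughly 1.9 -- dictionary-like
print(shannon_entropy("qmyjzzvxrkcfwdtn"))  # roughly 3.9 -- closer to random
```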
Here I'm importing, I know it's a little hard to see, but from flare's data science features I'm importing a DGA classifier; there's a classifier for DGA built right into flare. What's cool about it is that it uses random forests, and I'm going to take a little tactical pause here, because what I love about this talk is that, yes, I'm going to show you use cases and how you can use machine learning, but I also want to talk to the data science and the machine learning itself so we can better understand it as a community. Random forest is one variant of algorithm you can use for supervised machine learning. The classifier also uses something called n-grams, and of course it uses labeled data; there's plenty of labeled data out there for DGA. For the benign data it uses the Alexa top 1 million; I also recently added the Majestic Million, just because it's free, and the score went up a little bit. It also does domain TLD extraction, so if there's some random subdomain in front of google.com, it extracts out just the google.com portion for you and compares that against Alexa.

So what is an n-gram? You hear that term a lot in the community. An n-gram looks at the transitional properties of a sequence and calculates some kind of value from them. I have two examples here. The left-hand side shows an order of two, also known as a bigram, and you can see how it works: the sentence is "this is an example of how ngrams are generated", and with an order of two it takes the words two at a time, "this is", "is an", "an example", and so on. The same thing happens when you do it for three: "this is an", "is an example", and so forth. That shows how n-grams apply to words, but you can apply the same logic to the characters inside a word, start inspecting the transitional properties for what's normal in your environment, and assign a score to that.
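A minimal sketch of that sliding-window idea, using the sentence from the slide, might look like this (plain Python, not flare's code):

```python
def ngrams(tokens, n):
    """Sliding window of size n over a sequence of tokens (words or characters)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "this is an example of how ngrams are generated".split()
print(ngrams(sentence, 2))        # word bigrams:  ('this', 'is'), ('is', 'an'), ...
print(ngrams(sentence, 3))        # word trigrams: ('this', 'is', 'an'), ...

# The same logic applied to the characters of a domain label:
print(ngrams(list("google"), 2))  # ('g', 'o'), ('o', 'o'), ('o', 'g'), ('g', 'l'), ('l', 'e')
```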
For DGA-labeled data sources (I'd like to thank APT33 and APT3 here), there are all kinds of malicious sources out there, and the actual DGA algorithms are available. What that allows you to do as a defender is pull those DGA algorithms down and generate your own domains, and now you're building your labeled data set. For benign data, the English dictionary works well, along with the Alexa top 1 million, Cisco Umbrella, and the Majestic Million, and I'd even encourage you to put your own DNS domains in there. Someone brought up a good point earlier, though: if you're already infected, you'd be training your model with bad domains, so treat that as a starting point.

To get my data, I exported all of the DNS records for my personal server. Let me do a quick sanitization check; all right, that looks good. I went into Elasticsearch and exported all my DNS requests. Part of this mission is that we're trying to find that DGA domain, so one of the things we can do is strip out the top-level domains. Here's an example of applying flare's prediction method: on the left-hand side you have the DNS name, how many times it's present in the environment, the top-level domain stripped out, and then the DGA prediction. It has two outputs: it will either say this is DGA or this is legit. That's pretty nice as a defender, because now we're doing binary classification and we can say, just show me everything that's DGA. So I apply a filter on that column to show only the rows flagged as DGA.

That's still a lot of results: I started with, I think, 200,000 results, and I'm down to 240. But as a defender, 240 alerts is way too many for me to process in a day, and this is just one use case. So I imported Alexa, filtered out anything that appears in Alexa, effectively making that my whitelist, and that got me down to 78 results. And here we have the domain we were looking for, the one we generated a few slides earlier, sitting among the 57 or so results that were left. I included the false positives on purpose; I could have made this presentation look like it just caught the DGA cleanly, but in the real world you're going to have false positives. With machine learning, and especially when you get into n-grams, choosing between an order of two or three, or however far out you want to go, has an impact on how well your model does. I've noticed that with real words or shorter domains it has a harder time, so you'll get some false positives there. I could have applied a filter, reasoning that shorter domains cost more to register and these are .coms, and probably filtered those out, but I wanted to include them here to set expectations.

As far as our case goes, we pass that domain over to the analysts, isolate the infected host, and pull the wires; there's our domain, begin the endpoint investigation, case solved. Nice job, everyone. That's good for the analyst, but if you're a manager, you might ask: well, can you explain what happened? Sometimes in the machine learning world the answer is, "I used this machine learning model and it made a prediction for me; it led me in the right direction." But what if you could explain your model's thought process? That's where decision trees are really cool, because you can actually visualize the steps the model took and how it arrived at its conclusion. They're very easy to interpret, and I highly recommend them; any time you're doing regression or an ensemble, there's usually a decision tree variant you can use. The challenge is that decision trees become very difficult to read and render visually as you add more features; I'll break out what the features were in a second. This is a really good example of a basic decision tree. The dependent variable is "play": you want to know whether you should go out and play for the day. If it's sunny outside, the likelihood of me playing is two and the likelihood of me not playing is three, but then there are more qualifying features, like what the humidity is, whether it's greater than 70 percent or less than or equal to 70, and that's going to influence the decision of whether or not you go outside to play for the day.
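A minimal sketch of that play/weather example, with invented rows that mirror the slide, shows how such a tree is fit and how its learned splits can be printed out for a manager to read. scikit-learn is assumed here; the talk doesn't say which library produced the rendered DGA tree.

```python
# Toy "should I play today?" decision tree (made-up data).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "sunny":    [1, 1, 1, 0, 0, 0, 1, 0],        # 1 = sunny outlook
    "humidity": [65, 80, 90, 70, 75, 60, 50, 95],
    "play":     [1, 0, 0, 1, 1, 1, 1, 0],        # the decision we want to explain
})

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data[["sunny", "humidity"]], data["play"])

# The printed splits are the same idea as walking the rendered DGA tree:
# "entropy <= x", "Alexa n-gram score <= y", ..., class = dga / legit.
print(export_text(tree, feature_names=["sunny", "humidity"]))
```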
This was the result, visually, of the DGA prediction I just showed you. Now you can show that to your manager, and they might say, hey, can you make that a little bigger? So naturally you say "enhance", and then you get down to an area of the tree you still can't read, so you enhance again, and finally you get to the bottom part of the tree, where the model is making the final determination of whether or not something is DGA. The features we have are entropy and the Alexa n-grams. You can see the features on the left and the red boxes as the decision comes down the tree: the entropy is less than or equal to 2.4, move on to the next node, the Alexa n-gram score is compared against a value, and it keeps filtering further down. At the very bottom, I know it's a little hard to see, you can see how it assigns class DGA or class legit; it makes that classification based on those values, working its way down the tree. Pretty cool, huh?

So that's DGA; we're moving on to phishing. StreamingPhish is an application for detecting phishing, something that's near and dear to all of our hearts; I think it's the number one attack vector and arguably the most damaging. Wes Connell created the framework; he also does behavioral analytics over at PatternEx. The framework uses supervised machine learning to detect phishing domains, and it runs against the certificate transparency log network to validate the results. I actually have it running in the background right now. If you're not familiar with what a certificate transparency log is, I'll do a quick refresher in a second. You can download this for free at the URL. There's a service called CertStream that provides access to all newly registered certificates and turns them into a feed for you; those are certificates being issued under the various certificate authorities. You can run StreamingPhish against labeled data and it will output scores so you can see how well it performs, but you can also run it live in your network.

As a refresher on certificate transparency: in our current TLS/SSL system, a certificate authority issues a certificate for example.com, and when you visit example.com your client does the exchange for the SSL certificate; that's what we have today. Chrome has enforced certificate transparency, so if a domain is not compliant you'll now get that warning: hey, be careful, this could be a dangerous site. With certificate transparency, the certificate authority's submission goes to a log server, which generates a log response; essentially, a log entry is written into a certificate transparency server. That makes it available to you as a defender to go and validate: is this a legitimate certificate, did this certificate authority actually issue this certificate to this domain? Because if you're a bad guy and you've compromised a certificate authority, you can start generating your own certificates for Facebook or whoever owns that domain, and then man-in-the-middle attacks or phishing become a lot easier.

What I love about StreamingPhish is that it uses logistic regression, produces scores for you, and puts them into different categories: high, suspicious, or low. You can see in the example on the GitHub page that it labels something phishing or not based on the score range; the score is between zero and one, and the closer it is to one, the higher the likelihood of phishing. This is what it looks like; I ran this yesterday and built my own classifier in a few seconds. It has the certificate authority on the left-hand side, and then these are the domains being registered, pulled from CertStream, and some of them are pretty phishy. You can see scores like 0.995 and 0.990, and then there's a twitter.com look-alike with a lower score; those are things you can filter out right away.
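A rough sketch of that pipeline (not StreamingPhish's actual code): tap the CertStream feed with the certstream Python package and score each newly logged domain with a logistic function over a few toy features. The feature set, weights, and the 0.7 threshold are invented stand-ins for StreamingPhish's trained logistic regression classifier.

```python
# Rough sketch only -- not StreamingPhish's code.
import math
import certstream  # pip install certstream

SUSPICIOUS_WORDS = ["login", "verify", "account", "secure", "appleid", "paypal"]

def features(domain: str):
    return [
        sum(word in domain for word in SUSPICIOUS_WORDS),   # keyword hits
        domain.count("-"),                                  # hyphens
        domain.count("."),                                  # subdomain depth
        len(domain),                                        # overall length
    ]

# Stand-in for a trained logistic regression: score = sigmoid(w.x + b), between 0 and 1.
WEIGHTS, BIAS = [1.8, 0.6, 0.4, 0.02], -4.0

def score(domain: str) -> float:
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(domain)))
    return 1.0 / (1.0 + math.exp(-z))   # closer to 1 = more phishing-like

def callback(message, context):
    if message["message_type"] != "certificate_update":
        return
    for domain in message["data"]["leaf_cert"]["all_domains"]:
        s = score(domain.lstrip("*."))
        if s > 0.7:                      # arbitrary review threshold
            print(f"{s:.3f}  {domain}")

certstream.listen_for_events(callback, url="wss://certstream.calidog.io/")
```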
Moving into anomaly detection: there are a lot of ways to do anomaly detection, and this diagram is a good starting point for choosing an approach. First you have to think about your use case, and then whether you're dealing with univariate or multivariate data. If it's just a domain, for example, and that's the only thing you're making a determination on, that's univariate; if there are a lot of extra features, that's probably a good time to look at clustering of some kind. Between clustering and Markov chains, we're going to focus on Markov chains for domains, sorry, for user agents: we're going to identify anomalous user agents in our environment using Markov chains.

To get the data, I went into my Elasticsearch stack and exported all of the user agents using the quick value count that Kibana provides. But since we're analyzing transitions between characters, we want to prep the data, which gets into the data janitor role. This is a little hard to read, but what it shows is that back here I have one user agent, Mozilla/5.0, that's been seen in my environment 2,700 times. I wrote a quick Python script that exploded that out, so inside my user-agent strings text file I see that same user agent 2,700 times. That goes against what we're trained to do, which is to minimize, but when you're using a Markov model or anything that looks at transitions, you want the raw data, so the model can learn which character transitions happen more often in the environment.

To train on the data you can use flare; there's a Markov model class, and the number you pass is the order, which you can think of like the n-grams we covered earlier. We load the data from this file, this is what the file looks like, and then each transition receives a log-likelihood score. For example, any time I have the characters "mozi", the likelihood of the next character being "l" is 1, that is, 100 percent of the time, because this is Mozilla; in user-agent land there's a very high chance of that transition. Here are two user agents I have. What's really neat about the Markov model is that you can start simulating things once you've built and trained it: here's an actual user agent I pulled down, and once you run the simulate command, this is a generated user agent from the Markov model. It took all of those transitional properties and made its own user agent.

So how do we operationalize this? It's nice that we can make fake user agents all day long, but we need to assign some kind of score. There are two user agents here: the top one is the Google Chrome Mozilla user agent, which is legit. Can anyone tell me what the bottom one is? It's uncommon, for sure. Is it legit? Yeah, I heard it a few times: that's the start of Shellshock, with some other goodness tagged onto the back. So what does it look like when we assign a likelihood score? On a logarithmic scale, the closer the score is to zero, the more common the string is in your environment, whereas the further out it goes (I think this scale goes to 40), the more uncommon it is.
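A minimal, generic sketch of such a character-level Markov model (not flare's class) that can be trained on the exported user-agent file, score a string by how surprising its transitions are, and simulate new strings. The file name is approximated from the talk.

```python
import math
import random
from collections import Counter, defaultdict

class CharMarkov:
    """Character-level Markov model with a fixed-order context (like an n-gram)."""

    def __init__(self, order: int = 3):
        self.order = order
        self.counts = defaultdict(Counter)   # context -> Counter of next characters

    def train(self, lines):
        for line in lines:
            padded = "^" * self.order + line + "$"          # start/end markers
            for i in range(len(padded) - self.order):
                self.counts[padded[i:i + self.order]][padded[i + self.order]] += 1

    def score(self, text: str) -> float:
        """Average negative log-likelihood: near 0 = common, larger = more unusual."""
        padded = "^" * self.order + text + "$"
        total = 0.0
        for i in range(len(padded) - self.order):
            ctx, nxt = padded[i:i + self.order], padded[i + self.order]
            seen = self.counts.get(ctx)
            prob = seen[nxt] / sum(seen.values()) if seen and nxt in seen else 1e-6
            total += -math.log(prob)
        return total / (len(padded) - self.order)

    def simulate(self, max_len: int = 200) -> str:
        """Generate a new string from the learned transition probabilities."""
        ctx, out = "^" * self.order, []
        for _ in range(max_len):
            choices = self.counts.get(ctx)
            if not choices:
                break
            nxt = random.choices(list(choices), weights=choices.values())[0]
            if nxt == "$":
                break
            out.append(nxt)
            ctx = ctx[1:] + nxt
        return "".join(out)

# Usage: one raw user agent per line, repeated as often as it was seen.
model = CharMarkov(order=3)
model.train(line.strip() for line in open("user_agent_big_strings.txt"))
print(model.score("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))   # low: common transitions
print(model.score("() { :; }; /bin/bash -c 'cat /etc/passwd'"))   # high: shellshock-style
print(model.simulate())                                           # a brand-new "user agent"
```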
What you're doing is enriching the data and positioning your analysts so they don't have to do any machine learning themselves; they can just say, sort all of these scores and show me the least common user agents in my environment, in ascending order.

Some recommendations on how to operationalize a Markov model: you can use it for network detection, looking at fields like the subject name and issuer in TLS, user agents in HTTP, and DNS labels. I found some really cool stuff applying it to cipher suites inside the network, including some really old computers and outdated browsers, and the TLS common name is useful too. You can also apply it to hosts: things like file names, host names, and registry keys. Usually the host names assigned in your organization follow some kind of schema, so it's a great way to find rogue hosts. And of course command lines, though your mileage varies there; it was pretty easy to detect an IEX-style (Invoke-Expression) dropper in PowerShell this way, and that's the coolest thing I've found so far.

I'm ending with the questions for vendors. I didn't get into performance metrics, like how you know whether your model is doing well, but I wanted to add this to the presentation because I feel like that's where the industry is right now. This isn't an interrogation of the vendor, but these are questions they should probably be able to answer if they're making the machine learning claim, so that you as a consumer can go through them (these slides will be available) and ask: are you doing supervised or unsupervised learning? Are you doing classification or regression? Are you tracking true positive and false positive rates? How can they recommend their product if they're not tracking how successful it is? Fair questions; it's really about breaking down that communication barrier so we can start speaking the same language.

So I'm going to do a quick demo; I have three minutes left. I've been letting StreamingPhish run, and this is what it's found so far; let's see if anything else comes in. People are trying to register fake domain names all the time. Among the features it uses, it looks for association with well-known brands like Apple or Microsoft, and it uses the Levenshtein distance to determine how close a domain is to a well-known brand name, then makes a prediction on that. New things are rolling right in; you can see Instagram just got hit. These are actionable lists: imagine this running in your environment, you can just say, here are the phishing URLs or links we've seen today, take a look and see whether any were successful, whether any returned 200 OKs, or whether any of our employees actually went to one of these sites. It's a good way to start an investigation.
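A minimal sketch of that brand-similarity check: a classic dynamic-programming Levenshtein distance compared against a small, made-up brand list. StreamingPhish's real keyword list and scoring are more involved than this.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

BRANDS = ["apple", "microsoft", "instagram", "paypal", "twitter"]

def closest_brand(domain: str):
    """Compare each hyphen-separated chunk of the left-most label to known brands."""
    tokens = domain.split(".")[0].split("-")
    return min((levenshtein(t, b), b, t) for t in tokens for b in BRANDS)

for d in ["appleid-verify.com", "rnicrosoft-login.net", "example.org"]:
    dist, brand, token = closest_brand(d)
    print(f"{d:25s} '{token}' is distance {dist} from '{brand}'")
```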
That's the demo for phishing. For DGA, it's as easy as importing the DGA classifier; if you're not familiar with Python, that's okay, all of this will be available to you. I just want to show you how quick and easy it is to make a determination of whether something is DGA or legit. Once you have your model trained, you can start making predictions: I can do facebook.com, and that comes back as legit; as soon as I start adding some randomness to the domain, it outputs, yep, that's DGA. In this example we'll do google.com, which comes back legit, and you can just keep adding to the list, and that's what you get back. Imagine having that in your SIEM, tagging everything that comes in as either legit or DGA: now you've given your analysts a much smaller list of things to look at, instead of saying, hey, go to town and look at everything.

The last thing I want to share is the Markov chain example. I love this demo because you can simulate Shakespeare or Sherlock Holmes. I downloaded all of the Shakespeare sonnets and pointed the Markov model at them to see how well it does. I had to clean the sonnets up a little bit, and I'm using a trigram, an order of three. I train up the Markov model, print and simulate a thousand characters, and this is the machine generating sonnets; I could have taken the numbers out, but: "looks show say be disgrace whose a so flattered". Let me generate a new one: "thus love thy Brian a borough meant that end on child of day". You get the idea, and you can apply that same thing to user agents; that was the example I gave earlier, where I had my common user agent and then a Shellshock user agent, and I can just keep generating new user agents based on everything I've fed the model. That's why I really love Markov chains; if you're bored on a Friday night, go play around with Markov chains.

Here's the tool chart, where I break it all down. A big shout-out to Wes Connell for coming up with StreamingPhish; it's a big contribution to the community, and I don't think it's received a lot of attention on GitHub, but it works really well. If you're doing DGA, there's a random forest example with flare; for phishing, logistic regression with StreamingPhish; and for general anomaly detection you can use Markov chains, or you can download freq.py, which Eric Conrad covered earlier and which does something very similar. There are links at the bottom. In summary: we did a machine learning overview, covered how to apply machine learning to the given use cases, went over good questions to ask vendors, and did a quick demo. [Applause]
Info
Channel: SANS Institute
Views: 17,628
Rating: 4.956284 out of 5
Keywords: sans institute, information security, cyber security, cybersecurity, information security training, cybersecurity training, cyber security training, SANS Summits, Tactical Detections Summit, SIEM, Tactical Detection
Id: m2AgYbbXz8k
Length: 34min 53sec (2093 seconds)
Published: Mon Feb 04 2019